damccorm commented on code in PR #30085:
URL: https://github.com/apache/beam/pull/30085#discussion_r1463988525


##########
sdks/python/apache_beam/ml/transforms/base_test.py:
##########
@@ -443,6 +442,26 @@ def test_handler_on_multiple_columns(self):
           equal_to(expected_data),
       )
 
+  def test_handler_on_columns_not_exist_in_input_data(self):
+    data = [
+        {
+            'x': "Hello world", 'y': "Apache Beam"

Review Comment:
   One case I'm worried about is if you have:
   
   ```
   data = [
   {
      'x': 'foo', 'y': 'bar', 'z': 'baz'
   },
   {
      'x': 'foo2', 'y': 'bar2'
   }
   ]
   ```
   
   What happens in that case? I think we don't throw in the same place, though 
maybe we throw in 
https://github.com/apache/beam/blob/8b7b853ec9660fccfccca54e77a43fd37000959c/sdks/python/apache_beam/ml/transforms/base.py#L84
 - that's not a super informative error message for a user though (its not 
clear which record was missing the column, and the error probably doesn't 
really give them enough info).
   
   We might actually want to be a little more sophisticated in 
https://github.com/apache/beam/blob/8b7b853ec9660fccfccca54e77a43fd37000959c/sdks/python/apache_beam/ml/transforms/base.py#L71
 and:
   1) Add the first record to the dictionary
   2) For each subsequent record, before adding it to the dictionary check that 
it contains all the same fields and doesn't contain extra fields. Throw if it 
doesn't satisfy these conditions, else add it to the dictionary.
   
   Then we can still do the same error check that you added. That will allow us 
to throw an error that clearly calls out which records have different values 
present though.



##########
sdks/python/apache_beam/ml/transforms/base_test.py:
##########
@@ -443,6 +442,26 @@ def test_handler_on_multiple_columns(self):
           equal_to(expected_data),
       )
 
+  def test_handler_on_columns_not_exist_in_input_data(self):
+    data = [
+        {
+            'x': "Hello world", 'y': "Apache Beam"

Review Comment:
   > For each subsequent record, before adding it to the dictionary check that 
it contains all the same fields and doesn't contain extra fields. Throw if it 
doesn't satisfy these conditions, else add it to the dictionary.
   
   Alternately, we could have some sort of dummy value that we use as a 
placeholder for empty fields and only throw if we're performing the operation. 
That would probably be more work than we want to do right before the release 
cut, but the idea would basically be given data:
   
   ```
   data = [
   {
      'x': 'foo', 'y': 'bar', 'z': 'baz'
   },
   {
      'x': 'foo2', 'y': 'bar2'
   }
   ]
   ```
   
   `MLTransform(...).with_transform(['x', 'y'])` doesn't need to throw, it 
would return:
   
   ```
   {
      'x': fn('foo'), 'y': fn('bar'), 'z': 'baz'
   },
   {
      'x': fn('foo2'), 'y': fn('bar2')
   }
   ]
   ```
   
   but `MLTransform(...).with_transform(['x', 'y', 'z'])` would throw because 
it would want to apply `fn(record.z)` on the second record.
   
   My vote would be to just tighten the error checking as mentioned in my first 
comment, but its something we could consider.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to