damccorm commented on code in PR #30085:
URL: https://github.com/apache/beam/pull/30085#discussion_r1463988525
##########
sdks/python/apache_beam/ml/transforms/base_test.py:
##########
@@ -443,6 +442,26 @@ def test_handler_on_multiple_columns(self):
equal_to(expected_data),
)
+ def test_handler_on_columns_not_exist_in_input_data(self):
+ data = [
+ {
+ 'x': "Hello world", 'y': "Apache Beam"
Review Comment:
One case I'm worried about is if you have:
```
data = [
{
'x': 'foo', 'y': 'bar', 'z': 'baz'
},
{
'x': 'foo2', 'y': 'bar2'
}
]
```
What happens in that case? I think we don't throw in the same place, though
maybe we throw in
https://github.com/apache/beam/blob/8b7b853ec9660fccfccca54e77a43fd37000959c/sdks/python/apache_beam/ml/transforms/base.py#L84
- that's not a super informative error message for a user though (its not
clear which record was missing the column, and the error probably doesn't
really give them enough info).
We might actually want to be a little more sophisticated in
https://github.com/apache/beam/blob/8b7b853ec9660fccfccca54e77a43fd37000959c/sdks/python/apache_beam/ml/transforms/base.py#L71
and:
1) Add the first record to the dictionary
2) For each subsequent record, before adding it to the dictionary check that
it contains all the same fields and doesn't contain extra fields. Throw if it
doesn't satisfy these conditions, else add it to the dictionary.
Then we can still do the same error check that you added. That will allow us
to throw an error that clearly calls out which records have different values
present though.
##########
sdks/python/apache_beam/ml/transforms/base_test.py:
##########
@@ -443,6 +442,26 @@ def test_handler_on_multiple_columns(self):
equal_to(expected_data),
)
+ def test_handler_on_columns_not_exist_in_input_data(self):
+ data = [
+ {
+ 'x': "Hello world", 'y': "Apache Beam"
Review Comment:
> For each subsequent record, before adding it to the dictionary check that
it contains all the same fields and doesn't contain extra fields. Throw if it
doesn't satisfy these conditions, else add it to the dictionary.
Alternately, we could have some sort of dummy value that we use as a
placeholder for empty fields and only throw if we're performing the operation.
That would probably be more work than we want to do right before the release
cut, but the idea would basically be given data:
```
data = [
{
'x': 'foo', 'y': 'bar', 'z': 'baz'
},
{
'x': 'foo2', 'y': 'bar2'
}
]
```
`MLTransform(...).with_transform(['x', 'y'])` doesn't need to throw, it
would return:
```
{
'x': fn('foo'), 'y': fn('bar'), 'z': 'baz'
},
{
'x': fn('foo2'), 'y': fn('bar2')
}
]
```
but `MLTransform(...).with_transform(['x', 'y', 'z'])` would throw because
it would want to apply `fn(record.z)` on the second record.
My vote would be to just tighten the error checking as mentioned in my first
comment, but its something we could consider.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]