TheNeuralBit commented on a change in pull request #14853:
URL: https://github.com/apache/beam/pull/14853#discussion_r647858103



##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -122,7 +123,12 @@ def _run_test(self, func, *args, distributed=True, nonparallel=False):
             generated twice, once outside of an allow_non_parallel_operations
             block (to verify NonParallelOperation is raised), and again inside
             of an allow_non_parallel_operations block to actually generate an
-            expression to verify."""
+            expression to verify.
+        check_proxy (bool): Whether or not to check that the proxy of the
+            generated expression matches the actual result, defaults to True.

Review comment:
       So proxies are used for tracking data types at pipeline construction 
time. We generate them for every expression in the DataFrame expression tree. 
Sometimes we construct them manually, but usually we "compute" them by calling 
the expression's function with an empty input: the proxies from the input 
expressions. This is nice because it leverages pandas' input validation, e.g. 
pandas will raise an error in proxy generation if the user tries to get the 
mean() of a non-numeric column, or tries to groupby() a column that doesn't 
exist.
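
   A rough sketch of that idea in plain pandas, outside of Beam. The 
column names and the variable `proxy` are made up for illustration, and this 
is not how Beam builds its proxies internally; it only shows that calling an 
operation on an empty, typed DataFrame yields an empty result and still 
triggers pandas' validation:

```python
import pandas as pd

# Hypothetical proxy: an empty DataFrame that carries only column names and dtypes.
proxy = pd.DataFrame({
    "name": pd.Series([], dtype="object"),
    "score": pd.Series([], dtype="float64"),
})

# "Computing" a proxy: apply the expression's function to the empty input.
# The result is itself empty and can serve as the proxy of the new expression.
result_proxy = proxy.groupby("name")["score"].mean()

# pandas still validates the call even though there is no data:
# grouping by a column that doesn't exist raises a KeyError.
try:
    proxy.groupby("missing")["score"].mean()
    error = None
except KeyError as exc:
    error = exc
```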
   
   So the impact of having a mismatched proxy is that we may not properly 
validate the expression. We could allow an expression that will fail at 
execution time, or disallow an expression that would have worked at execution 
time.
   
   We also use the proxies for converting back to Beam types in 
`to_pcollection`. So a mismatched proxy there would give the output 
PCollection's schema the wrong types.
   
   I think these mismatches are mostly just due to having the wrong numeric 
type, so they should be harmless.
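
   For illustration, a numeric-type mismatch of that kind could be detected 
with something like the sketch below. The helper name `dtype_mismatches` and 
the example data are hypothetical, and this is not the check that `_run_test` 
actually performs; it just compares the dtypes a proxy promises against the 
dtypes of a real result:

```python
import pandas as pd


def dtype_mismatches(proxy: pd.DataFrame, actual: pd.DataFrame):
    """Hypothetical helper: list (column, proxy dtype, actual dtype) for
    every shared column whose dtype differs between proxy and result."""
    return [
        (col, str(proxy[col].dtype), str(actual[col].dtype))
        for col in proxy.columns
        if col in actual.columns and proxy[col].dtype != actual[col].dtype
    ]


# The proxy promises an integer column, but the computed result is float.
proxy = pd.DataFrame({"count": pd.Series([], dtype="int64")})
actual = pd.DataFrame({"count": [1.0, 2.5]})

mismatches = dtype_mismatches(proxy, actual)
```

   Here `mismatches` contains a single entry reporting `int64` in the proxy 
versus `float64` in the result, which is the sort of difference that would 
end up in the output schema.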




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

