[GitHub] [beam] pcoet commented on a change in pull request #15074: [BEAM-11951] added "Differences from Pandas" page for DataFrame

GitBox Thu, 24 Jun 2021 19:44:09 -0700


pcoet commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658434588




##########
File path: website/www/site/content/en/documentation/dsls/dataframes/overview.md
##########
@@ -112,22 +112,3 @@ pc1, pc2 = {'a': pc} | DataframeTransform(lambda a: expr1, expr2)
 
 {...} = {a: pc} | DataframeTransform(lambda a: {...})
 {{< /highlight >}}
-
-## Differences from standard Pandas {#differences_from_standard_pandas}
-
-Beam DataFrames are deferred, like the rest of the Beam API. As a result, there are some limitations on what you can do with Beam DataFrames, compared to the standard Pandas implementation:
-
-* Because all operations are deferred, the result of a given operation may not be available for control flow. For example, you can compute a sum, but you can't branch on the result.
-* Result columns must be computable without access to the data. For example, you can’t use [transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html).
-* PCollections in Beam are inherently unordered, so Pandas operations that are sensitive to the ordering of rows are unsupported. For example, order-sensitive operations such as [shift](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html), [cummax](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummax.html), [cummin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cummin.html), [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html), and [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) are not supported.
-
-With Beam DataFrames, computation doesn’t take place until the pipeline runs. Before that, only the shape or schema of the result is known, meaning that you can work with the names and types of the columns, but not the result data itself.
-
-There are a few common exceptions you may see when attempting to use certain Pandas operations:
-
-* **WontImplementError**: Indicates that this operation or argument isn’t supported because it’s incompatible with the Beam model. The largest class of operations that raise this error are order-sensitive operations.
-* **NotImplementedError**: Indicates this is an operation or argument that hasn’t been implemented yet. Many Pandas operations are already available through Beam DataFrames, but there’s still a long tail of unimplemented operations.
-* **NonParallelOperation**: Indicates that you’re attempting a non-parallel operation outside of an `allow_non_parallel_operations` block. Some operations don't lend themselves to parallel computation. They can still be used, but must be guarded in a `with beam.dataframe.allow_non_parallel_operations(True)` block.
-
-[pydoc_dataframe_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.transforms.html#apache_beam.dataframe.transforms.DataframeTransform
-[pydoc_sql_transform]: https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform

Review comment:
       Ugh. Thanks for catching!



