[jira] [Work logged] (BEAM-11951) Add documentation page highlighting differences from standard pandas

ASF GitHub Bot (Jira) Thu, 24 Jun 2021 10:15:05 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-11951?focusedWorklogId=614628&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-614628
 ]


ASF GitHub Bot logged work on BEAM-11951:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Jun/21 17:14
            Start Date: 24/Jun/21 17:14
    Worklog Time Spent: 10m 
      Work Description: pcoet commented on a change in pull request #15074:
URL: https://github.com/apache/beam/pull/15074#discussion_r658137020



##########
File path: 
website/www/site/content/en/documentation/dsls/dataframes/differences-from-pandas.md
##########
@@ -0,0 +1,85 @@
+---
+type: languages
+title: "Differences from Pandas"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Differences from Pandas
+
+The Apache Beam DataFrame API aims to be a drop-in replacement for Pandas 
DataFrame, but there are a few differences to be aware of. The Beam DataFrame 
API is adapted for deferred processing, and Beam doesn’t implement all of the 
Pandas DataFrame operations.
+
+This page describes divergences between the Beam and Pandas APIs and provides 
tips for working with the Beam DataFrame API.
+
+## Working with Pandas sources
+
+Beam operations are always associated with a pipeline. To read source data 
into a Beam DataFrame, you have to apply the source to a pipeline object. For 
example, to read input from a CSV file, you could use 
[read_csv](https://beam.apache.org/releases/pydoc/{{< param release_latest 
>}}/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv) as follows:
+
+    df = p | beam.dataframe.io.read_csv(...)
+
+This is similar to Pandas 
[read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html),
 but `df` is a deferred Beam DataFrame representing the contents of the file. 
The input filename can be any file pattern understood by 
[fileio.MatchFiles](https://beam.apache.org/releases/pydoc/{{< param 
release_latest >}}/apache_beam.io.fileio.html#apache_beam.io.fileio.MatchFiles).
+
+For an example of using sources and sinks with the DataFrame API, see 
[taxiride.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/dataframe/taxiride.py).
+
+## Non-parallelizable operations
+
+To support distributed processing, Beam invokes DataFrame operations on 
subsets of data in parallel. Some DataFrame operations can’t be parallelized, 
and these operations raise a 
[NonParallelOperation](https://beam.apache.org/releases/pydoc/{{< param 
release_latest 
>}}/apache_beam.dataframe.expressions.html#apache_beam.dataframe.expressions.NonParallelOperation)
 error by default.
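
The difference between work that can run on subsets of the data and work that needs the whole dataset can be sketched in plain Python. This is only an illustration of the general idea, using the standard library and made-up data; it doesn't use Beam and makes no claim about which specific Beam DataFrame operations raise the error.

```python
import statistics

# Made-up data, split into two partitions as a distributed runner might do.
data = [1, 2, 3, 4, 100, 200, 300]
partitions = [data[:4], data[4:]]

# Element-wise work: each partition can be processed on its own, and the
# concatenated results match processing the whole dataset at once.
doubled = [x * 2 for part in partitions for x in part]

# A global statistic: naively combining per-partition results does not give
# the answer for the whole dataset, so all rows would have to be gathered
# in one place first.
true_median = statistics.median(data)
median_of_medians = statistics.median(
    statistics.median(part) for part in partitions)
```

Here `doubled` equals the result of doubling `data` directly, while `median_of_medians` differs from `true_median`.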
+
+**Workaround**
+
+If you want to use a non-parallelizable operation, you have to wrap it in a 
`with beam.dataframe.allow_non_parallel_operations(True):` block. Note that this 
collects the entire input dataset on a single node, so there’s a risk of 
running out of memory. You should only use this workaround if you’re sure that 
the input is small enough to process on a single worker.
+
+## Operations that produce non-deferred columns
+
+Beam DataFrame operations are deferred, but the schemas of the resulting 
DataFrames are not, meaning that result columns must be computable without 
access to the data. Some DataFrame operations can’t support this usage, so they 
can’t be implemented. These operations raise a 
[WontImplementError](https://beam.apache.org/releases/pydoc/{{< param 
release_latest 
>}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
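
A toy sketch of what "computable without access to the data" means. The two functions below are hypothetical helpers written for this illustration; they are not Beam or Pandas APIs. The first derives output column names from the input schema alone, and the second (in the style of one-hot encoding) derives them from the values in the data.

```python
def scaled_columns(input_columns):
    # Output column names depend only on the input schema, so they can be
    # determined before any data has been read.
    return [f"{name}_scaled" for name in input_columns]


def one_hot_columns(values):
    # Output column names depend on the distinct values in the data, so
    # they cannot be determined until the data has been read.
    return sorted({f"color_{value}" for value in values})


schema_known_up_front = scaled_columns(["width", "height"])

# The same operation yields different schemas for different inputs.
columns_run_a = one_hot_columns(["red", "blue"])
columns_run_b = one_hot_columns(["red", "green", "blue"])
```

Only the first kind of operation fits a model where the schema has to be fixed when the pipeline is constructed.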
+
+Currently there’s no workaround for this issue. In the future, the Beam 
DataFrame API may support non-deferred column operations on categorical columns. 
This work is being tracked in 
[BEAM-12169](https://issues.apache.org/jira/browse/BEAM-12169).
+
+## Operations that produce non-deferred values or plots
+
+Because Beam operations are deferred, it’s infeasible to implement DataFrame 
APIs that produce non-deferred values or plots. If invoked, these operations 
raise a [WontImplementError](https://beam.apache.org/releases/pydoc/{{< param 
release_latest 
>}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< 
param release_latest 
>}}/apache_beam.runners.interactive.interactive_beam.html), you can use 
`collect` to bring a dataset into local memory and then perform these 
operations.
+
+You can also use [to_pcollection](https://beam.apache.org/releases/pydoc/{{< 
param release_latest 
>}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection)
 to convert a deferred DataFrame to a PCollection, and you can use 
[to_dataframe](https://beam.apache.org/releases/pydoc/{{< param release_latest 
>}}/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe)
 to convert a PCollection to a deferred DataFrame. These methods provide 
additional flexibility in working around operations that aren’t implemented.
+
+## Order-sensitive operations
+
+Beam PCollections are inherently unordered, so Pandas operations that are 
sensitive to the ordering of rows are not supported. These operations raise a 
[WontImplementError](https://beam.apache.org/releases/pydoc/{{< param 
release_latest 
>}}/apache_beam.dataframe.frame_base.html#apache_beam.dataframe.frame_base.WontImplementError).
+
+Order-sensitive operations may be supported in the future. To track progress 
on this issue, follow 
[BEAM-12129](https://issues.apache.org/jira/browse/BEAM-12129). You can also 
[contact us](https://beam.apache.org/community/contact-us/) to let us know we 
should prioritize this work.
+
+**Workaround**
+
+If you’re using [Interactive Beam](https://beam.apache.org/releases/pydoc/{{< 
param release_latest 
>}}/apache_beam.runners.interactive.interactive_beam.html), you can use 
`collect` to bring a dataset into local memory and then perform these 
operations.
+
+Alternatively, there may be ways to rewrite your code so that it’s not order 
sensitive. For example, Pandas users often call the order-sensitive 
[head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) 
operation to peek at data, but if you just want to view a subset of elements, 
you can also use `sample`, which doesn’t require you to collect the data first. 
Similarly, you could use `nlargest` instead of `sort_values(...).head`.
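
One way to see why an operation like `nlargest` suits an unordered, partitioned dataset is that the top values can be found per partition and then merged. The sketch below uses only Python's standard `heapq` module with made-up partitions; `top_n_distributed` is a hypothetical helper, and this is not how Beam implements the operation.

```python
import heapq


def top_n_distributed(partitions, n):
    # Each partition finds its own n largest values independently.
    partial = [heapq.nlargest(n, part) for part in partitions]
    # Merging the per-partition candidates and taking the n largest again
    # gives the overall result.
    return heapq.nlargest(n, (value for part in partial for value in part))


partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
top_three = top_n_distributed(partitions, 3)

# The result does not depend on the order in which partitions arrive.
reordered = top_n_distributed(list(reversed(partitions)), 3)
```

Both calls return `[9, 8, 7]`, whereas the first three rows of the data depend entirely on how the rows happen to be ordered.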
+
+## Operations that produce deferred scalars
+
+Some DataFrame operations produce deferred scalars. In Beam, actual 
computation of the values is deferred, so the values are not available for 
control flow. For example, you can compute a sum with `Series.sum`, but you 
can’t immediately branch on the result, because the result data is not 
immediately available. `Series.is_unique` is a similar example. Using a 
deferred scalar for branching logic or truth tests raises a 
[TypeError](https://github.com/apache/beam/blob/b908f595101ff4f21439f5432514005394163570/sdks/python/apache_beam/dataframe/frame_base.py#L117).
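
The behavior can be sketched with a toy class. `DeferredScalar` below is a hypothetical stand-in written for this illustration, not Beam's implementation; it only mimics the idea that a truth test on a not-yet-computed value raises a `TypeError`.

```python
class DeferredScalar:
    """Toy stand-in for a value that is known only after the pipeline runs."""

    def __init__(self, compute):
        # Store the computation instead of running it.
        self._compute = compute

    def __bool__(self):
        # No value exists while the pipeline is being constructed, so a
        # truth test cannot be answered.
        raise TypeError(
            "Testing the truth value of a deferred scalar is not allowed.")

    def evaluate(self):
        # Stand-in for actually running the pipeline.
        return self._compute()


total = DeferredScalar(lambda: sum([1, 2, 3]))

try:
    if total:  # Branching on a deferred value.
        pass
    branch_error = None
except TypeError as err:
    branch_error = err

# The concrete value is available only after evaluation.
evaluated_total = total.evaluate()
```

In this sketch the `if` statement fails with a `TypeError`, and the sum becomes available only through the explicit `evaluate` call.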
+
+## Operations that aren’t implemented yet
+
+The Beam DataFrame API implements many of the commonly used Pandas DataFrame 
operations, and we’re actively working to support the remaining operations. But 
Pandas has a large API, and there are still gaps 
([BEAM-9547](https://issues.apache.org/jira/browse/BEAM-9547)). If you invoke 
an operation that hasn’t been implemented yet, it will raise a 
`NotImplementedError`. Please [let us 
know](https://beam.apache.org/community/contact-us/) if you encounter a missing 
operation that you think should be prioritized.
+
+## Using Interactive Beam to work with deferred or unordered values
+
+Some Pandas DataFrame operations can’t be implemented in Beam because they 
produce non-deferred values that are incompatible with the Beam programming model. 
Other operations with deferred results are implemented, but the results aren’t 
available for control flow in the pipeline. A third class of operations can’t 
be implemented because they’re order sensitive, and Beam PCollections are 
unordered. For all these cases, [Interactive 
Beam](https://beam.apache.org/releases/pydoc/{{< param release_latest 
>}}/apache_beam.runners.interactive.interactive_beam.html) can provide 
workarounds.
+
+Interactive Beam is a module designed for use in interactive notebooks. The 
module, which by convention is imported as `ib`, provides an `ib.collect` 
operation that brings a dataset into local memory and makes it available for 
DataFrame operations that are order-sensitive or can’t be deferred.
+

Review comment:
       Good suggestions. Thanks! I integrated your changes, made a few other 
minor tweaks, and pushed another commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 614628)
    Time Spent: 20m  (was: 10m)

> Add documentation page highlighting differences from standard pandas
> --------------------------------------------------------------------
>
>                 Key: BEAM-11951
>                 URL: https://issues.apache.org/jira/browse/BEAM-11951
>             Project: Beam
>          Issue Type: Task
>          Components: dsl-dataframe, website
>            Reporter: Brian Hulette
>            Assignee: David Huntsperger
>            Priority: P2
>              Labels: dataframe-api
>          Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
