Brian Hulette created BEAM-12502:
------------------------------------
Summary: ib.collect fails to materialize named DeferredDataFrame instances
Key: BEAM-12502
URL: https://issues.apache.org/jira/browse/BEAM-12502
Project: Beam
Issue Type: Bug
Components: sdk-py-core
Affects Versions: 2.30.0
Reporter: Brian Hulette
Assignee: Sam Rohde
Fix For: 2.31.0
In the example below, {{ib.collect(deferred_df)}} returns an empty DataFrame
when the DeferredDataFrame has first been assigned to a variable, while the
inline call {{ib.collect(to_dataframe(rows))}} works as expected.
{code}
In [1]: import numpy as np
...: import pandas as pd
...:
...: import apache_beam as beam
...: from apache_beam import Create, Map
...: from apache_beam.dataframe.convert import to_dataframe
...: from apache_beam.dataframe.convert import to_pcollection
...: from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
...: import apache_beam.runners.interactive.interactive_beam as ib
In [2]: birds = [
...:     {
...:         "name": "American crow",
...:         "scientific_name": "Corvus brachyrhynchos",
...:         "order": "Passeriformes",
...:         "family": "Corvidae"
...:     },
...:     {
...:         "name": "Canada goose",
...:         "scientific_name": "Branta canadensis",
...:         "order": "Anseriformes",
...:         "family": "Anatidae"
...:     },
...:     {
...:         "name": "mallard",
...:         "scientific_name": "Anas platyrhynchos",
...:         "order": "Anseriformes",
...:         "family": "Anatidae"
...:     }
...: ]
In [3]: # create an interactive pipeline
...: p = beam.Pipeline(InteractiveRunner())
...:
...:
...: # create some pipeline data and map it to rows
...: rows = (p | "Create elements" >> Create(birds)
...:           | "To rows" >> Map(lambda bird: beam.Row(
...:               common_name=bird["name"],
...:               scientific_name=bird["scientific_name"],
...:               order=bird["order"],
...:               family=bird["family"])))
WARNING:apache_beam.runners.interactive.interactive_environment:You have limited Interactive Beam features since your ipython kernel is not connected to any notebook frontend.
In [4]: ib.collect(rows)
'Processing...'
'Done.'
Out[4]:
     common_name        scientific_name          order    family
0  American crow  Corvus brachyrhynchos  Passeriformes  Corvidae
1   Canada goose      Branta canadensis   Anseriformes  Anatidae
2        mallard     Anas platyrhynchos   Anseriformes  Anatidae
In [5]: ib.collect(to_dataframe(rows))
'Processing...'
'Done.'
Out[5]:
     common_name        scientific_name          order    family
0  American crow  Corvus brachyrhynchos  Passeriformes  Corvidae
1   Canada goose      Branta canadensis   Anseriformes  Anatidae
2        mallard     Anas platyrhynchos   Anseriformes  Anatidae
In [6]: deferred_df = to_dataframe(rows)
In [7]: ib.collect(deferred_df)
'Processing...'
'Done.'
Out[7]:
Empty DataFrame
Columns: []
Index: []
{code}
