[ https://issues.apache.org/jira/browse/BEAM-12502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Hulette updated BEAM-12502:
---------------------------------
Affects Version/s: 2.29.0
> ib.collect fails to materialize named DeferredDataFrame instances
> -----------------------------------------------------------------
>
> Key: BEAM-12502
> URL: https://issues.apache.org/jira/browse/BEAM-12502
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.29.0, 2.30.0
> Reporter: Brian Hulette
> Assignee: Sam Rohde
> Priority: P2
> Labels: dataframe-api
> Fix For: 2.31.0
>
>
> In the example below, {{ib.collect(deferred_df)}} returns an empty DataFrame
> when {{deferred_df}} is a variable bound to {{to_dataframe(rows)}}, but passing
> {{to_dataframe(rows)}} to {{ib.collect}} inline works as expected.
> {code}
> In [1]: import numpy as np
>    ...: import pandas as pd
>    ...:
>    ...: import apache_beam as beam
>    ...: from apache_beam import Create, Map
>    ...: from apache_beam.dataframe.convert import to_dataframe
>    ...: from apache_beam.dataframe.convert import to_pcollection
>    ...: from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
>    ...: import apache_beam.runners.interactive.interactive_beam as ib
>
> In [2]: birds = [
>    ...:     {
>    ...:         "name": "American crow",
>    ...:         "scientific_name": "Corvus brachyrhynchos",
>    ...:         "order": "Passeriformes",
>    ...:         "family": "Corvidae"
>    ...:     },
>    ...:     {
>    ...:         "name": "Canada goose",
>    ...:         "scientific_name": "Branta canadensis",
>    ...:         "order": "Anseriformes",
>    ...:         "family": "Anatidae"
>    ...:     },
>    ...:     {
>    ...:         "name": "mallard",
>    ...:         "scientific_name": "Anas platyrhynchos",
>    ...:         "order": "Anseriformes",
>    ...:         "family": "Anatidae"
>    ...:     }
>    ...: ]
>
> In [3]: # create an interactive pipeline
>    ...: p = beam.Pipeline(InteractiveRunner())
>    ...:
>    ...:
>    ...: # create some pipeline data and map it to rows
>    ...: rows = (p | "Create elements" >> Create(birds)
>    ...:           | "To rows" >> Map(lambda bird: beam.Row(
>    ...:               common_name=bird["name"],
>    ...:               scientific_name=bird["scientific_name"],
>    ...:               order=bird["order"],
>    ...:               family=bird["family"])))
> WARNING:apache_beam.runners.interactive.interactive_environment:You have limited Interactive Beam features since your ipython kernel is not connected to any notebook frontend.
>
> In [4]: ib.collect(rows)
> 'Processing...'
> 'Done.'
> Out[4]:
>      common_name        scientific_name          order    family
> 0  American crow  Corvus brachyrhynchos  Passeriformes  Corvidae
> 1   Canada goose      Branta canadensis   Anseriformes  Anatidae
> 2        mallard     Anas platyrhynchos   Anseriformes  Anatidae
>
> In [5]: ib.collect(to_dataframe(rows))
> 'Processing...'
> 'Done.'
> Out[5]:
>      common_name        scientific_name          order    family
> 0  American crow  Corvus brachyrhynchos  Passeriformes  Corvidae
> 1   Canada goose      Branta canadensis   Anseriformes  Anatidae
> 2        mallard     Anas platyrhynchos   Anseriformes  Anatidae
>
> In [6]: deferred_df = to_dataframe(rows)
>
> In [7]: ib.collect(deferred_df)
> 'Processing...'
> 'Done.'
> Out[7]:
> Empty DataFrame
> Columns: []
> Index: []
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
