Np, thanks for quickly identifying the fix. Btw, I am very happy that Beam Python supports the same pandas DataFrame API. It's super user-friendly for both devs and data scientists. Really cool work!

Thanks,
Xinyu

On Thu, Mar 25, 2021 at 4:53 PM Robert Bradshaw <rober...@google.com> wrote:

> Thanks, Xinyu, for finding this!
>
> On Thu, Mar 25, 2021 at 4:48 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Cloned to https://issues.apache.org/jira/browse/BEAM-12056
>>
>> On Thu, Mar 25, 2021 at 4:46 PM Brian Hulette <bhule...@google.com>
>> wrote:
>>
>>> Yes this looks like https://issues.apache.org/jira/browse/BEAM-11929, I
>>> removed it from the release blockers since there is a workaround (use a
>>> NamedTuple type), but it's probably worth cherrypicking the fix.
>>>
>>> On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> This could be https://issues.apache.org/jira/browse/BEAM-11929
>>>>
>>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> This is definitely wrong. Looking into what's going on here, but this
>>>>> seems severe enough to be a blocker for the next release.
>>>>>
>>>>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <xinyuliu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, folks,
>>>>>>
>>>>>> I am playing around with the Python Dataframe API, and seemingly got a
>>>>>> schema issue when converting a pcollection to a dataframe. I wrote the
>>>>>> following code for a simple test:
>>>>>>
>>>>>> import apache_beam as beam
>>>>>> from apache_beam.dataframe.convert import to_dataframe
>>>>>> from apache_beam.dataframe.convert import to_pcollection
>>>>>>
>>>>>> p = beam.Pipeline()
>>>>>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(
>>>>>>     lambda x: beam.Row(word=x[0], val=x[1]))
>>>>>> _ = data | beam.Map(print)
>>>>>> p.run()
>>>>>>
>>>>>> This shows the following:
>>>>>> Row(val='1111', word='a')
>>>>>> Row(val='2222', word='b')
>>>>>>
>>>>>> But if I use to_dataframe() to convert it into a df, it seems the schema
>>>>>> was reversed:
>>>>>>
>>>>>> df = to_dataframe(data)
>>>>>> dataCopy = to_pcollection(df)
>>>>>> _ = dataCopy | beam.Map(print)
>>>>>> p.run()
>>>>>>
>>>>>> I got:
>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>>>>>
>>>>>> It seems the columns 'word' and 'val' are now swapped. The problem seems
>>>>>> to happen during to_dataframe(). If I print out df['word'], I get '1111'
>>>>>> and '2222'. I am not sure whether I am doing something wrong or there is
>>>>>> an issue in the schema conversion. Could someone help me take a look?
>>>>>>
>>>>>> Thanks,
>>>>>> Xinyu
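
For readers hitting the same problem, the workaround mentioned in the thread (use a NamedTuple type instead of beam.Row) might look roughly like the sketch below. This is only an illustration, not the code from the fix: the names WordVal and run_round_trip are hypothetical, and the explicit RowCoder registration is an assumption about how the NamedTuple is best associated with a schema. The thread does not show the output of this variant, so none is claimed here.

```python
import typing


class WordVal(typing.NamedTuple):
    # Hypothetical schema type; the field order is fixed by declaration
    # order (word first, then val).
    word: str
    val: str


def run_round_trip():
    # Hypothetical helper. Beam is imported lazily so that the NamedTuple
    # definition above can be inspected without Beam installed.
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe
    from apache_beam.dataframe.convert import to_pcollection

    # Assumption: register the NamedTuple with Beam's schema-aware RowCoder.
    beam.coders.registry.register_coder(WordVal, beam.coders.RowCoder)

    with beam.Pipeline() as p:
        data = (
            p
            | beam.Create([('a', '1111'), ('b', '2222')])
            # Declare the output type so the PCollection carries a schema.
            | beam.Map(lambda x: WordVal(word=x[0], val=x[1]))
            .with_output_types(WordVal))
        df = to_dataframe(data)
        # Convert back and print each element, as in the original test.
        _ = to_pcollection(df) | beam.Map(print)
```

Calling run_round_trip() requires apache_beam with the DataFrame extras installed. The point of the sketch is only that a NamedTuple declares its fields in an explicit order, which is the property the workaround relies on.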