Re: Python Dataframe API issue

Xinyu Liu Thu, 25 Mar 2021 18:04:28 -0700

Np, thanks for quickly identifying the fix.

Btw, I am very happy that Beam Python supports the same API as Pandas
dataframes. It's super user-friendly to both devs and data scientists. Really
cool work!


Thanks,
Xinyu

On Thu, Mar 25, 2021 at 4:53 PM Robert Bradshaw <rober...@google.com> wrote:

> Thanks, Xinyu, for finding this!
>
> On Thu, Mar 25, 2021 at 4:48 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Cloned to https://issues.apache.org/jira/browse/BEAM-12056
>>
>> On Thu, Mar 25, 2021 at 4:46 PM Brian Hulette <bhule...@google.com>
>> wrote:
>>
>>> Yes, this looks like https://issues.apache.org/jira/browse/BEAM-11929. I
>>> removed it from the release blockers since there is a workaround (use a
>>> NamedTuple type), but it's probably worth cherry-picking the fix.
>>>
>>> On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> This could be https://issues.apache.org/jira/browse/BEAM-11929
>>>>
>>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> This is definitely wrong. Looking into what's going on here, but this
>>>>> seems severe enough to be a blocker for the next release.
>>>>>
>>>>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <xinyuliu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, folks,
>>>>>>
>>>>>> I am playing around with the Python Dataframe API, and seemingly hit a
>>>>>> schema issue when converting a PCollection to a dataframe. I wrote the
>>>>>> following code for a simple test:
>>>>>>
>>>>>> import apache_beam as beam
>>>>>> from apache_beam.dataframe.convert import to_dataframe
>>>>>> from apache_beam.dataframe.convert import to_pcollection
>>>>>>
>>>>>> p = beam.Pipeline()
>>>>>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(
>>>>>>     lambda x: beam.Row(word=x[0], val=x[1]))
>>>>>> _ = data | beam.Map(print)
>>>>>> p.run()
>>>>>>
>>>>>> This shows the following:
>>>>>> Row(val='1111', word='a')
>>>>>> Row(val='2222', word='b')
>>>>>>
>>>>>> But if I use to_dataframe() to convert it into a df, the schema seems
>>>>>> to be reversed:
>>>>>>
>>>>>> df = to_dataframe(data)
>>>>>> dataCopy = to_pcollection(df)
>>>>>> _ = dataCopy | beam.Map(print)
>>>>>> p.run()
>>>>>>
>>>>>> I got:
>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>>>>>
>>>>>> It seems the columns 'word' and 'val' are now swapped. The problem seems
>>>>>> to happen during to_dataframe(): if I print out df['word'], I get '1111'
>>>>>> and '2222'. I am not sure whether I am doing something wrong or there is
>>>>>> an issue in the schema conversion. Could someone help me take a look?
>>>>>>
>>>>>> Thanks, Xinyu
>>>>>>
>>>>>
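
For readers who land on this thread: the workaround mentioned above (use a NamedTuple type instead of beam.Row) might look roughly like the sketch below. The names WordVal, to_word_val and build_pipeline are hypothetical and not from the thread, and the pipeline has not been checked against a specific Beam release, so treat it as an illustration rather than the confirmed fix.

```python
import typing


class WordVal(typing.NamedTuple):
    """Hypothetical schema type; field order is fixed by the declaration."""
    word: str
    val: str


def to_word_val(pair):
    # Build each element by keyword so every value is bound to its field name.
    return WordVal(word=pair[0], val=pair[1])


def build_pipeline():
    # Beam is imported lazily so the schema type above is usable without it.
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe, to_pcollection

    p = beam.Pipeline()
    data = (
        p
        | beam.Create([('a', '1111'), ('b', '2222')])
        # Declare the element type so Beam can derive the schema from WordVal.
        | beam.Map(to_word_val).with_output_types(WordVal))
    df = to_dataframe(data)
    _ = to_pcollection(df) | beam.Map(print)
    return p
```

Calling build_pipeline().run() should then print the round-tripped elements. If the workaround behaves as described in the thread, 'word' would hold 'a' and 'b' rather than '1111' and '2222'.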
