Re: Python Dataframe API issue

Robert Bradshaw Thu, 25 Mar 2021 16:53:28 -0700

Thanks, Xinyu, for finding this!

On Thu, Mar 25, 2021 at 4:48 PM Kenneth Knowles <[email protected]> wrote:


> Cloned to https://issues.apache.org/jira/browse/BEAM-12056
>
> On Thu, Mar 25, 2021 at 4:46 PM Brian Hulette <[email protected]> wrote:
>
>> Yes, this looks like https://issues.apache.org/jira/browse/BEAM-11929. I
>> removed it from the release blockers since there is a workaround (use a
>> NamedTuple type), but it's probably worth cherry-picking the fix.
>>
>> On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> This could be https://issues.apache.org/jira/browse/BEAM-11929
>>>
>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> This is definitely wrong. Looking into what's going on here, but this
>>>> seems severe enough to be a blocker for the next release.
>>>>
>>>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi, folks,
>>>>>
>>>>> I am playing around with the Python Dataframe API, and seem to have hit a
>>>>> schema issue when converting a PCollection to a dataframe. I wrote the
>>>>> following code for a simple test:
>>>>>
>>>>> import apache_beam as beam
>>>>> from apache_beam.dataframe.convert import to_dataframe
>>>>> from apache_beam.dataframe.convert import to_pcollection
>>>>>
>>>>> p = beam.Pipeline()
>>>>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(
>>>>>     lambda x: beam.Row(word=x[0], val=x[1]))
>>>>> _ = data | beam.Map(print)
>>>>> p.run()
>>>>>
>>>>> This shows the following:
>>>>> Row(val='1111', word='a')
>>>>> Row(val='2222', word='b')
>>>>>
>>>>> But if I use to_dataframe() to convert it into a df, the schema seems to
>>>>> be reversed:
>>>>>
>>>>> df = to_dataframe(data)
>>>>> dataCopy = to_pcollection(df)
>>>>> _ = dataCopy | beam.Map(print)
>>>>> p.run()
>>>>>
>>>>> I got:
>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>>>>
>>>>> It seems the columns 'word' and 'val' are now swapped. The problem seems to
>>>>> happen during to_dataframe(): if I print out df['word'], I get '1111' and
>>>>> '2222'. I am not sure whether I am doing something wrong or there is an
>>>>> issue in the schema conversion. Could someone help me take a look?
>>>>>
>>>>> Thanks, Xinyu
>>>>>
>>>>
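
For reference, the NamedTuple workaround mentioned above might look roughly like the sketch below. It assumes the same two-element input as the original report. The names WordVal, to_word_val, and build_pipeline are hypothetical and chosen only for illustration; they do not come from BEAM-11929 or from the Beam codebase.

```python
import typing


class WordVal(typing.NamedTuple):
    """Hypothetical schema type; field order is fixed by declaration."""
    word: str
    val: str


def to_word_val(pair):
    # Build the row by keyword so each value is bound to its field by name.
    return WordVal(word=pair[0], val=pair[1])


def build_pipeline():
    # Imported lazily so the helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe, to_pcollection

    p = beam.Pipeline()
    data = (
        p
        | beam.Create([('a', '1111'), ('b', '2222')])
        # Declare the element type explicitly instead of relying on
        # schema inference from beam.Row.
        | beam.Map(to_word_val).with_output_types(WordVal))
    df = to_dataframe(data)
    _ = to_pcollection(df) | beam.Map(print)
    return p
```

Calling build_pipeline().run() would then be expected to print rows whose 'word' field holds 'a' and 'b', though this has not been checked against the affected release.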
