[jira] [Updated] (ARROW-10928) [Python] Unknown error: data type leaf_count mismatch

Lucas da Silva Abreu (Jira) Tue, 15 Dec 2020 11:59:04 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lucas da Silva Abreu updated ARROW-10928:
-----------------------------------------
    Description: 
I was trying to write some dataframes to Parquet with {{snappy}} compression, 
using the command

{code:python}
df2.to_parquet('my-parquet', compression='snappy')
{code}

But I got the following error:

 Unknown error: data type leaf_count != builder_leaf_count9 8

 By manually testing subsets of the columns, I found out that a column 
containing a list of dicts was causing the issue.
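
The column-by-column check can be sketched roughly as follows. This is only an illustration, not the exact procedure used for this report: {{find_failing_columns}} and {{write_column}} are hypothetical names, and the stand-in writer at the bottom merely simulates a failure so that the sketch runs without pandas or pyarrow installed.

```python
def find_failing_columns(column_names, write_column):
    """Call write_column(name) for every column and collect the failures.

    write_column is a caller-supplied function (hypothetical here) that
    writes a single column and raises an exception if the write fails.
    Returns a dict mapping each failing column name to its error text.
    """
    failures = {}
    for name in column_names:
        try:
            write_column(name)
        except Exception as exc:  # broad on purpose: the error type is not known up front
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    # Stand-in writer (an assumption for the demo): it only pretends that
    # 'my_column' cannot be written and accepts every other column.
    def fake_writer(name):
        if name == "my_column":
            raise RuntimeError("simulated write failure")

    columns = ["my_column", "toy_column_1", "toy_column_2"]
    print(find_failing_columns(columns, fake_writer))
```

With a real dataframe, the writer could be something like {{lambda name: df2[[name]].to_parquet('/tmp/' + name + '.parquet', compression='snappy')}}, where the output path is an arbitrary choice.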

The toy example below reproduces the error:

{code:python}
import pandas as pd

# One row whose single cell holds a list of two identical nested dicts.
df2 = pd.DataFrame(
    [[
        [{'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
                         'my_field_23': 1, 'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}},
         {'my_field_1': {},
          'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
                         'my_field_23': 1, 'my_field_24': 1.0},
          'my_field_3': {'my_field_31': 'value_31',
                         'my_field_32': 1,
                         'my_field_33': 1,
                         'my_field_34': 1}}]
    ]], columns=['my_column'])
df2['toy_column_1'] = 1
df2['toy_column_2'] = 'ab'

# The write that is reported to fail.
df2.to_parquet('my-parquet', compression='snappy')
{code}
My current pandas configuration is:
{code:java}
INSTALLED VERSIONS
 ------------------
 commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
 python : 3.6.9.final.0
 python-bits : 64
 OS : Linux
 OS-release : 4.15.0-126-generic
 Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
 machine : x86_64
 processor : x86_64
 byteorder : little
 LC_ALL : None
 LANG : en_US.UTF-8
 LOCALE : pt_BR.UTF-8

 pandas : 1.1.4
 numpy : 1.19.1
 pytz : 2020.1
 dateutil : 2.8.1
 pip : 20.3
 setuptools : 41.2.0
 Cython : None
 pytest : 5.1.1
 hypothesis : None
 sphinx : None
 blosc : None
 feather : None
 xlsxwriter : None
 lxml.etree : None
 html5lib : None
 pymysql : 0.10.1
 psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
 jinja2 : 2.11.2
 IPython : 7.16.1
 pandas_datareader: None
 bs4 : None
 bottleneck : None
 fsspec : None
 fastparquet : 0.4.1
 gcsfs : None
 matplotlib : 3.3.2
 numexpr : None
 odfpy : None
 openpyxl : None
 pandas_gbq : 0.10.0
 pyarrow : 2.0.0
 pytables : None
 pyxlsb : None
 s3fs : None
 scipy : 1.5.2
 sqlalchemy : 1.3.18
 tables : None
 tabulate : 0.8.7
 xarray : None
 xlrd : None
 xlwt : None
 numba : 0.52.0
{code}
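
Only a few entries in the listing above are likely to matter for a Parquet write. Both pyarrow and fastparquet are installed, and with the default {{engine='auto'}} pandas tries pyarrow first. A minimal sketch for reporting just those versions is shown below. It assumes Python 3.8 or newer for {{importlib.metadata}} (the environment in this report is 3.6.9, where the {{importlib_metadata}} backport would be needed instead), and {{installed_versions}} is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "pyarrow", "fastparquet")):
    """Return {package name: version string, or None if not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```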
 

 
 I found a pandas issue 
([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to 
have the same root cause, but I noticed that I was already using the same 
version as in that issue, and the example from that issue works fine for me.
 Could someone please help me?
  

  was:
I was trying to write some dataframes to parquet using {{snappy}} with the 
command

 

 {{}}
{code:java}

{code}
{{[// df2.to_parquet('my-parquet', compression='snappy')|http://df.to/]}}

But I got the following error
 Unknown error: data type leaf_count != builder_leaf_count9 8
 By manually sampling with columns, I found out that a column that is a list of 
dicts was causing the issue

A toy example is shown below which enables one to reproduce the error
 df2 = pd.DataFrame(
 [[
 [\{'my_field_1': {},
 'my_field_2':

{'my_field_21': 'value_21', 'my_field_22': 1, 'my_field_23': 1, 'my_field_24': 
1.0}

,
 'my_field_3': {'my_field_31': 'value_31',
 'my_field_32': 1,
 'my_field_33': 1,
 'my_field_34': 1}},
 \{'my_field_1': {},
 'my_field_2':

{'my_field_21': 'value_21', 'my_field_22': 1, 'my_field_23': 1, 'my_field_24': 
1.0}

,
 'my_field_3': {'my_field_31': 'value_31',
 'my_field_32': 1,
 'my_field_33': 1,
 'my_field_34': 1}}]
 ]], columns = ['my_column'])
 df2['toy_column_1'] = 1
 df2['toy_column_2'] = 'ab'
 Current configuration of my pandas is
 INSTALLED VERSIONS
 ------------------
 commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
 python : 3.6.9.final.0
 python-bits : 64
 OS : Linux
 OS-release : 4.15.0-126-generic
 Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
 machine : x86_64
 processor : x86_64
 byteorder : little
 LC_ALL : None
 LANG : en_US.UTF-8
 LOCALE : pt_BR.UTF-8
 pandas : 1.1.4
 numpy : 1.19.1
 pytz : 2020.1
 dateutil : 2.8.1
 pip : 20.3
 setuptools : 41.2.0
 Cython : None
 pytest : 5.1.1
 hypothesis : None
 sphinx : None
 blosc : None
 feather : None
 xlsxwriter : None
 lxml.etree : None
 html5lib : None
 pymysql : 0.10.1
 psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
 jinja2 : 2.11.2
 IPython : 7.16.1
 pandas_datareader: None
 bs4 : None
 bottleneck : None
 fsspec : None
 fastparquet : 0.4.1
 gcsfs : None
 matplotlib : 3.3.2
 numexpr : None
 odfpy : None
 openpyxl : None
 pandas_gbq : 0.10.0
 pyarrow : 2.0.0
 pytables : None
 pyxlsb : None
 s3fs : None
 scipy : 1.5.2
 sqlalchemy : 1.3.18
 tables : None
 tabulate : 0.8.7
 xarray : None
 xlrd : None
 xlwt : None
 numba : 0.52.0
  
 I have found this issue within pandas 
([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to 
me be the same root cause, but I've noticed that was already using the same 
version of the issue and that the example in the original issue worked fine to 
me.
 Could someone please help me ?
  
 \{{}}


> [Python] Unknown error: data type leaf_count mismatch
> -----------------------------------------------------
>
>                 Key: ARROW-10928
>                 URL: https://issues.apache.org/jira/browse/ARROW-10928
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: ubuntu 18.04
>            Reporter: Lucas da Silva Abreu
>            Priority: Blocker
>
> I was trying to write some dataframes to Parquet with {{snappy}} 
> compression, using the command
>  
> {code:python}
> df2.to_parquet('my-parquet', compression='snappy')
> {code}
>  
> But I got the following error:
>  Unknown error: data type leaf_count != builder_leaf_count9 8
>  By manually testing subsets of the columns, I found out that a column 
> containing a list of dicts was causing the issue.
> The toy example below reproduces the error:
>  
> {code:python}
> import pandas as pd
>
> # One row whose single cell holds a list of two identical nested dicts.
> df2 = pd.DataFrame(
>     [[
>         [{'my_field_1': {},
>           'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                          'my_field_23': 1, 'my_field_24': 1.0},
>           'my_field_3': {'my_field_31': 'value_31',
>                          'my_field_32': 1,
>                          'my_field_33': 1,
>                          'my_field_34': 1}},
>          {'my_field_1': {},
>           'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                          'my_field_23': 1, 'my_field_24': 1.0},
>           'my_field_3': {'my_field_31': 'value_31',
>                          'my_field_32': 1,
>                          'my_field_33': 1,
>                          'my_field_34': 1}}]
>     ]], columns=['my_column'])
> df2['toy_column_1'] = 1
> df2['toy_column_2'] = 'ab'
>
> # The write that is reported to fail.
> df2.to_parquet('my-parquet', compression='snappy')
> {code}
> My current pandas configuration is:
> {code:java}
> INSTALLED VERSIONS
>  ------------------
>  commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
>  python : 3.6.9.final.0
>  python-bits : 64
>  OS : Linux
>  OS-release : 4.15.0-126-generic
>  Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
>  machine : x86_64
>  processor : x86_64
>  byteorder : little
>  LC_ALL : None
>  LANG : en_US.UTF-8
>  LOCALE : pt_BR.UTF-8
>
>  pandas : 1.1.4
>  numpy : 1.19.1
>  pytz : 2020.1
>  dateutil : 2.8.1
>  pip : 20.3
>  setuptools : 41.2.0
>  Cython : None
>  pytest : 5.1.1
>  hypothesis : None
>  sphinx : None
>  blosc : None
>  feather : None
>  xlsxwriter : None
>  lxml.etree : None
>  html5lib : None
>  pymysql : 0.10.1
>  psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
>  jinja2 : 2.11.2
>  IPython : 7.16.1
>  pandas_datareader: None
>  bs4 : None
>  bottleneck : None
>  fsspec : None
>  fastparquet : 0.4.1
>  gcsfs : None
>  matplotlib : 3.3.2
>  numexpr : None
>  odfpy : None
>  openpyxl : None
>  pandas_gbq : 0.10.0
>  pyarrow : 2.0.0
>  pytables : None
>  pyxlsb : None
>  s3fs : None
>  scipy : 1.5.2
>  sqlalchemy : 1.3.18
>  tables : None
>  tabulate : 0.8.7
>  xarray : None
>  xlrd : None
>  xlwt : None
>  numba : 0.52.0
> {code}
>  
>  
>  I found a pandas issue 
> ([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to 
> have the same root cause, but I noticed that I was already using the same 
> version as in that issue, and the example from that issue works fine for me.
>  Could someone please help me?
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
