[
https://issues.apache.org/jira/browse/ARROW-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250315#comment-17250315
]
Lucas da Silva Abreu commented on ARROW-10928:
----------------------------------------------
Hi [~jorisvandenbossche], glad to help!
If I can help in any way, let me know.
> [C++][Parquet] Unknown error: data type leaf_count mismatch
> -----------------------------------------------------------
>
> Key: ARROW-10928
> URL: https://issues.apache.org/jira/browse/ARROW-10928
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: ubuntu 18.04
> Reporter: Lucas da Silva Abreu
> Priority: Blocker
> Fix For: 3.0.0
>
>
> I was trying to write some dataframes to parquet using {{snappy}} with the
> command
>
> {code:python}
> df2.to_parquet('my-parquet', compression='snappy')
> {code}
>
> But I got the following error:
> Unknown error: data type leaf_count != builder_leaf_count9 8
> By manually testing subsets of the columns, I found that a column containing
> a list of dicts was causing the issue.
> A toy example that reproduces the error is shown below:
>
> {code:python}
> import pandas as pd
>
> # A single cell holding a list of two identical nested dicts.
> # Note that 'my_field_1' is an empty dict in both.
> df2 = pd.DataFrame(
>     [[
>         [{'my_field_1': {},
>           'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                          'my_field_23': 1, 'my_field_24': 1.0},
>           'my_field_3': {'my_field_31': 'value_31',
>                          'my_field_32': 1,
>                          'my_field_33': 1,
>                          'my_field_34': 1}},
>          {'my_field_1': {},
>           'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                          'my_field_23': 1, 'my_field_24': 1.0},
>           'my_field_3': {'my_field_31': 'value_31',
>                          'my_field_32': 1,
>                          'my_field_33': 1,
>                          'my_field_34': 1}}]
>     ]], columns=['my_column'])
> df2['toy_column_1'] = 1
> df2['toy_column_2'] = 'ab'
> {code}
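A side note on the numbers in the error message: the sketch below counts the leaf fields of one record from the toy example in two ways, once treating the empty dict in 'my_field_1' as a leaf and once ignoring it. The helper count_leaves is hypothetical and is not an Arrow API. The two counts happen to match the "9 8" in the error, which may hint that the empty dict is involved, but this is only an assumption and not a confirmed diagnosis.

```python
# One record from the 'my_column' list in the toy example above.
record = {
    "my_field_1": {},
    "my_field_2": {"my_field_21": "value_21", "my_field_22": 1,
                   "my_field_23": 1, "my_field_24": 1.0},
    "my_field_3": {"my_field_31": "value_31", "my_field_32": 1,
                   "my_field_33": 1, "my_field_34": 1},
}


def count_leaves(value, empty_dict_is_leaf):
    """Count leaf fields in a nested dict (hypothetical helper, not an Arrow API)."""
    if isinstance(value, dict):
        if not value:
            # An empty dict has no children; count it as a leaf only if asked to.
            return 1 if empty_dict_is_leaf else 0
        return sum(count_leaves(v, empty_dict_is_leaf) for v in value.values())
    return 1  # any non-dict value is a leaf


with_empty = count_leaves(record, empty_dict_is_leaf=True)
without_empty = count_leaves(record, empty_dict_is_leaf=False)
print(with_empty, without_empty)  # prints: 9 8
```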
> My current pandas configuration is:
> {code:java}
> INSTALLED VERSIONS
> ------------------
> commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
> python : 3.6.9.final.0
> python-bits : 64
> OS : Linux
> OS-release : 4.15.0-126-generic
> Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
> machine : x86_64
> processor : x86_64
> byteorder : little
> LC_ALL : None
> LANG : en_US.UTF-8
> LOCALE : pt_BR.UTF-8
>
> pandas : 1.1.4
> numpy : 1.19.1
> pytz : 2020.1
> dateutil : 2.8.1
> pip : 20.3
> setuptools : 41.2.0
> Cython : None
> pytest : 5.1.1
> hypothesis : None
> sphinx : None
> blosc : None
> feather : None
> xlsxwriter : None
> lxml.etree : None
> html5lib : None
> pymysql : 0.10.1
> psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
> jinja2 : 2.11.2
> IPython : 7.16.1
> pandas_datareader: None
> bs4 : None
> bottleneck : None
> fsspec : None
> fastparquet : 0.4.1
> gcsfs : None
> matplotlib : 3.3.2
> numexpr : None
> odfpy : None
> openpyxl : None
> pandas_gbq : 0.10.0
> pyarrow : 2.0.0
> pytables : None
> pyxlsb : None
> s3fs : None
> scipy : 1.5.2
> sqlalchemy : 1.3.18
> tables : None
> tabulate : 0.8.7
> xarray : None
> xlrd : None
> xlwt : None
> numba : 0.52.0
> {code}
>
>
> I found this pandas issue
> ([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to
> me to have the same root cause, but I noticed that I was already using the
> same version as in that issue, and the example from that issue worked fine
> for me.
> Could someone please help me?
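
For anyone hitting the same error, the manual per-column check described in the report could be automated roughly as follows. This is only a sketch: find_failing_columns and fake_write are hypothetical names, and the stand-in writer simulates the failure so that the example runs without pandas or pyarrow installed.

```python
def find_failing_columns(column_names, try_write):
    """Return {column name: error message} for every column whose write raises.

    try_write is a caller-supplied callable that takes one column name and
    attempts to write just that column.
    """
    failures = {}
    for name in column_names:
        try:
            try_write(name)
        except Exception as exc:  # broad on purpose: we only want to locate the column
            failures[name] = str(exc)
    return failures


def fake_write(name):
    """Stand-in writer (an assumption) that fails only for 'my_column'."""
    if name == "my_column":
        raise RuntimeError("data type leaf_count != builder_leaf_count")


failures = find_failing_columns(
    ["my_column", "toy_column_1", "toy_column_2"], fake_write
)
print(failures)
```

With the real frame, try_write could be something like lambda name: df2[[name]].to_parquet(name + '.parquet', compression='snappy'), so that each column is written on its own and only the failing ones are reported.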