[jira] [Commented] (ARROW-10928) [C++][Parquet] Unknown error: data type leaf_count mismatch

Lucas da Silva Abreu (Jira) Wed, 16 Dec 2020 05:34:04 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250315#comment-17250315
 ]


Lucas da Silva Abreu commented on ARROW-10928:
----------------------------------------------

Hi [~jorisvandenbossche], glad to help!
If I can help you in any way, let me know.

> [C++][Parquet] Unknown error: data type leaf_count mismatch
> -----------------------------------------------------------
>
>                 Key: ARROW-10928
>                 URL: https://issues.apache.org/jira/browse/ARROW-10928
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: ubuntu 18.04
>            Reporter: Lucas da Silva Abreu
>            Priority: Blocker
>             Fix For: 3.0.0
>
>
> I was trying to write some dataframes to parquet using {{snappy}} with the 
> command
>  
> {code:java}
> df2.to_parquet('my-parquet', compression='snappy') {code}
>  
> But I got the following error
>  Unknown error: data type leaf_count != builder_leaf_count9 8
>  By manually trying subsets of the columns, I found that a column holding a 
> list of dicts was causing the issue.
> A toy example that reproduces the error is shown below.
>  
> {code:java}
> import pandas as pd
>
> # One row, one column: the cell holds a list of two nested dicts.
> df2 = pd.DataFrame(
>     [[
>         [
>             {
>                 'my_field_1': {},
>                 'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                                'my_field_23': 1, 'my_field_24': 1.0},
>                 'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
>                                'my_field_33': 1, 'my_field_34': 1},
>             },
>             {
>                 'my_field_1': {},
>                 'my_field_2': {'my_field_21': 'value_21', 'my_field_22': 1,
>                                'my_field_23': 1, 'my_field_24': 1.0},
>                 'my_field_3': {'my_field_31': 'value_31', 'my_field_32': 1,
>                                'my_field_33': 1, 'my_field_34': 1},
>             },
>         ]
>     ]],
>     columns=['my_column'],
> )
> df2['toy_column_1'] = 1
> df2['toy_column_2'] = 'ab'
> {code}
> Current configuration of my pandas is
> {code:java}
> INSTALLED VERSIONS
>  ------------------
>  commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
>  python : 3.6.9.final.0
>  python-bits : 64
>  OS : Linux
>  OS-release : 4.15.0-126-generic
>  Version : #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020
>  machine : x86_64
>  processor : x86_64
>  byteorder : little
>  LC_ALL : None
>  LANG : en_US.UTF-8
>  LOCALE : pt_BR.UTF-8
>
>  pandas : 1.1.4
>  numpy : 1.19.1
>  pytz : 2020.1
>  dateutil : 2.8.1
>  pip : 20.3
>  setuptools : 41.2.0
>  Cython : None
>  pytest : 5.1.1
>  hypothesis : None
>  sphinx : None
>  blosc : None
>  feather : None
>  xlsxwriter : None
>  lxml.etree : None
>  html5lib : None
>  pymysql : 0.10.1
>  psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
>  jinja2 : 2.11.2
>  IPython : 7.16.1
>  pandas_datareader: None
>  bs4 : None
>  bottleneck : None
>  fsspec : None
>  fastparquet : 0.4.1
>  gcsfs : None
>  matplotlib : 3.3.2
>  numexpr : None
>  odfpy : None
>  openpyxl : None
>  pandas_gbq : 0.10.0
>  pyarrow : 2.0.0
>  pytables : None
>  pyxlsb : None
>  s3fs : None
>  scipy : 1.5.2
>  sqlalchemy : 1.3.18
>  tables : None
>  tabulate : 0.8.7
>  xarray : None
>  xlrd : None
>  xlwt : None
>  numba : 0.52.0
> {code}
>  
>  
>  I found this pandas issue 
> ([https://github.com/pandas-dev/pandas/issues/34643]) that at first seemed to 
> me to have the same root cause, but I noticed that I was already using the 
> same version as in that issue, and the example from that issue works fine 
> for me.
>  Could someone please help me?
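
The manual column sampling described in the report can be automated. The following is a minimal sketch, not part of the original report: the helper name `find_failing_columns` and the stand-in writer `fake_write` are hypothetical, and the stand-in only imitates the reported failure so that the sketch runs without pandas or pyarrow installed.

```python
from typing import Callable, Dict, Iterable


def find_failing_columns(
    columns: Iterable[str],
    write_one: Callable[[str], None],
) -> Dict[str, str]:
    """Call write_one(column) for each column; map failing columns to their error text."""
    failures: Dict[str, str] = {}
    for name in columns:
        try:
            write_one(name)
        except Exception as exc:  # broad on purpose: the reported error type is unspecific
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical stand-in writer that imitates the reported failure.
# With the report's df2, write_one could instead be something like (assumption):
#   lambda col: df2[[col]].to_parquet(f"/tmp/{col}.parquet", compression="snappy")
def fake_write(column: str) -> None:
    if column == "my_column":
        raise RuntimeError("data type leaf_count != builder_leaf_count")


failures = find_failing_columns(
    ["my_column", "toy_column_1", "toy_column_2"], fake_write
)
print(failures)
```

Writing each column as its own single-column frame keeps the other columns out of the picture, so any column that still fails is a candidate for the nested data that triggers the error.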



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
