[
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618804#comment-17618804
]
Yibo Cai commented on ARROW-17983:
----------------------------------
cc [[email protected]] for comments.
> [Parquet][C++][Python] "List index overflow" when read parquet file
> -------------------------------------------------------------------
>
> Key: ARROW-17983
> URL: https://issues.apache.org/jira/browse/ARROW-17983
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet, Python
> Reporter: Yibo Cai
> Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n <
> max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on reproducing this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Based on testing with a small dataset, the error might come from the code below:
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is
> incremented) {{m * n}} times, which exceeds {{max(int32)}}.
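
The arithmetic behind the description above can be sketched in a few lines of Python. This is only an illustration of the size condition and of what a 32-bit signed offset would do once it passes {{max(int32)}}; it is not Arrow's code and does not read or write any parquet file. The helper names ({{triggers_overflow}}, {{wrapped_int32}}) and the row and list sizes are hypothetical values chosen for the example.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def triggers_overflow(n_rows: int, list_len: int) -> bool:
    """Return True when the row count fits in int32 but the total
    number of list elements (n_rows * list_len) does not."""
    return n_rows < INT32_MAX and n_rows * list_len > INT32_MAX


def wrapped_int32(value: int) -> int:
    """Show what a value becomes if stored in a signed 32-bit integer
    without any overflow check (ctypes truncates silently)."""
    return ctypes.c_int32(value).value


# Hypothetical sizes: 1,000,000 rows, each holding a list of 3,000 integers.
n, m = 1_000_000, 3_000
total_elements = n * m  # number of times the offset would be incremented

print("n < max(int32):     ", n < INT32_MAX)
print("m * n > max(int32): ", total_elements > INT32_MAX)
print("condition from the report met:", triggers_overflow(n, m))
print("offset if stored unchecked in int32:", wrapped_int32(total_elements))
```

With these assumed sizes, the single column holds 3,000,000,000 list elements in total, so an {{int32}} offset counting them cannot represent the final value, even though the row count alone is far below the limit.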
--
This message was sent by Atlassian Jira
(v8.20.10#820010)