[
https://issues.apache.org/jira/browse/PARQUET-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162414#comment-16162414
]
Wes McKinney commented on PARQUET-1100:
---------------------------------------
[~xhochy] it appears that the logic in
{{TypedColumnReader<T>::ReadBatchSpaced}} is broken for some cases of repeated
types: the records (delimited by repetition level 0) are not being properly
delimited. Most of our nested-types code will eventually need to be rewritten
to handle arbitrary nested types in the general case, but we need to fix at
least this particular issue before 1.3.0 goes out.
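To illustrate the record-vs-value distinction (a hypothetical sketch, not the fix itself): for a repeated column, repetition level 0 marks the start of a new record, so the number of records is the count of zero repetition levels, while the number of leaf values is the length of the whole level stream. A reader that delimits batches by value count will split records mid-list.

{noformat}
# Repetition levels for the repeated column [[1, 2, 3], [4], [5, 6]]:
# level 0 starts a new record; a nonzero level continues the current one.
rep_levels = [0, 1, 1, 0, 0, 1]

num_values = len(rep_levels)                        # 6 leaf values
num_records = sum(1 for r in rep_levels if r == 0)  # 3 records
{noformat}

Decoding "3" here by value count would stop inside the first list, which is the kind of mis-delimiting described above.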
> [C++] Reading repeated types should decode number of records rather than
> number of values
> -----------------------------------------------------------------------------------------
>
> Key: PARQUET-1100
> URL: https://issues.apache.org/jira/browse/PARQUET-1100
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.2.0
> Reporter: Jarno Seppanen
> Fix For: cpp-1.3.0
>
> Attachments:
> part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into pandas dataframe and then using the
> dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...     .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label 69 non-null int32
> account_meta 69 non-null object
> features_type 69 non-null int32
> features_size 69 non-null int32
> features_indices 1 non-null object
> features_values 1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>>
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just {{print(df)}} is enough to trigger the segfault.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)