[
https://issues.apache.org/jira/browse/PARQUET-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162414#comment-16162414
]
Wes McKinney commented on PARQUET-1100:
---------------------------------------
[~xhochy] it appears that the logic in
{{TypedColumnReader<T>::ReadBatchSpaced}} is broken for some cases of repeated
types: the records (delimited by repetition level 0) are not being properly
delimited. Most of our nested-types code will eventually need to be rewritten
to handle arbitrary nested types in the general case, but we need to fix at
least this particular issue before 1.3.0 goes out.
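To illustrate the record-vs-value distinction (a hypothetical sketch, not the fix itself): for a repeated column, repetition level 0 marks the start of a new record, so the number of records is the count of zero repetition levels, while the number of leaf values is the length of the whole level stream. A reader that delimits batches by value count will split records mid-list.

{noformat}
# Repetition levels for the repeated column [[1, 2, 3], [4], [5, 6]]:
# level 0 starts a new record; a nonzero level continues the current one.
rep_levels = [0, 1, 1, 0, 0, 1]

num_values = len(rep_levels)                        # 6 leaf values
num_records = sum(1 for r in rep_levels if r == 0)  # 3 records
{noformat}

Decoding "3" here by value count would stop inside the first list, which is the kind of mis-delimiting described above.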
> [C++] Reading repeated types should decode number of records rather than
> number of values
> -----------------------------------------------------------------------------------------
>
> Key: PARQUET-1100
> URL: https://issues.apache.org/jira/browse/PARQUET-1100
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.2.0
> Reporter: Jarno Seppanen
> Fix For: cpp-1.3.0
>
> Attachments:
> part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into pandas dataframe and then using the
> dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...     .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label 69 non-null int32
> account_meta 69 non-null object
> features_type 69 non-null int32
> features_size 69 non-null int32
> features_indices 1 non-null object
> features_values 1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>>
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just {{print(df)}} is enough to trigger the segfault.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)