[
https://issues.apache.org/jira/browse/PARQUET-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168957#comment-16168957
]
Itai Incze commented on PARQUET-1100:
-------------------------------------
[~wesmckinn] - I have a suggestion for where I think the bug originates - or at
least the wrong-length bug:
in {{FileReader::Impl::ReadSchemaField:433}} the calculation of {{batch_size}}
seems wrong - it uses the index of the top-level schema field instead of the
index of the leaf column. This works for the flat case but not when there is a
(leading) struct.
It seems to me one should locate the proper leaf column and take the index from
there; a minimal sketch of the idea follows below. (I guess that for structs,
since they currently do not contain repetition, it doesn't matter which leaf
column is taken.) I can offer a simple patch tomorrow if needed.
Nevertheless, I believe this refactoring (that is, having a stateful reader
with an array builder) is a very good and necessary one; it fixes some problems
I ran into while trying to implement arbitrarily nested reading.
> [C++] Reading repeated types should decode number of records rather than
> number of values
> -----------------------------------------------------------------------------------------
>
> Key: PARQUET-1100
> URL: https://issues.apache.org/jira/browse/PARQUET-1100
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.2.0
> Reporter: Jarno Seppanen
> Assignee: Wes McKinney
> Fix For: cpp-1.3.0
>
> Attachments:
> part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into a pandas dataframe and then using the
> dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...        .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label 69 non-null int32
> account_meta 69 non-null object
> features_type 69 non-null int32
> features_size 69 non-null int32
> features_indices 1 non-null object
> features_values 1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>>
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just {{print(df)}} is enough to trigger the segfault.