[
https://issues.apache.org/jira/browse/PARQUET-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168957#comment-16168957
]
Itai Incze commented on PARQUET-1100:
-------------------------------------
[~wesmckinn] - I have a suggestion for where I think the bug originates - or at
least the wrong-length bug:
in {{FileReader::Impl::ReadSchemaField:433}} the calculation of {{batch_size}}
seems wrong - it uses the index of the top-level schema field instead of the
index of the leaf column. This works for the flat case but not when there is a
(leading) struct.
It seems to me one should locate the proper leaf column and take the index from
there; a minimal sketch of the idea follows below. (I guess that for structs,
since they currently do not contain repetition, it doesn't matter which leaf
column is taken.) I can offer a simple patch tomorrow if needed.
Nevertheless, I believe this refactoring (that is, having a stateful reader
with an array builder) is a very good and necessary one; it fixes some problems
I ran into while trying to implement arbitrarily nested reading.
> [C++] Reading repeated types should decode number of records rather than
> number of values
> -----------------------------------------------------------------------------------------
>
> Key: PARQUET-1100
> URL: https://issues.apache.org/jira/browse/PARQUET-1100
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.2.0
> Reporter: Jarno Seppanen
> Assignee: Wes McKinney
> Fix For: cpp-1.3.0
>
> Attachments:
> part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into a pandas dataframe and then using the
> dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...        .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label 69 non-null int32
> account_meta 69 non-null object
> features_type 69 non-null int32
> features_size 69 non-null int32
> features_indices 1 non-null object
> features_values 1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>>
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just {{print(df)}} is enough to trigger the segfault.