[
https://issues.apache.org/jira/browse/ARROW-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161583#comment-16161583
]
Wes McKinney edited comment on ARROW-1440 at 9/11/17 5:04 PM:
--------------------------------------------------------------
Well, this is moderately concerning. Here is the Spark schema for the file:
{code}
message spark_schema {
  optional int32 label;
  optional group account_meta {
    optional int32 cohort_date (DATE);
    optional binary country_code (UTF8);
    optional int32 arena;
    optional int32 max_arena;
    optional int32 xp_level;
  }
  optional int32 features_type;
  optional int32 features_size;
  optional group features_indices (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
  optional group features_values (LIST) {
    repeated group list {
      optional double element;
    }
  }
}
{code}
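For reference, the level values discussed below follow from this schema: along the path features_indices.list.element, each optional or repeated node adds one to the maximum definition level, and each repeated node adds one to the maximum repetition level, so definition level 3 means a fully defined (non-null) leaf value. A minimal sketch of that arithmetic (plain Python, not parquet-cpp code):
{code}
# Path to the leaf: features_indices (optional) -> list (repeated)
# -> element (optional). Optional and repeated nodes each raise the
# max definition level; only repeated nodes raise the max repetition level.
path = ["optional", "repeated", "optional"]
max_def_level = sum(1 for node in path if node in ("optional", "repeated"))  # 3
max_rep_level = sum(1 for node in path if node == "repeated")                # 1
assert (max_def_level, max_rep_level) == (3, 1)
{code}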
For features_indices, one of the problematic columns, the first 69 definition levels
are all 3, while the repetition levels start with a 0 (the start of the first record)
followed by 68 1's. That encodes a single record whose list contains 69 values.
There should be many more definition / repetition levels, since the table has 69
*records*. parquet-cpp stops decoding levels once it has read 69 values, but I
think it should actually keep going until it has seen 69 record boundaries
(repetition level 0 here); see the sketch after the output below. Loading this
file in PySpark confirms it:
{code}
   label                    account_meta  features_type  features_size  \
0     21       (2016-03-06, IN, 7, 7, 8)              0           3815
1     21       (2017-03-25, RU, 8, 8, 9)              0           3815
2     17       (2016-11-26, DE, 7, 7, 7)              0           3815
3     22       (2017-02-22, BR, 8, 8, 8)              0           3815
4     17    (2016-03-23, IT, 10, 10, 10)              0           3815

                                    features_indices  \
0  [1, 2, 5, 6, 7, 8, 11, 12, 13, 15, 17, 18, 21,...
1  [0, 1, 2, 5, 6, 7, 9, 12, 14, 15, 16, 17, 21, ...
2  [1, 4, 7, 9, 11, 12, 13, 14, 15, 16, 17, 19, 2...
3  [12, 15, 17, 22, 68, 70, 72, 74, 91, 96, 99, 1...
4  [1, 5, 8, 15, 17, 21, 24, 41, 68, 79, 85, 89, ...

                                     features_values
0  [0.6931471805599453, 0.6931471805599453, 3.258...
1  [2.772588722239781, 0.6931471805599453, 1.6094...
2  [1.9459101490553132, 1.3862943611198906, 2.397...
3  [1.0986122886681098, 1.0986122886681098, 1.098...
4  [0.6931471805599453, 1.791759469228055, 1.7917...
{code}
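As promised above, here is a minimal sketch of the record-boundary rule this implies (plain Python, not parquet-cpp's actual decoder): repetition level 0 starts a new record, so decoding should continue until 69 zeros have been seen, not stop after 69 values:
{code}
# A repetition level of 0 starts a new record; any other value continues
# the current record's list, so the record count equals the number of
# zeros in the repetition-level stream.
def count_records(rep_levels):
    return sum(1 for level in rep_levels if level == 0)

# The levels parquet-cpp actually consumed for features_indices:
rep_levels = [0] + [1] * 68
assert count_records(rep_levels) == 1  # only one complete record decoded
{code}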
The number of values in each entry is much larger than 69:
{code}
>>> df['features_indices'].map(len)
0 805
1 781
2 733
3 672
4 783
5 658
6 663
7 572
8 533
9 287
10 732
11 840
12 621
13 881
14 734
15 134
{code}
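As a sanity check (simple arithmetic on the lengths printed above, no new data), the 16 lengths shown already sum to 10,429 leaf values, so the decoder clearly cannot be done after reading only 69:
{code}
# Per-record list lengths printed above; their sum alone exceeds
# 10,000 leaf values for just the first 16 of the 69 records.
shown_lengths = [805, 781, 733, 672, 783, 658, 663, 572, 533, 287,
                 732, 840, 621, 881, 734, 134]
assert sum(shown_lengths) == 10429
# Full-column check on the DataFrame loaded above:
#   df['features_indices'].map(len).sum()
{code}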
> [Python] Segmentation fault after loading parquet file to pandas dataframe
> --------------------------------------------------------------------------
>
> Key: ARROW-1440
> URL: https://issues.apache.org/jira/browse/ARROW-1440
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.6.0
> Environment: ubuntu 16.04.2
> Reporter: Jarno Seppanen
> Assignee: Wes McKinney
> Fix For: 0.7.0
>
> Attachments:
> part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into a pandas dataframe and then using the
> dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...          .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label 69 non-null int32
> account_meta 69 non-null object
> features_type 69 non-null int32
> features_size 69 non-null int32
> features_indices 1 non-null object
> features_values 1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>>
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just print(df) is enough to trigger the segfault.