[ https://issues.apache.org/jira/browse/ARROW-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161583#comment-16161583 ]

Wes McKinney edited comment on ARROW-1440 at 9/11/17 5:04 PM:
--------------------------------------------------------------

Well, this is moderately concerning. Here is the Spark schema for the file

{code}
message spark_schema {
  optional int32 label;
  optional group account_meta {
    optional int32 cohort_date (DATE);
    optional binary country_code (UTF8);
    optional int32 arena;
    optional int32 max_arena;
    optional int32 xp_level;
  }
  optional int32 features_type;
  optional int32 features_size;
  optional group features_indices (LIST) {
    repeated group list {
      optional int32 element;
    }
  }
  optional group features_values (LIST) {
    repeated group list {
      optional double element;
    }
  }
}
{code}

For features_indices, one of the problematic columns, the 69 definition levels
are all 3, while the repetition levels start with a 0 (the first record)
followed by 68 1's. Since the element path for this column is
optional / repeated / optional, the max definition level is 3 and the max
repetition level is 1, so these levels describe a single record whose list
contains 69 values.
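
To make the level semantics concrete, here is a minimal record-assembly sketch
in plain Python (just the Dremel rule for this one-level list, not parquet-cpp
code): a repetition level of 0 starts a new record, and a definition level of 3
means the element is present. Applied to the levels above, it yields exactly one
record holding all 69 values.

{code}
def assemble_lists(rep_levels, def_levels, values, max_def=3):
    """Group decoded values into records for a one-level LIST column.

    A repetition level of 0 starts a new record; a definition level equal
    to max_def means the element is present. Lower definition levels (null
    list, empty list, null element) do not occur in this example.
    """
    records = []
    value_iter = iter(values)
    for rep, d in zip(rep_levels, def_levels):
        if rep == 0:
            records.append([])
        if d == max_def:
            records[-1].append(next(value_iter))
    return records

# 69 levels: definition levels all 3, repetition levels = one 0 followed by 68 1's
recs = assemble_lists([0] + [1] * 68, [3] * 69, list(range(69)))
assert len(recs) == 1 and len(recs[0]) == 69  # a single record with 69 values
{code}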

There should be many more definition / repetition levels, since the table has 69
*records*, not 69 values. parquet-cpp stopped decoding levels once it had decoded
69 values, but I think it should actually keep going until it has seen 69 new
records (i.e. 69 repetition levels equal to 0); see the sketch after the level
counts below. I looked at this file in PySpark, and its output confirms that each
record has its own much longer list:

{code}
   label                  account_meta  features_type  features_size  \
0     21     (2016-03-06, IN, 7, 7, 8)              0           3815   
1     21     (2017-03-25, RU, 8, 8, 9)              0           3815   
2     17     (2016-11-26, DE, 7, 7, 7)              0           3815   
3     22     (2017-02-22, BR, 8, 8, 8)              0           3815   
4     17  (2016-03-23, IT, 10, 10, 10)              0           3815   

                                    features_indices  \
0  [1, 2, 5, 6, 7, 8, 11, 12, 13, 15, 17, 18, 21,...   
1  [0, 1, 2, 5, 6, 7, 9, 12, 14, 15, 16, 17, 21, ...   
2  [1, 4, 7, 9, 11, 12, 13, 14, 15, 16, 17, 19, 2...   
3  [12, 15, 17, 22, 68, 70, 72, 74, 91, 96, 99, 1...   
4  [1, 5, 8, 15, 17, 21, 24, 41, 68, 79, 85, 89, ...   

                                     features_values  
0  [0.6931471805599453, 0.6931471805599453, 3.258...  
1  [2.772588722239781, 0.6931471805599453, 1.6094...  
2  [1.9459101490553132, 1.3862943611198906, 2.397...  
3  [1.0986122886681098, 1.0986122886681098, 1.098...  
4  [0.6931471805599453, 1.791759469228055, 1.7917...  
{code}

The number of values in each record is much larger than 69:

{code}
>>> df['features_indices'].map(len)

0     805
1     781
2     733
3     672
4     783
5     658
6     663
7     572
8     533
9     287
10    732
11    840
12    621
13    881
14    734
15    134
{code}
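
In other words, the reader's stopping condition should be phrased in terms of
records rather than values. A rough sketch of the idea in Python (not the actual
parquet-cpp code; the function name is made up for illustration):

{code}
def count_levels_for_records(rep_levels, num_records):
    """Return how many level entries cover num_records complete records.

    Each repetition level of 0 starts a new record, so the loop counts
    records instead of stopping after num_records values.
    """
    records_seen = 0
    for i, rep in enumerate(rep_levels):
        if rep == 0:
            if records_seen == num_records:
                return i  # the next record starts here
            records_seen += 1
    return len(rep_levels)

# Two records with 3 and 2 elements: all 5 level entries belong to them
assert count_levels_for_records([0, 1, 1, 0, 1], 2) == 5
# Stopping after the first record consumes only 3 entries
assert count_levels_for_records([0, 1, 1, 0, 1], 1) == 3
{code}

For this file, counting 69 repetition-level-0 markers would keep the decoder
going through all of the list elements (805 + 781 + ... in the output above)
instead of stopping after the first 69 level entries.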



> [Python] Segmentation fault after loading parquet file to pandas dataframe
> --------------------------------------------------------------------------
>
>                 Key: ARROW-1440
>                 URL: https://issues.apache.org/jira/browse/ARROW-1440
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.6.0
>         Environment: ubuntu 16.04.2
>            Reporter: Jarno Seppanen
>            Assignee: Wes McKinney
>             Fix For: 0.7.0
>
>         Attachments: part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet
>
>
> Reading the attached parquet file into a pandas dataframe and then using the
> dataframe segfaults.
> {noformat}
> Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 
> >>> import pyarrow
> >>> import pyarrow.parquet as pq
> >>> pyarrow.__version__
> '0.6.0'
> >>> import pandas as pd
> >>> pd.__version__
> '0.19.0'
> >>> df = pq.read_table('part-00000-6570e34b-b42c-4a39-8adf-21d3a97fb87d.snappy.parquet') \
> ...        .to_pandas()
> >>> len(df)
> 69
> >>> df.info()
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 69 entries, 0 to 68
> Data columns (total 6 columns):
> label               69 non-null int32
> account_meta        69 non-null object
> features_type       69 non-null int32
> features_size       69 non-null int32
> features_indices    1 non-null object
> features_values     1 non-null object
> dtypes: int32(3), object(3)
> memory usage: 2.5+ KB
> >>> 
> >>> pd.concat([df, df])
> Segmentation fault (core dumped)
> {noformat}
> Actually, just print(df) is enough to trigger the segfault.


