Maarten D'Haene created DRILL-7762:
--------------------------------------
Summary: Parquet files with too many columns generated in Python
(pyarrow, pandas) are not readable
Key: DRILL-7762
URL: https://issues.apache.org/jira/browse/DRILL-7762
Project: Apache Drill
Issue Type: Bug
Components: Functions - Drill, SQL Parser, Storage - Parquet
Affects Versions: 1.17.0
Reporter: Maarten D'Haene
Fix For: Future
Attachments: error_drill_parquet.doc, shape_file_snappy512.parquet
When launching the query

SELECT * FROM s3.datascience.`./government/shape_file_snappy512.parquet`

on a parquet file with too many columns generated in Python, I get the following
error:
User Error Occurred: Error in drill parquet reader (complex). Message: Failure
in setting up reader Parquet Metadata: ParquetMetaData{FileMetaData{schema:
message schema { optional int64 OBJECTID_1; optional int64 OBJECTID; optional
binary Cs012011 (UTF8); optional double Nis_012011; optional binary Sec012011
(UTF8); optional binary CS102001 (UTF8); optional binary CS031991 (UTF8);
optional binary CS031981 (UTF8); optional binary Sector_nl (UTF8); optional
binary Sector_fr (UTF8); optional binary Gemeente (UTF8); optional binary
Commune (UTF8); optional binary Arrond_nl (UTF8); optional binary Arrond_fr
(UTF8); optional binary Prov_nl (UTF8); optional binary Prov_fr (UTF8);
optional binary Reg_nl (UTF8); optional binary Reg_fr (UTF8); optional binary
Nuts1 (UTF8); optional binary Nuts2 (UTF8); optional binary Nuts3_new (UTF8);
optional int64 Inhab; optional double Gis_Perime; optional double Gis_area_h;
optional double Cad_area_h; optional double Shape_Leng; optional double
Shape_Area; optional binary codesecteu (UTF8); optional binary CD_REFNIS
(UTF8); optional binary CD_SECTOR (UTF8); optional double TOTAL; optional
double MALES; optional double FEMALES; optional double group0_14; optional
double group15_64; optional double group65ETP; optional binary areaofdis
(UTF8); }
The parquet file is generated using pyarrow with compression codec 'snappy' and
a data page size of 512 MB; smaller or larger page sizes give the same error.
The files reside on an on-premise S3 object store (Dell ECS). Other queries on
the same parquet file (count(*), select OBJECTID_1 from ..) succeed, and a
'select *' on a parquet file with fewer columns generated the same way also
runs without any issues.

A workaround is to export a CSV file from Python and generate the parquet file
with Drill itself from that CSV file
(CREATE TABLE s3.datascience.`./government/tes3` AS SELECT * FROM
s3.datascience.`./government/shape_file.csv`). Querying a parquet file
generated this way does not cause any problems, even though its content is
exactly the same as the parquet file generated in Python. Is there an
explanation for why Drill behaves this way, and what are the specifications of
the parquet file generated by Drill itself (so we can aim to match those
specifications when creating a parquet file with pyarrow/pandas)?
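For reference, a minimal sketch of the pyarrow export described above. The file
names and source DataFrame are placeholders; only the compression codec and the
512 MB data page size follow the report.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder source data; the real data set has ~37 columns (see schema above).
df = pd.read_csv("shape_file.csv")
table = pa.Table.from_pandas(df)

# Write with snappy compression and 512 MB data pages, as described in the report.
pq.write_table(
    table,
    "shape_file_snappy512.parquet",
    compression="snappy",
    data_page_size=512 * 1024 * 1024,
)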
--
This message was sent by Atlassian Jira
(v8.3.4#803005)