Ying Wang created ARROW-3210:
--------------------------------

             Summary: Creating ParquetDataset with PyArrow creates partitioned ParquetFiles with mismatched Parquet schemas
                 Key: ARROW-3210
                 URL: https://issues.apache.org/jira/browse/ARROW-3210
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
         Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
            Reporter: Ying Wang
         Attachments: environment.yml, repro.csv, repro.py, repro_2.py

STEPS TO REPRODUCE:

1. Create a conda environment reflecting [^environment.yml]
2. Execute script [^repro.py], replacing various config variables to create a ParquetDataset on S3 given [^repro.csv]
3. Create reference of ParquetDataset using script [^repro_2.py], again replacing various config variables.

EXPECTED: Reference is created correctly.

GOT: Mismatched Arrow schemas in validate_schemas() method:

```python
*** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different.
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: string
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
            b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
            b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
            b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
            b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
            b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
            b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
            b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
            b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
            b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
            b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
            b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
            b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
            b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
            b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
            b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
            b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
            b'data": null}, {"name": "Destination", "field_name": "Destination'
            b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
            b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
            b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
            b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
            b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
            b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
            b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
            b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
            b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
            b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
            b'ndas_version": "0.21.0"}'}

vs

Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: null
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
            b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
            b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
            b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
            b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
            b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
            b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
            b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
            b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
            b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
            b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
            b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
            b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
            b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
            b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
            b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
            b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
            b'data": null}, {"name": "Destination", "field_name": "Destination'
            b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
            b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
            b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
            b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
            b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
            b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
            b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
            b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
            b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
            b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
            b'as_version": "0.21.0"}'}
```

The issue is with column *ExtraInfo*, where *pandas_type* is *unicode* in a partitioned ParquetDatasetPiece referencing the 2nd Parquet file created, while the ParquetDataset schema referencing the 1st Parquet file created has *pandas_type* *empty* for that same column.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)