[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422572#comment-16422572 ]
Kyle Barron commented on ARROW-2372:
------------------------------------

I edited my code to the script below, which, I believe, writes a parquet file with just the first 2GB csv chunk, then with the first two, and so on, checking each time that it can open the output. Here's the traceback first, which suggests that it was able to open the Parquet file representing around 6GB of csv data, but not the Parquet file representing about 8GB of csv data.

{code:java}
Starting conversion, up to iteration 0
	0.12 minutes
Finished reading csv block 0
	0.43 minutes
Finished writing parquet block 0
	1.80 minutes
Starting conversion, up to iteration 1
	1.80 minutes
Finished reading csv block 0
	2.12 minutes
Finished writing parquet block 0
	3.49 minutes
Finished reading csv block 1
	3.80 minutes
Finished writing parquet block 1
	5.19 minutes
Starting conversion, up to iteration 2
	5.20 minutes
Finished reading csv block 0
	5.52 minutes
Finished writing parquet block 0
	6.91 minutes
Finished reading csv block 1
	7.22 minutes
Finished writing parquet block 1
	8.59 minutes
Finished reading csv block 2
	8.92 minutes
Finished writing parquet block 2
	10.29 minutes
Starting conversion, up to iteration 3
	10.29 minutes
Finished reading csv block 0
	10.60 minutes
Finished writing parquet block 0
	11.98 minutes
Finished reading csv block 1
	12.30 minutes
Finished writing parquet block 1
	13.66 minutes
Finished reading csv block 2
	13.98 minutes
Finished writing parquet block 2
	15.35 minutes
Finished reading csv block 3
	15.68 minutes
Finished writing parquet block 3
	17.05 minutes
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-10-2fadd2a47023> in <module>()
     29         if j == i:
     30             writer.close()
---> 31             pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
     32             pfs_dict[i] = pf
     33             break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
     62         self.reader = ParquetReader()
     63         source = _ensure_file(source)
---> 64         self.reader.open(source, metadata=metadata)
     65         self.common_metadata = common_metadata
     66         self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}

And the source code:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()
zcta_file = Path('gaz2016zcta5distancemiles.csv')

pfs_dict = {}
for i in range(17):
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            writer = pq.ParquetWriter(
                f'gaz2016zcta5distancemiles_{i}.parquet', schema=table.schema)

        writer.write_table(table)

        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        if j == i:
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break

        j += 1
{code}

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>        Environment: Ubuntu 16.04
>           Reporter: Kyle Barron
>           Priority: Major
>
> I get an ArrowIOError when reading a specific file that was also written by pyarrow.
> Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> <ipython-input-18-149f11bf68a5> in <module>()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>      62         self.reader = ParquetReader()
>      63         source = _ensure_file(source)
> ---> 64         self.reader.open(source, metadata=metadata)
>      65         self.common_metadata = common_metadata
>      66         self._nested_paths_by_prefix = self._build_nested_paths()
>
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
>
> error.pxi in pyarrow.lib.check_status()
>
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
> Then the basic idea from the [pyarrow Parquet documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] is instantiating the writer class; looping over chunks of the csv and writing them to parquet; then closing the writer object.
>
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
>
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
>
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
>
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
>
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
>
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
>
> writer.close()
> {code}
> Running this python script produces the file {{gaz2016zcta5distancemiles.parquet}}, but just attempting to read the metadata with {{pq.ParquetFile()}} produces the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would complain on import of the csv if the columns in the data were not {{string}}, {{string}}, and {{float64}}, so I think creating the Parquet schema in that way should be fine.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
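The bisection above narrows the failure to somewhere between the output for ~6GB and ~8GB of csv input. One cheap diagnostic worth running on the generated files (a sketch of my own, not part of the original report; `boundary_report` is a hypothetical helper, and the idea that [Errno 22] stems from a file offset overflowing a 32-bit type is an assumption, not a confirmed cause) is to check whether the first unreadable Parquet file is also the first one whose size crosses a 32-bit boundary:

```python
import os

# 32-bit offset boundaries (assumed failure mode, not confirmed in the report)
INT32_MAX_BYTES = 2**31   # 2 GiB, signed 32-bit offset limit
UINT32_MAX_BYTES = 2**32  # 4 GiB, unsigned 32-bit offset limit


def boundary_report(path):
    """Return the file's size and whether it crosses 32-bit size boundaries."""
    size = os.stat(path).st_size
    return {
        'size_bytes': size,
        'over_2GiB': size >= INT32_MAX_BYTES,
        'over_4GiB': size >= UINT32_MAX_BYTES,
    }
```

If the first file that fails to open is also the first one reported over 2 GiB (or 4 GiB), that would point at a 32-bit offset overflow somewhere in the read path rather than corrupt data, and would be useful detail to add to the ticket.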