[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422572#comment-16422572 ]

Kyle Barron commented on ARROW-2372:
------------------------------------

I edited my code to the script below. It writes a Parquet file from just the 
first 2 GB CSV chunk, then from the first two chunks, and so on, each time 
checking that the output can be reopened. Here's the traceback first: the 
Parquet file covering about 6 GB of CSV data opened fine, but the one covering 
about 8 GB did not.
{code:java}
Starting conversion, up to iteration 0
        0.12 minutes
Finished reading csv block 0
        0.43 minutes
Finished writing parquet block 0
        1.80 minutes
Starting conversion, up to iteration 1
        1.80 minutes
Finished reading csv block 0
        2.12 minutes
Finished writing parquet block 0
        3.49 minutes
Finished reading csv block 1
        3.80 minutes
Finished writing parquet block 1
        5.19 minutes
Starting conversion, up to iteration 2
        5.20 minutes
Finished reading csv block 0
        5.52 minutes
Finished writing parquet block 0
        6.91 minutes
Finished reading csv block 1
        7.22 minutes
Finished writing parquet block 1
        8.59 minutes
Finished reading csv block 2
        8.92 minutes
Finished writing parquet block 2
        10.29 minutes
Starting conversion, up to iteration 3
        10.29 minutes
Finished reading csv block 0
        10.60 minutes
Finished writing parquet block 0
        11.98 minutes
Finished reading csv block 1
        12.30 minutes
Finished writing parquet block 1
        13.66 minutes
Finished reading csv block 2
        13.98 minutes
Finished writing parquet block 2
        15.35 minutes
Finished reading csv block 3
        15.68 minutes
Finished writing parquet block 3
        17.05 minutes
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-10-2fadd2a47023> in <module>()
     29         if j == i:
     30             writer.close()
---> 31             pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
     32             pfs_dict[i] = pf
     33             break

~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
     62         self.reader = ParquetReader()
     63         source = _ensure_file(source)
---> 64         self.reader.open(source, metadata=metadata)
     65         self.common_metadata = common_metadata
     66         self._nested_paths_by_prefix = self._build_nested_paths()

_parquet.pyx in pyarrow._parquet.ParquetReader.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
{code}
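One quick check worth running here (a sketch, not yet run; the file names match the outputs of the script below) is to print the on-disk size of each chunked output, to see whether the first failing file coincides with crossing a size boundary such as 2**32 bytes:
{code:python}
import os

# Print the on-disk size of each chunked output that exists, to see
# whether the failure lines up with a size threshold (e.g. 2**32 bytes).
for i in range(4):
    path = f'gaz2016zcta5distancemiles_{i}.parquet'
    if os.path.exists(path):
        print(path, os.path.getsize(path))
{code}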
And the source code:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from time import time

t0 = time()

zcta_file = Path('gaz2016zcta5distancemiles.csv')

pfs_dict = {}

for i in range(17):
    itr = pd.read_csv(
        zcta_file,
        header=0,
        dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
        engine='c',
        chunksize=64617153)  # previously determined to be about 2GB of csv data

    msg = f'Starting conversion, up to iteration {i}'
    msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
    print(msg)

    j = 0
    for df in itr:
        msg = f'Finished reading csv block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
        if j == 0:
            writer = pq.ParquetWriter(
                f'gaz2016zcta5distancemiles_{i}.parquet',
                schema=table.schema)

        writer.write_table(table)

        msg = f'Finished writing parquet block {j}'
        msg += f'\n\t{(time() - t0) / 60:.2f} minutes'
        print(msg)

        if j == i:
            writer.close()
            pf = pq.ParquetFile(f'gaz2016zcta5distancemiles_{i}.parquet')
            pfs_dict[i] = pf
            break

        j += 1
{code}
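A follow-up experiment I haven't run yet (a sketch): pass an already-open binary file object to `pq.ParquetFile` instead of a path. The traceback shows the source goes through `_ensure_file` either way, so if this fails with the same `[Errno 22]`, the problem is likely in the read path itself rather than in how the path is opened:
{code:python}
import pyarrow.parquet as pq

# Untested sketch: open the failing file ourselves and hand the file
# object to ParquetFile, to rule out path handling as the culprit.
with open('gaz2016zcta5distancemiles_3.parquet', 'rb') as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)
{code}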

> ArrowIOError: Invalid argument
> ------------------------------
>
>                 Key: ARROW-2372
>                 URL: https://issues.apache.org/jira/browse/ARROW-2372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Ubuntu 16.04
>            Reporter: Kyle Barron
>            Priority: Major
>
> I get an ArrowIOError when reading a specific file that was also written by 
> pyarrow. Specifically, the traceback is:
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> <ipython-input-18-149f11bf68a5> in <module>()
> ----> 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet')
>
> ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata)
>      62         self.reader = ParquetReader()
>      63         source = _ensure_file(source)
> ---> 64         self.reader.open(source, metadata=metadata)
>      65         self.common_metadata = common_metadata
>      66         self._nested_paths_by_prefix = self._build_nested_paths()
>
> _parquet.pyx in pyarrow._parquet.ParquetReader.open()
>
> error.pxi in pyarrow.lib.check_status()
>
> ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Here's a reproducible example with the specific file I'm working with. I'm 
> converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get 
> the source data:
> {code:bash}
> wget https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip
> unzip gaz2016zcta5distancemiles.csv.zip
> {code}
> Then the basic idea, from the [pyarrow Parquet 
> documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing],
>  is to instantiate the writer class, loop over chunks of the CSV writing 
> each one to Parquet, and then close the writer object.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pathlib import Path
> zcta_file = Path('gaz2016zcta5distancemiles.csv')
> itr = pd.read_csv(
>     zcta_file,
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     engine='c',
>     chunksize=64617153)
> schema = pa.schema([
>     pa.field('zip1', pa.string()),
>     pa.field('zip2', pa.string()),
>     pa.field('mi_to_zcta5', pa.float64())])
> writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema)
> print(f'Starting conversion')
> i = 0
> for df in itr:
>     i += 1
>     print(f'Finished reading csv block {i}')
>     table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3)
>     writer.write_table(table)
>     print(f'Finished writing parquet block {i}')
> writer.close()
> {code}
> Then running this python script produces the file 
> `gaz2016zcta5distancemiles.parquet`, but just attempting to read the 
> metadata with `pq.ParquetFile()` raises the above exception.
> I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would 
> complain when reading the CSV if the columns in the data were not `string`, 
> `string`, and `float64`, so I think declaring the Parquet schema that way 
> should be fine.
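> As a quick sanity check on that assumption (a sketch), one can read a small 
> chunk and compare the dtypes pandas produces against the declared schema:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # Read a tiny chunk and confirm the resulting column types match the
> # declared schema (string, string, float64).
> df = next(pd.read_csv(
>     'gaz2016zcta5distancemiles.csv',
>     header=0,
>     dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64},
>     chunksize=1000))
> print(df.dtypes)
> print(pa.Table.from_pandas(df, preserve_index=False).schema)
> {code}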


