James Porritt created ARROW-1446:
------------------------------------

             Summary: Python: Writing more than 2^31 rows from pandas dataframe causes row count overflow error
                 Key: ARROW-1446
                 URL: https://issues.apache.org/jira/browse/ARROW-1446
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.6.0
            Reporter: James Porritt
I have the following code:

{code}
import pyarrow
import pyarrow.parquet as pq

client = pyarrow.HdfsClient("<host>", <port>, "<user>", driver='libhdfs3')

abc_table = client.read_parquet('<source parquet>', nthreads=16)
abc_df = abc_table.to_pandas()
abc_table = pyarrow.Table.from_pandas(abc_df)

with client.open('<target parquet>', 'wb') as f:
    pq.write_table(abc_table, f)
{code}

<source parquet> contains 2497301128 rows. During the write, however, I get the following error:

{noformat}
Traceback (most recent call last):
  File "pyarrow_cluster.py", line 29, in <module>
    main()
  File "pyarrow_cluster.py", line 26, in main
    pq.write_table(nmi_table, f)
  File "<home dir>/miniconda2/envs/parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 796, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "_parquet.pyx", line 663, in pyarrow._parquet.ParquetWriter.write_table
  File "error.pxi", line 72, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Written rows: -1797666168 != expected rows: 2497301128in the current column chunk
{noformat}

The reported number of written rows suggests that a 32-bit signed integer has overflowed.
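For reference, the reported count is exactly what the expected row count becomes after wrapping around a signed 32-bit integer: 2497301128 - 2^32 = -1797666168. A quick check in plain Python (no Arrow needed):

{code}
expected = 2497301128                         # rows in <source parquet>

# Simulate truncation to a signed 32-bit integer.
wrapped = (expected + 2**31) % 2**32 - 2**31

print(wrapped)                                # -1797666168, matching the error message
{code}

A possible workaround, not verified here, might be to cap the row group size so that no single column chunk accounts for more than 2^31 rows; pq.write_table already accepts a row_group_size argument, as the traceback above shows. The value below is purely illustrative, and the sketch continues from the reporter's snippet (client and abc_table as defined there):

{code}
import pyarrow.parquet as pq

# Hypothetical mitigation: split the write into smaller row groups so the
# per-column-chunk row counter stays well below 2**31.
with client.open('<target parquet>', 'wb') as f:
    pq.write_table(abc_table, f, row_group_size=100000000)
{code}

Whether this actually avoids the error depends on where the 32-bit counter lives; the underlying fix is presumably to track the written row count with a 64-bit integer.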