James Porritt created ARROW-1446:
------------------------------------
Summary: Python: Writing more than 2^31 rows from pandas dataframe causes row count overflow error
Key: ARROW-1446
URL: https://issues.apache.org/jira/browse/ARROW-1446
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.6.0
Reporter: James Porritt
I have the following code:
{code}
import pyarrow
import pyarrow.parquet as pq
client = pyarrow.HdfsClient("<host>", <port>, "<user>", driver='libhdfs3')
abc_table = client.read_parquet('<source parquet>', nthreads=16)
abc_df = abc_table.to_pandas()
abc_table = pyarrow.Table.from_pandas(abc_df)
with client.open('<target parquet>', 'wb') as f:
    pq.write_table(abc_table, f)
{code}
<source parquet> contains 2497301128 rows.
During the write, however, I get the following error:
{noformat}
Traceback (most recent call last):
  File "pyarrow_cluster.py", line 29, in <module>
    main()
  File "pyarrow_cluster.py", line 26, in main
    pq.write_table(nmi_table, f)
  File "<home dir>/miniconda2/envs/parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 796, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "_parquet.pyx", line 663, in pyarrow._parquet.ParquetWriter.write_table
  File "error.pxi", line 72, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Written rows: -1797666168 != expected rows: 2497301128 in the current column chunk
{noformat}
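For reference, the negative value in the error is exactly what the expected row count becomes when wrapped to a signed 32-bit integer; a quick check in plain Python (nothing Arrow-specific assumed):

{code}
# Reinterpret the expected row count as a two's-complement signed 32-bit value.
expected = 2497301128
wrapped = (expected + 2**31) % 2**32 - 2**31
print(wrapped)  # -1797666168, matching the "Written rows" value in the error
{code}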
The reported written-row count is exactly the expected row count wrapped around a signed 32-bit integer, which suggests that a 32-bit row counter has overflowed somewhere in the write path.
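A possible workaround, sketched below but not verified against this dataset, is to pass an explicit row_group_size to pq.write_table so that no single row group exceeds 2^31 - 1 rows. The 100-million-row group size is an arbitrary choice, and I am assuming row_group_size in 0.6.0 caps the rows per row group as its name suggests (the traceback shows write_table forwarding it to the writer):

{code}
import pyarrow
import pyarrow.parquet as pq

client = pyarrow.HdfsClient("<host>", <port>, "<user>", driver='libhdfs3')
abc_table = client.read_parquet('<source parquet>', nthreads=16)
abc_df = abc_table.to_pandas()
abc_table = pyarrow.Table.from_pandas(abc_df)

with client.open('<target parquet>', 'wb') as f:
    # Keep each row group well below 2^31 - 1 rows so a 32-bit per-row-group
    # counter cannot overflow; 100 million rows per group is arbitrary.
    pq.write_table(abc_table, f, row_group_size=100 * 1000 * 1000)
{code}

Even if this sidesteps the error, it would only be a workaround; the row counter itself presumably still needs to be 64-bit to handle a table of this size in a single row group.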