[
https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-5430:
-----------------------------------------
Labels: parquet (was: )
> [Python] Can read but not write parquet partitioned on large ints
> -----------------------------------------------------------------
>
> Key: ARROW-5430
> URL: https://issues.apache.org/jira/browse/ARROW-5430
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
> Reporter: Robin Kåveland
> Priority: Minor
> Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])  # the write succeeds
> # But the read fails:
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
> * Either the write fails
> * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
>     **kwargs).to_pandas()
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long{code}
> I set the priority to Minor because it's easy enough to work around this in user code, unless you really need the 64-bit hash (and you probably shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)