[ https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849483#comment-16849483 ]

Joris Van den Bossche commented on ARROW-5430:
----------------------------------------------

Thanks for the report! The error is not actually related to the Parquet code 
itself, but to pyarrow failing to convert the large integers into a pyarrow 
Array.
A smaller example that reproduces the issue:

{code:python}
In [21]: pa.array([14989096668145380166, 15869664087396458664])
...
ArrowException: Unknown error: Python int too large to convert to C long

In [22]: pa.array([14989096668145380166, 15869664087396458664], type=pa.uint64())
Out[22]: 
<pyarrow.lib.UInt64Array object at 0x7fab58d28cc8>
[
  -3457647405564171450,
  -2577079986313092952
]
{code}

So when the type is specified explicitly, pyarrow converts the values 
correctly, but there is apparently no automatic inference of uint64 yet.
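Note that the failure is specific to type inference from plain Python ints. 
When the values arrive as a NumPy array, pyarrow takes the Arrow type from the 
dtype instead of inferring it, so large unsigned values convert fine (a 
minimal sketch):

{code:python}
import numpy as np
import pyarrow as pa

# The uint64 dtype on the NumPy array tells pyarrow the target type,
# so no inference is needed and the large values convert correctly.
vals = np.array([14989096668145380166, 15869664087396458664], dtype=np.uint64)
pa.array(vals)  # -> <pyarrow.lib.UInt64Array>
{code}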

I think a patch that tries uint64 when the integers are too big for int64 
would certainly be welcome!
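For illustration, the fallback could look roughly like this on the Python 
side (a sketch only: the real inference lives in the C++ code, and the helper 
name and exception types caught here are assumptions):

{code:python}
import pyarrow as pa

def array_with_uint64_fallback(values):
    # Hypothetical helper: try the default int64 inference first, and
    # fall back to uint64 when the values overflow a signed 64-bit int.
    try:
        return pa.array(values)
    except (pa.lib.ArrowException, OverflowError):
        return pa.array(values, type=pa.uint64())

array_with_uint64_fallback([14989096668145380166, 15869664087396458664])
{code}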

> [Python] Can read but not write parquet partitioned on large ints
> -----------------------------------------------------------------
>
>                 Key: ARROW-5430
>                 URL: https://issues.apache.org/jira/browse/ARROW-5430
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>            Reporter: Robin Kåveland
>            Priority: Minor
>              Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:python}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But it cannot be read back
> pd.read_parquet('can_write.parq')
> {code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
>     **kwargs).to_pandas()
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long
> {code}
> I set the priority to minor because it's easy enough to work around this in 
> user code unless you really need the 64-bit hash (and you probably 
> shouldn't be partitioning on that anyway).
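> For example, one user-side workaround (a sketch, not from the original 
> report; it assumes a string representation of the hash is acceptable) is to 
> cast the partition column before writing:
> {code:python}
> import numpy as np
> import pandas as pd
> 
> usernames = pd.util.hash_array(np.array(['anonymize', 'me']))
> df = pd.DataFrame({'user': usernames, 'logins': [13, 9]})
> # Strings round-trip through partitioned parquet without uint64 inference.
> df['user'] = df['user'].astype(str)
> df.to_parquet('can_write.parq', partition_cols=['user'])
> pd.read_parquet('can_write.parq')  # succeeds
> {code}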
> I could take a stab at writing a patch for this if there's interest.


