[
https://issues.apache.org/jira/browse/ARROW-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486115#comment-17486115
]
Weston Pace commented on ARROW-15146:
-------------------------------------
The original post is a little too vague to say for certain what is happening.
As for [~mattcarothers]'s issue (thank you for the reproducible test case), the
problem is that 42 is interpreted as an int64 scalar. When the filtering logic
kicks in, it compares the uint64 array with that int64 scalar and decides to
downcast the uint64 array, which fails for any value above the int64 maximum.
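You can see the inference directly (a minimal sketch, nothing beyond stock
pyarrow assumed):
{code}
import pyarrow as pa

# A bare Python int is inferred as a signed 64-bit scalar.
print(pa.scalar(42).type)                    # int64
# Supplying the type explicitly keeps it unsigned.
print(pa.scalar(42, type=pa.uint64()).type)  # uint64
{code}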
A workaround is:
{code}
import pandas as pd
import pyarrow as pa

df = pd.read_parquet('test.parquet',
                     filters=[('col1', '=', pa.scalar(42, type=pa.uint64()))])
{code}
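For reference, here is a sketch of reproducing the failure end to end (the
file and column names follow the snippet above and are otherwise hypothetical,
not the exact test case from the thread):
{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a uint64 column containing a value above the int64 maximum.
table = pa.table({'col1': pa.array([42, 12120467241726599441], type=pa.uint64())})
pq.write_table(table, 'test.parquet')

# With a plain Python literal the filter value is inferred as int64, the uint64
# column gets downcast for the comparison, and ArrowInvalid is raised:
#   pd.read_parquet('test.parquet', filters=[('col1', '=', 42)])
# The explicitly typed scalar shown above avoids the downcast.
{code}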
I'm not entirely sure if this is a bug or not. The casting logic is pretty
complex as it is. However, preferring to cast literals before casting arrays
might be a reasonable rule (it also leads to better performance).
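To illustrate why that ordering would help (a sketch; the oversized value is
taken from the report below):
{code}
import pyarrow as pa

arr = pa.array([42, 12120467241726599441], type=pa.uint64())

# Downcasting the uint64 array to int64 (roughly what the filter logic does
# today) fails as soon as a value exceeds the int64 range.
try:
    arr.cast(pa.int64())
except pa.ArrowInvalid as e:
    print(e)

# Casting the int64 literal up to uint64 instead is lossless and cheap.
print(pa.scalar(42).cast(pa.uint64()))
{code}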
> ArrowInvalid: Integer value
> ----------------------------
>
> Key: ARROW-15146
> URL: https://issues.apache.org/jira/browse/ARROW-15146
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Affects Versions: 6.0.1
> Environment: Ubuntu 20.04, PyArrow 6.0.1, Python 3.9
> Reporter: mondonomo
> Priority: Major
>
> I've created a parquet db with a uint64 datatype. When reading, some of the
> files raise errors like
> {quote}ArrowInvalid: Integer value 12120467241726599441 not in range: 0 to
> 9223372036854775807
> {quote}