[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314989#comment-17314989 ]
Jonathan Keane commented on ARROW-12114:
----------------------------------------
Ok, I've dug a bit more, and the arrowbench error is the opposite case: we
wrote the query with a numeric literal, but the column is actually a string
(the code and the column's values are below).
I agree that implicitly casting a string to an integer (or the reverse) is
surprising, but it is something that R does, and I think some of our R users
would expect this to keep working as it has so far. IIRC there was an R hack
for autocasting that we could re-implement for cases like this, or we could
improve the error message so it's clearer what's going on.
The R query against the NYC taxi dataset:
{code}
> result <- ds %>%
+   filter(payment_type == 3) %>%
+   select(year, month, passenger_count) %>%
+   group_by(year, month) %>%
+   collect() %>%
+   summarize(
+     total_passengers = sum(passenger_count, na.rm = TRUE),
+     n = n()
+   )
Error: NotImplemented: Function equal has no kernel matching input types (array[string], scalar[double])
{code}
Values of the payment_type column:
{code}
 [1] "CASH" "Credit" "CREDIT" "Cash" "No Charge" "Dispute" "CAS" "Cre"
 [9] "CRE" "Cas" "No " "Dis" "NA " "CRD" "CSH" "NOC"
[17] "DIS" "UNK" "1" "2" "3" "4" "5"
{code}
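To make the two options above concrete (autocast the literal, or fail with a clearer message), here is a rough sketch in plain Python. It is only an illustration of the idea, not Arrow's kernel dispatch and not the earlier R hack: the names {{coerce_literal}} and {{filter_equal}} are hypothetical, and the column is modelled as a plain list with a Python type standing in for the Arrow type.

```python
def coerce_literal(literal, column_type):
    """Cast a filter literal to the column's type, or fail with a clear message.

    Hypothetical helper for illustration; column_type is a Python type
    (str, int, ...) standing in for the column's Arrow type.
    """
    if isinstance(literal, column_type):
        return literal
    original = literal
    # R numeric literals are doubles, so treat whole-valued doubles as
    # integers: 3 should match the string "3", not "3.0".
    if isinstance(literal, float) and literal.is_integer():
        literal = int(literal)
    try:
        if column_type is int and isinstance(literal, float):
            # Refuse to silently truncate 3.5 to 3.
            raise ValueError("fractional value")
        return column_type(literal)
    except (TypeError, ValueError) as exc:
        raise TypeError(
            f"cannot compare a {column_type.__name__} column with "
            f"{type(original).__name__} literal {original!r}; "
            "cast one side explicitly"
        ) from exc


def filter_equal(values, literal, column_type):
    """Keep the values equal to the literal after coercing it (hypothetical)."""
    target = coerce_literal(literal, column_type)
    return [v for v in values if v == target]


# A few of the payment_type values, filtered with the numeric literal 3.
payment_type = ["CASH", "Credit", "3", "UNK", "3"]
matches = filter_equal(payment_type, 3.0, str)
```

With this sketch the numeric literal matches the string "3" entries, while a literal that cannot be cast (say, "CASH" against an integer column) raises a TypeError that names both types instead of a kernel-matching error.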
> [C++] Dataset to table filter expression API change
> ---------------------------------------------------
>
> Key: ARROW-12114
> URL: https://issues.apache.org/jira/browse/ARROW-12114
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Diana Clarke
> Assignee: Ben Kietzman
> Priority: Major
>
> Ben:
> Can you please confirm that we're aware and okay with the following API
> change? Thanks!
> {code}
> import pyarrow.dataset
> import pyarrow.fs
> path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
> paths = [
>     f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
>     for year in range(2009, 2020)
>     for month in range(1, 13)
>     for part in range(101)
>     if not (year == 2019 and month > 6)  # Data ends in 2019/06
>     and not (year == 2010 and month == 3)  # Data is missing in 2010/03
> ]
> partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
>     field_names=["year", "month", "part"],
>     infer_dictionary=True,
> )
> s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
> dataset = pyarrow.dataset.dataset(
>     paths,
>     format="parquet",
>     filesystem=s3,
>     partitioning=partitioning,
>     partition_base_dir=path_prefix,
> )
> year = pyarrow.dataset.field("year")
> month = pyarrow.dataset.field("month")
> part = pyarrow.dataset.field("part")
> filter_expr = (year == "2011") & (month == 1) & (part == 2)
> dataset.to_table(filter=filter_expr)
> {code}
> In Arrow 3.0, the above code executes without error.
> On head[1], {{year == "2011"}} (which should have been {{year == 2011}}, no
> quotes) raises the following exception:
> {code}
> pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
> {code}
> This API change appears to have been introduced in ARROW-8919. Perhaps it was
> intentional; I just figured we should double check. Thanks again!
> [1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}}