alamb commented on issue #12509:
URL: https://github.com/apache/datafusion/issues/12509#issuecomment-2366775581
I have been thinking about this, and I came up with a third option which is
to "push the casting into the scan"
Consider this plan for q28:
```
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
| plan_type | plan
$
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
| logical_plan | Sort: l DESC NULLS FIRST, fetch=25
$
| | Projection:
regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))
AS k, avg(character_length(hits_partitioned.Referer)) AS l, count(*) AS c,
min(hits_partitioned.Referer)
$
| | Filter: count(*) > Int64(100000)
$
| | Aggregate: groupBy=[[regexp_replace(__common_expr_1
AS hits_partitioned.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"),
Utf8("\1"))]], aggr=[[avg(CAST(character_length(__common_expr_1 AS
hits_partitioned.Referer) AS Float64)), count(Int64(1)) AS count(*),
min(hits_partitioned.Referer)]] $
| | Projection: CAST(hits_partitioned.Referer AS Utf8)
AS __common_expr_1, hits_partitioned.Referer
$
| | Filter: hits_partitioned.Referer !=
BinaryView("")
$
| | TableScan: hits_partitioned
projection=[Referer], partial_filters=[hits_partitioned.Referer !=
BinaryView("")]
```
The
```
Projection: CAST(hits_partitioned.Referer AS Utf8) AS __common_expr_1,
hits_partitioned.Referer
```
Is what is causing a non trivial slowdown.
The issue is that `hits_partitioned.Referer` is read as a `BinaryView`
The problem is that BinaryView --> Utf8View conversion is much slower than
reading Utf8View directly out of the parquet file due to the Utf8 optimization
described by @XiangpengHao in "Section 2.1: From binary to strings" of the
[string view
blog](https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/).
## Option 3: Implement push down casting (maybe as an Analyzer rule??)
The theory here is that some readers( such as the parquet reader) can
produce the data more effiicently in a particular format than creating it first
in one format before datafusion casts it to another.
So the plan above would basically push the cast down so the parquet reader
read the `hits_partitioned.Referer` as a `Utf8View` to begin with
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]