alamb opened a new issue, #23086:
URL: https://github.com/apache/datafusion/issues/23086

   ### Is your feature request related to a problem or challenge?
   
   DataFusion's ClickBench benchmark queries currently use `length(...)` for 
string-length aggregations:
   
   - 
[`benchmarks/queries/clickbench/queries/q27.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries/q27.sql)
   - 
[`benchmarks/queries/clickbench/queries/q28.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries/q28.sql)
   - 
[`benchmarks/queries/clickbench/extended/q9.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/extended/q9.sql)
   - 
[`benchmarks/queries/clickbench/extended/q10.sql`](https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/extended/q10.sql)
   
   In DataFusion, 
[`length`](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#length)
 is an alias of 
[`character_length`](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#character_length),
 which returns the number of characters in a string. DataFusion's 
[`octet_length`](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#octet_length)
 returns the length of a string in bytes.
   
   This differs from the benchmark semantics used by ClickHouse and DuckDB in 
ClickBench:
   
   - The upstream ClickBench ClickHouse queries use `length(URL)` / 
`length(Referer)` in Q27 and Q28: 
[`ClickHouse/ClickBench/clickhouse/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/queries.sql#L28-L29).
 In ClickHouse, `length` is byte-oriented; the ClickHouse string function docs 
distinguish this from 
[`lengthUTF8`](https://clickhouse.com/docs/sql-reference/functions/string-functions#lengthutf8),
 which returns Unicode code points rather than bytes.
   - The upstream ClickBench DuckDB queries use `STRLEN(URL)` / 
`STRLEN(Referer)` in Q27 and Q28: 
[`ClickHouse/ClickBench/duckdb/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/duckdb/queries.sql#L28-L29),
 
[`duckdb-parquet/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L28-L29),
 and 
[`duckdb-parquet-partitioned/queries.sql`](https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet-partitioned/queries.sql#L28-L29).
 DuckDB documents 
[`strlen(string)`](https://duckdb.org/docs/stable/sql/functions/text#strlenstring)
 as returning the number of bytes in a string, while 
[`length(string)`](https://duckdb.org/docs/stable/sql/functions/text#lengthstring)
 returns the number of characters.
   
   Because byte length can be computed from string offsets, while character 
length must inspect UTF-8 contents, using `octet_length` in these benchmark 
queries may avoid unnecessary work and improve ClickBench performance without 
changing the intended benchmark semantics.
   
   ### Describe the solution you'd like
   
   Update DataFusion's ClickBench SQL benchmark queries to use 
`octet_length(...)` where the upstream ClickBench / DuckDB benchmark is 
measuring byte length.
   
   Likely changes:
   
   - `AVG(length("URL"))` -> `AVG(octet_length("URL"))` in Q27
   - `AVG(length("Referer"))` -> `AVG(octet_length("Referer"))` in Q28
   - Review the extended ClickBench Q9/Q10 `LENGTH(...)` usages and switch to 
`octet_length(...)` if those queries should also preserve ClickHouse 
byte-length semantics.
   
   After changing the benchmark SQL, compare ClickBench timings before and 
after, especially Q27 and Q28.
   
   ### Describe alternatives you've considered
   
   Leave the queries as-is. That preserves current DataFusion SQL semantics, 
but likely benchmarks character-counting work that ClickHouse and DuckDB are 
not doing for Q27 and Q28 in ClickBench.
   
   Add a ClickHouse compatibility alias where `length` means byte length. I do 
not think this is appropriate globally because DataFusion currently documents 
`length` as an alias of `character_length`.
   
   ### Additional context
   
   Related ClickBench tracking:
   
   - #18489
   - #21706
   - #22633
   - #22804
   - #22807
   
   Suggested update to #18489 checklist:
   
   ```markdown
   - [ ] Use `octet_length` instead of `length` in ClickBench SQL where 
upstream benchmark semantics are byte length: <new issue URL>
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to