alamb commented on PR #19819: URL: https://github.com/apache/datafusion/pull/19819#issuecomment-3756314806
> I guess ClickHouse does coerce them? So is this just a case of our semantics != ClickHouse?

The problem is only with the ClickHouse `hits_partitioned` dataset (100 parquet files), which was written by an ancient version of pyarrow that did not correctly annotate the string columns as STRING.

This is the only use case for this config: https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.binary_as_string | false | (reading) If true, parquet reader will read columns of Binary/LargeBinary with Utf8, and BinaryView with Utf8View. Parquet files generated by some legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead.
-- | -- | --

So TLDR I don't think we need to make `split_part` work with binary data, as proposed in this PR.

Instead I propose:
1. Change the internal error to a proper error when running `split_part` on binary data
2. Update the benchmarks to properly set binary as string.

I made a PR here: https://github.com/apache/datafusion/pull/19835
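To make the two proposed steps concrete, here is a minimal toy model in plain Python. It is not DataFusion code: `read_column` and this `split_part` are hypothetical stand-ins, and the index handling is simplified to positive, 1-based positions. It only sketches the idea that binary input gets a clear user-facing error, while decoding the legacy columns as UTF-8 at read time (what `binary_as_string` is described as doing) lets the string function work unchanged.

```python
def read_column(raw_values, binary_as_string=False):
    """Hypothetical stand-in for a Parquet reader (not DataFusion's reader).

    With binary_as_string=True, un-annotated byte columns are decoded as UTF-8.
    """
    if binary_as_string:
        return [v.decode("utf-8") for v in raw_values]
    return list(raw_values)


def split_part(value, delimiter, n):
    """Toy model of SQL split_part with a positive, 1-based index."""
    if isinstance(value, (bytes, bytearray)):
        # Proposal item 1: a proper error instead of an internal error.
        raise TypeError(
            "split_part does not support binary input; "
            "read the column as a string or cast it first"
        )
    if n < 1:
        raise ValueError("this sketch only handles n >= 1")
    parts = value.split(delimiter)
    # Out-of-range positions yield an empty string in this sketch.
    return parts[n - 1] if n <= len(parts) else ""


# Bytes as a legacy writer might have stored a URL column (made-up sample data).
raw = [b"http://example.com/a", b"https://example.org/b"]

# Proposal item 2: read with binary_as_string enabled, then split as usual.
urls = read_column(raw, binary_as_string=True)
hosts = [split_part(u, "/", 3) for u in urls]
```

In this sketch `hosts` holds the third `/`-separated field of each URL, while calling `split_part` on the undecoded bytes raises `TypeError`.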
