alamb commented on PR #19819:
URL: https://github.com/apache/datafusion/pull/19819#issuecomment-3756314806

   > I guess ClickHouse does coerce them? So is this just a case of our 
semantics != ClickHouse?
   
   The problem is specific to the ClickHouse `hits_partitioned` dataset (100 
parquet files), which was written by an ancient version of pyarrow that did 
not correctly annotate the string columns as STRING. This is the only use case 
for this config:
   
   https://datafusion.apache.org/user-guide/configs.html
   
   key | default | description
   -- | -- | --
   datafusion.execution.parquet.binary_as_string | false | (reading) If true, parquet reader will read columns of Binary/LargeBinary with Utf8, and BinaryView with Utf8View. Parquet files generated by some legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead.
   
   
   So TLDR I don't think we need to make `split_part` work with binary data, as 
proposed in this PR.
   
   Instead I propose:
   1. Change the internal error to a proper error when running `split_part` on 
binary data
   2. Update the benchmarks to set `binary_as_string` so these columns are read 
as strings. I made a PR here: https://github.com/apache/datafusion/pull/19835
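
   To make the two proposals concrete, here is a toy Python model, not 
DataFusion code. Both `read_column` and this `split_part` are hypothetical 
stand-ins: the first mimics what `binary_as_string` does at read time, the 
second shows a string function rejecting binary input with a clear error. It 
only handles positive field indexes.

```python
def read_column(raw_values, binary_as_string=False):
    """Hypothetical stand-in for a Parquet reader returning one column.

    With binary_as_string=True, byte values are decoded as UTF-8 text,
    modelling the effect described for the config option above.
    """
    if binary_as_string:
        return [v.decode("utf-8") for v in raw_values]
    return list(raw_values)


def split_part(value, delimiter, n):
    """Toy split_part: return the n-th (1-based) field of a string."""
    if isinstance(value, (bytes, bytearray)):
        # Proposal 1: fail with a clear, user-facing error on binary input
        # instead of an internal error deeper in the function.
        raise TypeError("split_part expects a string argument, got binary data")
    parts = value.split(delimiter)
    # Out-of-range positive indexes yield an empty string in this sketch.
    return parts[n - 1] if 1 <= n <= len(parts) else ""


raw = [b"http://example.com/a"]

# Proposal 2: read the legacy column as strings, then the function works.
urls = read_column(raw, binary_as_string=True)
host = split_part(urls[0], "/", 3)
```

   Without the flag, `read_column(raw)` returns bytes and the same `split_part` 
call raises the `TypeError`, which is the behavior proposal 1 asks for.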
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

