alamb opened a new issue, #15096: URL: https://github.com/apache/datafusion/issues/15096
### Is your feature request related to a problem or challenge?

DataFusion uses Arrow types internally. Thus when planning SQL queries there is a mapping from SQL types to Arrow types.

The current mapping for character types is shown in the docs: https://datafusion.apache.org/user-guide/sql/data_types.html#character-types

| SQL DataType | Arrow DataType |
| ------------ | -------------- |
| `CHAR`       | `Utf8`         |
| `VARCHAR`    | `Utf8`         |
| `TEXT`       | `Utf8`         |
| `STRING`     | `Utf8`         |

So this means that when you do something like `create table foo(x varchar);` the `x` column is `Utf8`:

```sql
DataFusion CLI v46.0.0
> create table foo(x varchar);
0 row(s) fetched.
Elapsed 0.019 seconds.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.008 seconds.
```

When reading parquet files however, a different type, `Utf8View`, is used as it is faster in most cases. This can be seen in this example:

```sql
DataFusion CLI v46.0.0
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | NO          |
| JavaEnable            | Int16     | NO          |
| Title                 | Utf8View  | NO          |
...
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.032 seconds.
```

Thus there is a discrepancy when creating external tables with a schema (`VARCHAR`), as that will use `Utf8` rather than `Utf8View`.

I believe this is the root cause of the issue @zhuqi-lucas filed:
- https://github.com/apache/datafusion/issues/14909

### Describe the solution you'd like

I think we should consider changing the default SQL mapping from `VARCHAR` --> `Utf8View`

### Describe alternatives you've considered

- @zhuqi-lucas has a PR that does this: https://github.com/apache/datafusion/pull/14922

There are a few subtasks required before we can merge it:
- [ ] https://github.com/apache/arrow-rs/issues/7244

### Additional context

You can see some of the history related to using string view / Utf8View here:
- https://github.com/apache/datafusion/issues/11752
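
The `VARCHAR` to `Utf8` mapping described in this issue can also be checked per expression in `datafusion-cli`. The following is only a minimal sketch: it assumes the built-in `arrow_typeof` and `arrow_cast` functions are available in the version in use, and the table name `varchar_demo` and the inserted row are hypothetical, chosen for illustration rather than taken from the report.

```sql
-- Hypothetical table for illustration; VARCHAR goes through the SQL -> Arrow type mapping.
create table varchar_demo(x varchar);
insert into varchar_demo values ('example');

-- Report the Arrow type the planner assigned to the column.
-- Under the current mapping this should match the Utf8 shown by `describe foo` above.
select arrow_typeof(x) from varchar_demo;

-- Until the default mapping changes, an explicit cast is one way to obtain
-- a Utf8View expression from a VARCHAR column (assumption: 'Utf8View' is
-- accepted as a type name by arrow_cast in this version).
select arrow_typeof(arrow_cast(x, 'Utf8View')) from varchar_demo;
```

If the default mapping proposed here is adopted, the first query would be expected to report `Utf8View` without any cast.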
