goldmedal opened a new issue, #12637:
URL: https://github.com/apache/datafusion/issues/12637
### Describe the bug
While working on #12415, I found the `LIKE` and `ILIKE` behavior differs
between `StringView` and other string types. Given the following data and SQL:
```sql
DataFusion CLI v42.0.0
> create table test_source as values
('Andrew', 'X', 'datafusion📊🔥', '🔥'),
('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
(NULL, 'R', NULL, '🔥');
0 row(s) fetched.
Elapsed 0.028 seconds.
> create table test_basic_operator_utf8view as
select
arrow_cast(column1, 'Utf8View') as ascii_1,
arrow_cast(column2, 'Utf8View') as ascii_2,
arrow_cast(column3, 'Utf8View') as unicode_1,
arrow_cast(column4, 'Utf8View') as unicode_2
from test_source;
0 row(s) fetched.
Elapsed 0.002 seconds.
> create table test_basic_operator_utf8 as
select
arrow_cast(column1, 'Utf8') as ascii_1,
arrow_cast(column2, 'Utf8') as ascii_2,
arrow_cast(column3, 'Utf8') as unicode_1,
arrow_cast(column4, 'Utf8') as unicode_2
from test_source;
0 row(s) fetched.
Elapsed 0.004 seconds.
-- StringView Table ( pattern contain % )
> select ascii_1, unicode_1,
ascii_1 like 'An%' as ascii_like,
unicode_1 like '%ion数据%' as unicode_like,
ascii_1 ilike 'An%' as ascii_ilike,
unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1 | unicode_1 | ascii_like | unicode_like | ascii_ilike
| unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew | datafusion📊🔥 | true | false | true |
false |
| Xiangpeng | datafusion数据融合 | false | true | false |
true |
| Raphael | datafusionДатаФусион | false | false | false
| false |
| | | false | false | false
| |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
-- StringView Table ( pattern without % )
> select ascii_1, unicode_1,
ascii_1 like 'An' as ascii_like,
unicode_1 like '%ion数据' as unicode_like,
ascii_1 ilike 'An' as ascii_ilike,
unicode_1 ilike 'ion数据' as unicode_ilik
from test_basic_operator_utf8view;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1 | unicode_1 | ascii_like | unicode_like | ascii_ilike
| unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew | datafusion📊🔥 | false | false | false |
false |
| Xiangpeng | datafusion数据融合 | false | false | false |
false |
| Raphael | datafusionДатаФусион | false | false | false
| false |
| | | | false |
| |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
-- String Table (same as LargeString and DictionaryString)
> select ascii_1, unicode_1,
ascii_1 like 'An%' as ascii_like,
unicode_1 like '%ion数据%' as unicode_like,
ascii_1 ilike 'An%' as ascii_ilike,
unicode_1 ilike '%ion数据%' as unicode_ilik
from test_basic_operator_utf8;
+-----------+----------------------+------------+--------------+-------------+--------------+
| ascii_1 | unicode_1 | ascii_like | unicode_like | ascii_ilike
| unicode_ilik |
+-----------+----------------------+------------+--------------+-------------+--------------+
| Andrew | datafusion📊🔥 | true | false | true |
false |
| Xiangpeng | datafusion数据融合 | false | true | false |
true |
| Raphael | datafusionДатаФусион | false | false | false
| false |
| | | | |
| |
+-----------+----------------------+------------+--------------+-------------+--------------+
4 row(s) fetched.
Elapsed 0.004 seconds.
```
When the input value is NULL, string type will return `NULL` but string view
will return false. (Something is interesting about the ILIKE operation is
different between ASCII-only and Unicode `StringView` 🤔 )
## Some testing for StringView ScalarValue
When the matching pattern contains `%`, it will return `false` instead
`null`.
```sql
> select
arrow_cast(null, 'Utf8View') like '123' as utf8view_like_ascii,
arrow_cast(null, 'Utf8View') ilike '123' as utf8view_ilike_ascii,
arrow_cast(null, 'Utf8View') like '123%' as "utf8view_like_ascii%",
arrow_cast(null, 'Utf8View') ilike '123%' as "utf8view_ilike_ascii%";
+---------------------+----------------------+----------------------+-----------------------+
| utf8view_like_ascii | utf8view_ilike_ascii | utf8view_like_ascii% |
utf8view_ilike_ascii% |
+---------------------+----------------------+----------------------+-----------------------+
| | | false | false
|
+---------------------+----------------------+----------------------+-----------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select
arrow_cast(null, 'Utf8View') like '數據' as "utf8view_like_unicode",
arrow_cast(null, 'Utf8View') ilike '數據' as "utf8view_ilike_unicode",
arrow_cast(null, 'Utf8View') like '數據%' as "utf8view_like_unicode%",
arrow_cast(null, 'Utf8View') ilike '數據%' as "utf8view_ilike_unicode%";
+-----------------------+------------------------+------------------------+-------------------------+
| utf8view_like_unicode | utf8view_ilike_unicode | utf8view_like_unicode% |
utf8view_ilike_unicode% |
+-----------------------+------------------------+------------------------+-------------------------+
| | | false |
|
+-----------------------+------------------------+------------------------+-------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
```
## Some testing for String ScalarValue (Same as LargeString and
DictionaryString)
```sql
> select
arrow_cast(null, 'Utf8') like '123' as utf8_like_ascii,
arrow_cast(null, 'Utf8') ilike '123' as utf8_ilike_ascii,
arrow_cast(null, 'Utf8') like '123%' as "utf8_like_ascii%",
arrow_cast(null, 'Utf8') ilike '123%' as "utf8_ilike_ascii%";
+-----------------+------------------+------------------+-------------------+
| utf8_like_ascii | utf8_ilike_ascii | utf8_like_ascii% | utf8_ilike_ascii% |
+-----------------+------------------+------------------+-------------------+
| | | | |
+-----------------+------------------+------------------+-------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select
arrow_cast(null, 'Utf8') like '數據' as "utf8_like_unicode",
arrow_cast(null, 'Utf8') ilike '數據' as "utf8_ilike_unicode",
arrow_cast(null, 'Utf8') like '數據%' as "utf8_like_unicode%",
arrow_cast(null, 'Utf8') ilike '數據%' as "utf8_ilike_unicode%";
+-------------------+--------------------+--------------------+---------------------+
| utf8_like_unicode | utf8_ilike_unicode | utf8_like_unicode% |
utf8_ilike_unicode% |
+-------------------+--------------------+--------------------+---------------------+
| | | |
|
+-------------------+--------------------+--------------------+---------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
````
### To Reproduce
Run the SQLs mentioned above.
### Expected behavior
I'm not really sure if the behavior of StringView is expected 🤔 but I think
their behavior should be consistent.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]