mingmwang opened a new issue #1688:
URL: https://github.com/apache/arrow-datafusion/issues/1688
**Describe the bug**
Recently I did some test on DataFusion's String related build-in functions.
I found there are lots of gaps compared with Spark SQL's (3.x) result. Some of
them maybe bug in Spark SQL but I believe most of them are bugs in DataFusion.
**To Reproduce**
Please see the details below:
----------------------------
upper()
SELECT upper('Haßler');
Spark returns "HASSLER"
DataFusion returns "HAßLER"
----------------------------
array()
SELECT array(1, 2, 3);
Spark returns [1,2,3]
DataFusion returns Error: NotImplemented("Array is not implemented for
scalar values.")
----------------------------
ascii()
SELECT ascii('ß');
Spark returns -61
DataFusion returns 223
SELECT ascii('ℝ');
Spark returns -30
DataFusion returns 8477
----------------------------
bit_length()
SELECT bit_length('Spark SQL');
SELECT bit_length('ß');
SELECT bit_length('𠎠');
SELECT bit_length('');
DataFusion results match with Spark.
----------------------------
char() /chr()
Spark supports both char() and chr(). DataFusion only supports chr().
SELECT chr(65);
Result matches
SELECT chr(0);
Spark returns empty result.
DataFusion returns Error: Execution("null character not permitted.")
SELECT chr(-1);
Spark returns empty result.
DataFusion returns Error: Execution("requested character too large for
encoding."
SELECT chr(65.0);
Spark returns 'A'.
DataFusion returns Error: Plan("Coercion from [Float64] to the signature
Uniform(1, [Int64]) failed.")
SELECT chr(10000000000000);
Spark returns empty result.
DataFusion returns Error: Execution("requested character too large for
encoding.")
----------------------------
char_length()
SELECT char_length('Haßler');
Result matches.
SELECT char_length('é');
Result matches.
SELECT char_length('é');
Spark returns 2.
DataFusion returns 1
SELECT char_length('𠎠');
Result matches.
SELECT char_length('𝕆');
Result matches.
SELECT char_length('');
Result matches.
----------------------------
concat()
SELECT concat('Spark', 'SQL', '', '', '');
Result matches.
----------------------------
concat_ws()
SELECT concat_ws('Spark', 'Spark', 'SQL','','','','');
Result matches.
SELECT concat_ws('s');
Spark returns empty result.
DataFusion returns Error: Internal("concat_ws was called with 1 arguments.
It requires at least 2.")
----------------------------
date_part()
DataFusion only support "hour" and "year" as the Date part.
SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
Result matches
SELECT date_part('YEAR', DATE'2019-08-12');
Result matches
SELECT date_part('YEAR', '2019-08-12');
Spark returns 2019.
DataFusion returns
Error: Plan("Coercion from [Utf8, Utf8] to the signature OneOf([Exact([Utf8,
Date32]), Exact([Utf8, Date64]), Exact([Utf8, Timestamp(Second, None)]),
Exact([Utf8, Timestamp(Microsecond, None)]), Exact([Utf8,
Timestamp(Millisecond, None)]), Exact([Utf8, Timestamp(Nanosecond, None)])])
failed.")
SELECT date_part('hour', TIMESTAMP '2019-08-12 01:00:00.123456');
Spark returns 1
DataFusion returns 17
DataFusion doesn't support the following:
SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
SELECT date_part('doy', DATE'2019-08-12');
SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
SELECT date_part('days', interval 5 days 3 hours 7 minutes);
SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1
microseconds);
SELECT date_part('MONTH', INTERVAL '2021-11' YEAR TO MONTH);
SELECT date_part('MINUTE', INTERVAL '123 23:55:59.002001' DAY TO SECOND);
----------------------------
date_trunc()
DataFusion only support lower case granularity like
year/quarter/month/week/day/hour/minute/second.
SELECT date_trunc('year', '2015-03-05T09:32:05.359');
Spark returns '2015-01-01 00:00:00.0'.
DataFusion returns Error: Plan("Coercion from [Utf8, Utf8] to the signature
Exact([Utf8, Timestamp(Nanosecond, None)]) failed.").
SELECT date_trunc('year', timestamp'2015-03-05T09:32:05.359')
Spark returns '2015-01-01 00:00:00.0'.
DataFusion returns '2015-01-01 00:00:00'
DataFusion doesn't support the following:
SELECT date_trunc('MM', '2015-03-05T09:32:05.359');
SELECT date_trunc('DD', '2015-03-05T09:32:05.359');
SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359');
SELECT date_trunc('SECOND', timestamp'2015-03-05T09:32:05.123456');
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456');
----------------------------
decode()
DataFusion doesn't support this function
SELECT decode(encode('abc', 'utf-8'), 'utf-8');
SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4,
'Seattle', 'Non domestic');
SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4,
'Seattle', 'Non domestic');
SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4,
'Seattle');
----------------------------
encode()
DataFusion doesn't support this function
SELECT encode('abc', 'utf-8');
----------------------------
format_string()
DataFusion doesn't support this function
SELECT format_string("Hello World %d %s", 100, "days");
----------------------------
hex()
DataFusion function name is to_hex()
SELECT hex(17);
Return matches
SELECT hex('Spark SQL');
Spark returns '537061726B2053514C'.
DataFusion returns Error: Plan("Coercion from [Utf8] to the signature
Uniform(1, [Int64]) failed.")
----------------------------
instr()
DataFusion doesn't support this function
SELECT instr('SparkSQL', 'SQL');
----------------------------
initcap()
SELECT initcap('sPark sql');
Result matches.
SELECT initcap('ßsPark sql');
Spark returns 'ßspark Sql'.
DataFusion returns 'ßSpark Sq'.
SELECT initcap('ß'), upper('ß');
Spark returns 'ß SS'.
DataFusion returns 'ß ß'.
Looks like Spark's result is inconsistent.
----------------------------
like
SELECT 'Spark' like '_park';
Result matches
SELECT like('Spark', '_park');
DataFusion doesn't support like function
SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\\Users%';
Spark returns true.
DataFusion returns false.
SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
Spark returns true.
DataFusion returns false.
SELECT '%SystemDrive%/Users/John' like '/%SystemDrive/%//Users%' ESCAPE '/';
Spark returns true.
DataFusion returns Error: SQL(ParserError("Expected end of statement, found:
'/'")).
----------------------------
rlike, like any, like all
DataFusion doesn't support those.
----------------------------
left()
SELECT left('Spark SQL', 3);
Result matches.
SELECT left('Spark SQL', 0);
Result matches.
SELECT left('Spark SQL', -20);
Result matches.
SELECT left('Spark SQL', 20);
Result matches.
SELECT left('Spark SQL', -1);
Spark returns empty.
DataFusion returns 'Spark SQ'.
----------------------------
length()
SELECT length('Haßler');
Result matches.
SELECT length('é');
Result matches.
SELECT length('?');
SELECT length('测试测试');
SELECT length('');
Result matches.
SELECT length('é');
Spark returns 2
DataFusion returns 1
----------------------------
lpad()
SELECT lpad('hi', 5, '??');
SELECT lpad('hi', 1, '??');
SELECT lpad('hi', 0, '??');
Result matches.
SELECT lpad('hi', -1, '??');
Spark returns empty
DataFusion thread panic. thread 'main' panicked at 'capacity overflow',
library/alloc/src/raw_vec.rs:509:5
SELECT lpad('hi', 5);
Result matches.
SELECT lpad('hi', 100000, '??');
Spark returns.
DataFusion thread panic. thread 'main' panicked at 'attempt to add with
overflow',
/Users/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-5.0.0/src/utils/mod.rs:44:
----------------------------
locate()
DataFusion doesn't support locate(). But DataFusion has another function
called strpos().
SELECT locate('bar', 'foobarbar');
SELECT locate('bar', 'foobarbar', 5);
----------------------------
ltrim(), btrim(), rtrim()
SELECT ltrim(' SparkSQL ');
SELECT length(ltrim(' SparkSQL '));
SELECT btrim(' SparkSQL ');
SELECT length(btrim(' SparkSQL '));
SELECT rtrim(' SparkSQL ');
SELECT length(rtrim(' SparkSQL '));
Result matches.
----------------------------
md5()
SELECT md5('Spark');
SELECT md5('测试测试');
Result matches.
----------------------------
nullif()
SELECT nullif(2, 2);
Spark returns NULL
DataFusion returns Error: NotImplemented("nullif does not support a literal
as first argument").
----------------------------
octet_length()
SELECT octet_length('Spark SQL');
Result matches.
----------------------------
position()
DataFusion doesn't support position() function.
SELECT position('bar', 'foobarbar');
SELECT position('bar', 'foobarbar', 5);
SELECT POSITION('bar' IN 'foobarbar');
----------------------------
printf()
DataFusion doesn't support position() function.
SELECT printf("Hello World %d %s", 100, "days");
----------------------------
repeat()
SELECT repeat('123', 0);
Result matches.
SELECT repeat('123', -1);
Spark returns empty.
DataFusion thread 'main' panicked at 'capacity overflow',
library/alloc/src/slice.rs:558:50
SELECT repeat('123', 100000);
Spark returns correct reuslt.
DataFusion thread 'main' panicked at 'attempt to add with overflow',
/Users/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-5.0.0/src/utils/mod.rs:44:9
----------------------------
regexp()
DataFusion doesn't support regexp(). (regexp_match() is the funtion name in
DataFusion ??)
SELECT regexp('%SystemDrive%\Users\John', '%SystemDrive%\\Users.*');
----------------------------
regexp_extract()
DataFusion doesn't support regexp_extract().
SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
----------------------------
regexp_replace()
SELECT regexp_replace('100-200', '(\\d+)', 'num');
Spark returns 'num-num'.
DataFusion returns '100-200'
----------------------------
replace()
SELECT replace('ABCabc', 'abc', 'DEF');
SELECT replace('ABABA', 'ABA', 'abA');
Result matches.
----------------------------
reverse()
SELECT reverse('Spark SQL');
Result matches.
----------------------------
substr()/substring()
Spark supports both but DataFusion only supports substr().
SELECT substr('Spark SQL', 5);
Result matches.
SELECT substr('Spark SQL', 5, 1);
Result matches.
SELECT substr('Spark SQL', -3);
Spark returns 'SQL'.
DataFusion returns 'Spark SQL'.
DataFusion doesn't support the following:
SELECT substr('Spark SQL' FROM 5);
SELECT substr('Spark SQL' FROM -3);
SELECT substr('Spark SQL' FROM 5 FOR 1);
----------------------------
split()
DataFusion doesn't support split(). DataFusion supports split_part() but the
behavior is not the same.
SELECT split('oneAtwoBthreeC', '[ABC]');
SELECT split('oneAtwoBthreeC', '[ABC]', -1);
SELECT split('oneAtwoBthreeC', '[ABC]', 2);
----------------------------
trim()
SELECT trim(' SparkSQL ');
Result matches.
SELECT trim(BOTH 'SL' FROM 'SSparkSQLS');
Result matches.
SELECT trim(LEADING 'SL' FROM 'SSparkSQLS');
Result matches.
SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS');
Result matches.
DataFusion doesn't support the following:
SELECT trim(BOTH FROM ' SparkSQL ');
SELECT trim(LEADING FROM ' SparkSQL ');
SELECT trim(TRAILING FROM ' SparkSQL ');
SELECT trim('SL' FROM 'SSparkSQLS');
----------------------------
timestamp()
DataFusion doesn't support the timestamp() function but support the
to_timestamp() function
SELECT timestamp(1230219000123);
Spark returns empty
SELECT to_timestamp(1230219000123);
DataFusion returns '1970-01-01 00:20:30.219000123'.
SELECT timestamp(1230219000);
Spark returns '2008-12-25 08:30:00.0'
SELECT to_timestamp(1230219000);
DataFusion returns 1970-01-01 00:00:01.230219
----------------------------
timestamp_millis()
DataFusion doesn't support the timestamp_millis() function but support the
to_timestamp_millis() function.
SELECT timestamp_millis(1230219000123);
Spark returns '2008-12-25 07:30:00.123'.
SELECT to_timestamp_millis(1230219000123);
DataFusion returns '2008-12-25 15:30:00.123'
----------------------------
timestamp_micros()
DataFusion doesn't support the timestamp_micros() function but support the
to_timestamp_micros() functionf.
SELECT timestamp_micros(1230219000123123);
Spark returns '2008-12-25 07:30:00.123123'
SELECT to_timestamp_micros(1230219000123123);
DataFusion returns '2008-12-25 15:30:00.123123'
----------------------------
timestamp_seconds()
DataFusion doesn't support the timestamp_seconds() function but support the
to_timestamp_seconds() function.
SELECT timestamp_seconds(1230219000);
Spark returns '2008-12-25 07:30:00'.
SELECT to_timestamp_seconds(1230219000);
DataFusion returns '2008-12-25 15:30:00'.
SELECT timestamp_seconds(1230219000.123);
Spark returns '2008-12-25 07:30:00.123'.
SELECT to_timestamp_seconds(1230219000.123);
DataFusion returns Error: ArrowError(CastError("Error parsing
'1230219000.123' as timestamp")).
SELECT to_timestamp_seconds(-100000000000000);
thread 'main' panicked at 'invalid or out-of-range datetime',
/Users/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/datetime.rs:117:18
----------------------------
translate()
SELECT translate('AaBbCc', 'abc', '123');
Result matches.
----------------------------
unbase64()
DataFusion doesn't support this function.
SELECT unbase64('U3BhcmsgU1FM');
----------------------------
unhex()
DataFusion doesn't support this function.
SELECT unhex('537061726B2053514C');
----------------------------
**Expected behavior**
A clear and concise description of what you expected to happen.
**Additional context**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]