ozgrakkurt commented on PR #6821:
URL:
https://github.com/apache/arrow-datafusion/pull/6821#issuecomment-1616769515
> Thank you thank you @ozgrakkurt
>
> # Something is not quite right
> I tried playing around with these functions in datafusion-cli and I got some strange results
>
> Like I expect this to return `'foo'` but it did not (I see one of your tests uses `arrow_cast` maybe?):
>
> ```
> ❯ select decode(encode('foo', 'hex'), 'hex');
> +-----------------------------------------------------+
> | decode(encode(Utf8("foo"),Utf8("hex")),Utf8("hex")) |
> +-----------------------------------------------------+
> | 666f6f |
> +-----------------------------------------------------+
> 1 row in set. Query took 0.003 seconds.
> ```
>
> I also don't think the code that takes `ColumnarValue::Array` is covered by tests
>
> Thus I think this PR needs more tests, ideally via [sqllogictests](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/README.md) -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt
>
> I think that would make the behavior of these functions clearer.
>
> Cases needing coverage
>
> 1. Scalar input (of the string/largestring/binary/largebinary) -- real values, empty `''` string, and NULL
> 2. Table input for each of the types (maybe a single table with different columns?)
> 3. Both `hex` and `base64` arguments
>
> # Need update to the feature flags
> Note that when I initially tried to test it out, I got the following error (I think because the main datafusion feature flags were not updated)
>
> ```shell
> DataFusion CLI v27.0.0
> ❯ select encode('foo', 'base64');
> Optimizer rule 'simplify_expressions' failed
> caused by
> Internal error: function encode requires compilation with feature flag: encoding_expressions.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
> ❯
> ```
>
> I had to add this to Cargo:
>
> ```diff
> diff --git a/datafusion/core/Cargo.toml b/datafusion/core/Cargo.toml
> index 468219470..4e0e2fde5 100644
> --- a/datafusion/core/Cargo.toml
> +++ b/datafusion/core/Cargo.toml
> @@ -38,10 +38,11 @@ path = "src/lib.rs"
> avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
> compression = ["xz2", "bzip2", "flate2", "zstd", "async-compression"]
> crypto_expressions = ["datafusion-physical-expr/crypto_expressions", "datafusion-optimizer/crypto_expressions"]
> -default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "compression"]
> +default = ["crypto_expressions", "encoding__expressions", "regex_expressions", "unicode_expressions", "compression"]
> # Enables support for non-scalar, binary operations on dictionaries
> # Note: this results in significant additional codegen
> dictionary_expressions = ["datafusion-physical-expr/dictionary_expressions", "datafusion-optimizer/dictionary_expressions"]
> +encoding__expressions = ["datafusion-physical-expr/encoding_expressions"]
> # Used for testing ONLY: causes all values to hash to the same value (test for collisions)
> force_hash_collisions = []
> pyarrow = ["datafusion-common/pyarrow"]
> ```
>
> Once I did that and tested it out it worked great:
>
> ```
> ❯ create table foo as values ('foo', 'base64');
> 0 rows in set. Query took 0.002 seconds.
> ❯ select * from foo;
> +---------+---------+
> | column1 | column2 |
> +---------+---------+
> | foo | base64 |
> +---------+---------+
> 1 row in set. Query took 0.002 seconds.
> ❯ select encode(column1, 'base64') from foo;
> +------------------------------------+
> | encode(foo.column1,Utf8("base64")) |
> +------------------------------------+
> | Zm9v |
> +------------------------------------+
> 1 row in set. Query took 0.003 seconds.
> ❯ select encode(column1, column2) from foo;
> Internal error: Encode using dynamically decided method is not yet supported. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
> ❯ select encode(column1, 'hex') from foo;
> +---------------------------------+
> | encode(foo.column1,Utf8("hex")) |
> +---------------------------------+
> | 666f6f |
> +---------------------------------+
> ```

All done, except I couldn't find how to force a value to be LargeBinary or LargeUtf8 in sqllogictest. Can you help me with this?
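
One possible direction, offered only as an unverified sketch: the `arrow_cast` function mentioned in the review takes a value and an Arrow type name as a string, so casts like the ones below might produce the large variants. The type-name strings and the direct string-to-`LargeBinary` cast are assumptions that have not been checked against this DataFusion version.

```sql
-- Assumption: arrow_cast accepts 'LargeUtf8' as a target type name.
select arrow_typeof(arrow_cast('foo', 'LargeUtf8'));

-- Assumption: a string literal can be cast directly to LargeBinary;
-- if not, an intermediate cast may be needed.
select arrow_typeof(arrow_cast('foo', 'LargeBinary'));

-- If the casts work, the large-typed value could be fed to the new function.
select encode(arrow_cast('foo', 'LargeUtf8'), 'hex');
```

If `arrow_typeof` reports the expected types, the same expressions could be reused in the sqllogictest file to cover the large-type cases.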