[ https://issues.apache.org/jira/browse/IMPALA-14096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986246#comment-17986246 ]
ASF subversion and git services commented on IMPALA-14096:
----------------------------------------------------------
Commit 85a2211bfbf8acfdaffdfbc898e0b18a0b4e8df6 in impala's branch
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=85a2211bf ]
IMPALA-10349: Support constant folding for non ascii strings
Before this patch, constant folding converted the result of an
expression to a StringLiteral only if all characters were ASCII. The
change allows both UTF8 strings with non-ASCII characters and
byte arrays that are not valid UTF8 strings. The latter can
occur when constant folding is applied to BINARY columns,
for example in geospatial functions like st_polygon().
The main goal is to be able to push down more predicates. For example,
before this patch a filter like col="á" couldn't be pushed down
to Iceberg/Kudu/Parquet stat filtering, as all of these expect literals.
Main changes:
- TStringLiteral uses a binary instead of a string member.
This doesn't affect the BE, as in C++ both types are compiled
to std::string. In Java, a java.nio.ByteBuffer is used instead of
String.
- StringLiteral uses a byte[] member to store the value of
the literal when it is not valid UTF8 and cannot be
represented as a Java String. In other cases a String is still
used to keep the change minimal, though a UTF8 byte[] may be
more efficient due to its smaller size. Always
converting from byte[] to String may be costly in the catalog,
as partition values are stored as *Literals and the rest of the
catalog operates on String.
- StringLiteral#compareTo() is switched to a byte-wise comparison on byte[]
to be consistent with the BE. This was not needed for ASCII strings,
as Java String behaves the same way in that case, but non-ASCII
strings can be ordered differently (note that Impala does not support
collations).
- When an invalid UTF8 StringLiteral is printed, for example in
EXPLAIN output, it is printed as
unhex("<byte array in hexadecimal>"). This is a lossless way to
represent it, but it may be too verbose in some cases, e.g. for
large polygons. A follow-up commit may refine this, e.g. by
limiting the max size printed.
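
The behaviours in the list above can be illustrated with a small,
self-contained Java sketch. It is not Impala's StringLiteral code: the
class name StringLiteralSketch and its methods are hypothetical, the
single-quote rendering of valid strings and the upper-case hex digits are
arbitrary choices, and the comparison assumes that bytes are compared as
unsigned values (as memcmp does). The example shows one case where Java's
String#compareTo(), which compares UTF-16 code units, disagrees with a
byte-wise comparison of the UTF8 encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Impala's StringLiteral class.
class StringLiteralSketch {
  // Returns the decoded String, or null if the bytes are not valid UTF-8.
  static String decodeUtf8OrNull(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
      return null;
    }
  }

  // Renders valid UTF-8 as a quoted string (escaping omitted for brevity)
  // and anything else as unhex("<hex digits>"), which loses no information.
  static String toSql(byte[] bytes) {
    String decoded = decodeUtf8OrNull(bytes);
    if (decoded != null) return "'" + decoded + "'";
    StringBuilder hex = new StringBuilder();
    for (byte b : bytes) hex.append(String.format("%02X", b & 0xFF));
    return "unhex(\"" + hex + "\")";
  }

  // Byte-wise comparison that treats every byte as unsigned (assumption).
  static int compareBytes(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    byte[] invalid = {(byte) 0xAA};  // a lone continuation byte
    System.out.println(toSql(invalid));

    String bmp = "\uFF5E";                  // U+FF5E,  UTF-8: EF BD 9E
    String supplementary = "\uD83D\uDE00";  // U+1F600, UTF-8: F0 9F 98 80
    // UTF-16 code units: 0xFF5E > 0xD83D, so String order puts bmp last.
    System.out.println(bmp.compareTo(supplementary) > 0);
    // UTF-8 bytes: 0xEF < 0xF0, so byte order puts bmp first.
    System.out.println(compareBytes(
        bmp.getBytes(StandardCharsets.UTF_8),
        supplementary.getBytes(StandardCharsets.UTF_8)) < 0);
  }
}
```

For pure ASCII input both orderings agree, which matches the remark that
the switch was not needed for ASCII strings.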
An issue found while implementing this is that INSERT does not
handle invalid UTF8 partition values correctly, see IMPALA-14096.
This behavior is not changed in the patch.
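
The quoted issue description suggests detecting problematic partition
names before any file is moved out of the staging directory. A minimal
sketch of such an all-or-nothing pre-check follows. It is only an
illustration in Java: the file-moving logic the issue links to lives in
the C++ backend, and the class PartitionPreCheckSketch and its methods
are hypothetical names, not part of Impala.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration only; not part of Impala.
class PartitionPreCheckSketch {
  // True if the bytes form a valid UTF-8 sequence.
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  // Returns the index of the first partition value that is not valid UTF-8,
  // or -1 if all values are valid. A caller would run this before moving
  // any file and abort the whole operation on a non-negative result.
  static int firstInvalidPartitionValue(List<byte[]> partitionValues) {
    for (int i = 0; i < partitionValues.size(); i++) {
      if (!isValidUtf8(partitionValues.get(i))) return i;
    }
    return -1;
  }

  public static void main(String[] args) {
    // Partition values produced by concat(s, "a") in the quoted reproduction:
    // "aa", "aaa" and the bytes AA 61.
    List<byte[]> values = List.of(
        "aa".getBytes(StandardCharsets.UTF_8),
        "aaa".getBytes(StandardCharsets.UTF_8),
        new byte[] {(byte) 0xAA, 0x61});
    System.out.println(firstInvalidPartitionValue(values));
  }
}
```

Checking every value before the first rename means a failure leaves all
files in staging, instead of failing after some partitions have already
been moved.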
Testing:
- Added a few tests that push down non-ASCII constant expressions in
predicates (both with utf8_mode=true and false).
Change-Id: I70663457a0b0a3443e586350f0a5996bb75ba64a
Reviewed-on: http://gerrit.cloudera.org:8080/22603
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Writing non-UTF8 partition values can lead to dirty writes
> ----------------------------------------------------------
>
> Key: IMPALA-14096
> URL: https://issues.apache.org/jira/browse/IMPALA-14096
> Project: IMPALA
> Issue Type: Bug
> Reporter: Csaba Ringhofer
> Priority: Major
>
> {code}
> create table tspart (s string) partitioned by (p string);
> insert into tspart partition (p="a") values ("a");
> insert into tspart partition (p="aa") values ("aa");
> -- s is not valid utf8
> insert into tspart partition (p="a") values (unhex("aa"));
> -- insert the table again but swap p and s, so one partition will be unhex("aa")
> insert into tspart partition (p) select p s_, concat(s, "a") p_ from tspart;
> -- leads to error:
> 2025-05-26 11:47:03 [Exception] ERROR: Query
> da440f13f21ab301:79918f1100000000 failed:
> Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME
> hdfs://localhost:20500/test-warehouse/tspart/_impala_insert_staging/da440f13f21ab301_79918f1100000000/.da440f13f21ab301-79918f1100000002_588063374_dir/p=�a/da440f13f21ab301-79918f1100000002_782687841_data.0.txt
> TO
> hdfs://localhost:20500/test-warehouse/tspart/p=�a/da440f13f21ab301-79918f1100000002_782687841_data.0.txt)
> failed, error was:
> hdfs://localhost:20500/test-warehouse/tspart/_impala_insert_staging/da440f13f21ab301_79918f1100000000/.da440f13f21ab301-79918f1100000002_588063374_dir/p=�a/da440f13f21ab301-79918f1100000002_782687841_data.0.txt
> Error(5): Input/output error
> select count(*) from tspart;
> -- result: 3, the table looks unchanged
> refresh tspart;
> select count(*) from tspart;
> -- result: 4, because an extra file was found by refresh
> {code}
> While dirty writes are a known issue in non-transactional tables, reproducing
> them this easily should be avoided if possible. The problem in this case is that
> the error occurs while moving the files, so some files may already have been moved to
> their final destination. Detecting the problematic partition names earlier
> could ensure that files written for other partitions are not moved out of
> the staging dir.
> https://github.com/apache/impala/blob/f4e75510948bdb72f2d5206161fee12e5b6d0888/be/src/runtime/dml-exec-state.cc#L341