Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 )
Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters ...................................................................... Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc@94 PS10, Line 94: TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", false); : TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", true); Please add some tests like these for newly added marks like '\n', '\r', '^', etc. EDIT: please include all the escaped characters. http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@a40 PS10, Line 40: : : : Can you keep this comment? We should still mention common/src/java/org/apache/hadoop/hive/common/FileUtils.java since the list would change there. Also mention here that "Impala creates the partition directories using these encoded strings. So it's crucial to keep it consistent with Hive." http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@45 PS10, Line 45: static const std::unordered_set<char> SpecialCharacters = {'"', '#', '\\', '*', '/', : ':', '=', '?', '%', '^', '[', ']', '{', '\x7F'}; We are still missing several characters here, e.g. single quote, '\n' and '\r'. I think we should use the same list as Hive's. The list in Hive consists of control characters, i.e. ASCII characters from 0 to 31 and 127 (0x7F), and marks like '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '{', '[', ']', '^'. Code snipper of Hive: static BitSet charToEscape = new BitSet(128); static { for (char c = 0; c < ' '; c++) { charToEscape.set(c); } /** * ASCII 01-1F are HTTP control characters that need to be escaped. * \u000A and \u000D are \n and \r, respectively. */ char[] clist = new char[] {'\u0001', '\u0002', '\u0003', '\u0004', '\u0005', '\u0006', '\u0007', '\u0008', '\u0009', '\n', '\u000B', '\u000C', '\r', '\u000E', '\u000F', '\u0010', '\u0011', '\u0012', '\u0013', '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019', '\u001A', '\u001B', '\u001C', '\u001D', '\u001E', '\u001F', '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\u007F', '{', '[', ']', '^'}; for (char c : clist) { charToEscape.set(c); } Note that Hive adds some characters twice. ASCII code of ' ' is 32. So in the first loop, 0-31 are already added. In Java, char is a "16-bit integer" that represents a single 16-bit Unicode character. So Hive code uses unicodes like '\u007F'. But in C++, such unicodes are represented by UTF-8 bytes. And char in C++ is just 8 bits. So in the original code "\u00FF" is actually two chars. http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test@452 PS10, Line 452: INSERT INTO TABLE insert_string_partitioned PARTITION(s2="_.~ +") : SELECT "value" FROM functional.alltypessmall LIMIT 1; : ---- RESULTS : s2=_.~ +: 1 : ==== : ---- QUERY : # select with unencoded partition key : SELECT * FROM insert_string_partitioned; : ---- RESULTS : 'value','_.~ +' : ---- TYPES : string, string e2e tests for ASCII characters can be added here, e.g. INSERT INTO TABLE insert_string_partitioned PARTITION(s2="O'Reilly") values ('0'); Or use a string with all escaped marks. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Sun, 28 Apr 2024 02:12:16 +0000 Gerrit-HasComments: Yes
