[
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829530#comment-17829530
]
Ye Zihao commented on IMPALA-12927:
-----------------------------------
[~csringhofer] Yes, the BINARY test was explicitly skipped at the time because
the second case in 'binary-test.test' did not pass: the final hexadecimal
result did not match the expected one. Since BINARY support was not the main
focus of that patch, I chose to skip the test temporarily.
I haven't investigated the cause of the mismatch carefully yet, but I suspect
it is related to rapidjson::Reader's default use of UTF-8 encoding: binary
data would be parsed as UTF-8 strings, leading to problems in the subsequent
conversion. At least, if we use the string type of 'AuxColumnType' when
calling 'WriteSlot', the first 6 rows in 'binary_tbl_json' can be read
correctly.
I still need some time to verify my guess. If you have any thoughts, please let
me know, thanks.
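To illustrate the suspicion above, here is a minimal Python sketch of why treating binary data as UTF-8 text can be lossy. It does not use rapidjson or any Impala code. The raw byte values for rows 7 and 8 are an assumption inferred from how those rows appear in the file quoted in the issue description, and `utf8_round_trip` is a hypothetical helper name.

```python
# Assumed raw values for rows 7 and 8 (inferred from the quoted file contents,
# not taken from the table definition).
raw_rows = {
    7: bytes.fromhex("00ff00ff"),
    8: bytes.fromhex("ff4433221100"),
}


def utf8_round_trip(raw: bytes) -> bytes:
    """Decode bytes as UTF-8, replacing invalid sequences, then re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Valid UTF-8 (as in rows 1-6) survives the round trip unchanged.
valid = "你好hello".encode("utf-8")
print("valid utf8 lossless:", utf8_round_trip(valid) == valid)

# Invalid UTF-8 does not: each bad byte becomes U+FFFD (EF BF BD when encoded).
for row_id, raw in raw_rows.items():
    back = utf8_round_trip(raw)
    status = "lossless" if back == raw else "lossy"
    print(row_id, raw.hex(), "->", back.hex(), status)
```

If the original bytes were already replaced by U+FFFD at some point in the write or read path, no later conversion can recover them, which would be consistent with only the first 6 rows being readable as strings.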
> Support reading BINARY columns in JSON tables
> ---------------------------------------------
>
> Key: IMPALA-12927
> URL: https://issues.apache.org/jira/browse/IMPALA-12927
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Reporter: Csaba Ringhofer
> Assignee: Ye Zihao
> Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> +----+--------------+------------+
> | id | string_col   | binary_col |
> +----+--------------+------------+
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> +----+--------------+------------+
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col,
> type: STRING, data: 'binary1'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: 'binary2'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: '你好hello'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: '��'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: '�D3"'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u0000�\u0000�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u0000"}
> {code}
>