[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829530#comment-17829530
 ] 

Ye Zihao commented on IMPALA-12927:
-----------------------------------

[~csringhofer] Yes, the BINARY test was explicitly skipped at the time because 
the second case in 'binary-test.test' could not pass: the final hexadecimal 
result did not match the expected one. Since BINARY support was not a priority 
for that patch, I chose to skip this test temporarily.

I haven't carefully investigated the cause of the mismatch yet, but I guess 
it's related to rapidjson::Reader's default use of UTF-8 encoding: binary data 
would be parsed as UTF-8 encoded strings, leading to subsequent conversion 
issues. At least, if we use the string type of 'AuxColumnType' when calling 
'WriteSlot', the first 6 rows in 'binary_tbl_json' can be read correctly.
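
A minimal Python sketch of the suspected failure mode, not of rapidjson's 
actual behavior (which is still unverified): if arbitrary bytes are treated as 
UTF-8 text somewhere along the way, valid UTF-8 payloads survive a round trip 
but invalid ones do not. The helper name 'utf8_round_trip' and the byte values 
used for the invalid case are assumptions made for illustration only.

```python
def utf8_round_trip(raw: bytes) -> bytes:
    """Decode bytes as UTF-8, replacing invalid sequences, then re-encode.

    Hypothetical helper: it mimics a reader that treats binary data as
    UTF-8 text. It is not how Impala or rapidjson is implemented.
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Valid UTF-8 payload, like the values in rows 5 and 6 of the table.
valid = "你好hello".encode("utf-8")

# Assumed example of bytes that are not valid UTF-8 (0xFF never appears in
# well-formed UTF-8); chosen for illustration, not read from the test file.
invalid = bytes([0x00, 0xFF, 0x00, 0xFF])

valid_ok = utf8_round_trip(valid) == valid        # bytes are preserved
invalid_ok = utf8_round_trip(invalid) == invalid  # bytes are altered

print("valid payload preserved:", valid_ok)
print("invalid payload preserved:", invalid_ok)
print("invalid payload became:", utf8_round_trip(invalid).hex())
```

Under this assumption, each invalid byte is replaced by U+FFFD, so the 
hexadecimal form of the value no longer matches the original. That would be 
consistent with rows 1 to 6 reading correctly while the invalid UTF-8 rows 
fail.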

I still need some time to verify my guess. If you have any thoughts, please let 
me know, thanks.

> Support reading BINARY columns in JSON tables
> ---------------------------------------------
>
>                 Key: IMPALA-12927
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12927
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Ye Zihao
>            Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> +----+--------------+------------+
> | id | string_col   | binary_col |
> +----+--------------+------------+
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> +----+--------------+------------+
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '你好hello'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '��'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '�D3"'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
>  hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u0000�\u0000�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u0000"}
> {code}
>  


