[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836458#comment-17836458 ]
Zihao Ye commented on IMPALA-12927:
-----------------------------------
Hi [~csringhofer], I briefly looked through the related Hive code and didn't
find any special handling for special characters in the rawstring case; it
simply uses the Jackson library for reading and writing.
Hive appears to assume that the data contains no special characters, which
suggests that users of rawstring should first encode binary data with base64
or a similar scheme so that it contains no special characters.
Therefore, I think it's feasible to proceed with the previously mentioned
approach: treat "json.binary.format" as the "AuxColumnType" for JSON binary
columns. If this table property is not set, disable reading of binary columns
in that JSON table and return an error message.
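To make the proposal concrete, here is a minimal Python sketch of the decision logic only. It is not Impala's implementation: the function name, the exception type, and the dict used to model table properties are all hypothetical, and I'm assuming the property takes the values "base64" and "rawstring".

```python
import base64
from typing import Mapping, Optional

# Table property named in the comment above.
BINARY_FORMAT_PROP = "json.binary.format"


class BinaryColumnError(Exception):
    """Hypothetical error type for unreadable BINARY columns."""


def decode_binary_value(raw: Optional[str],
                        table_props: Mapping[str, str]) -> Optional[bytes]:
    """Decode one JSON string value of a BINARY column.

    Illustrative only: the accepted values "base64" and "rawstring"
    are an assumption of this sketch.
    """
    fmt = table_props.get(BINARY_FORMAT_PROP)
    if fmt is None:
        # Property not set: refuse to read BINARY columns at all.
        raise BinaryColumnError(
            f"'{BINARY_FORMAT_PROP}' is not set; cannot read BINARY columns")
    if raw is None:
        # JSON null maps to SQL NULL.
        return None
    fmt = fmt.lower()
    if fmt == "base64":
        try:
            # validate=True rejects characters outside the base64 alphabet.
            return base64.b64decode(raw, validate=True)
        except ValueError as e:
            raise BinaryColumnError(f"invalid base64 data: {e}") from e
    if fmt == "rawstring":
        # The parsed JSON string is taken as-is, as UTF-8 bytes.
        return raw.encode("utf-8")
    raise BinaryColumnError(f"unsupported {BINARY_FORMAT_PROP}: {fmt!r}")
```

With `{"json.binary.format": "base64"}`, the value "YmluYXJ5MQ==" decodes to the bytes of "binary1"; with no property set, every read of a binary column raises the error instead of silently returning NULL.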
> Support reading BINARY columns in JSON tables
> ---------------------------------------------
>
> Key: IMPALA-12927
> URL: https://issues.apache.org/jira/browse/IMPALA-12927
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Affects Versions: Impala 4.3.0
> Reporter: Csaba Ringhofer
> Assignee: Zihao Ye
> Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> +----+--------------+------------+
> | id | string_col   | binary_col |
> +----+--------------+------------+
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> +----+--------------+------------+
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col,
> type: STRING, data: 'binary1'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: 'binary2'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: '你好hello'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: '��'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING,
> data: '�D3"'
> Error parsing row: file:
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u0000�\u0000�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u0000"}
> {code}
>