[ 
https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948226#comment-17948226
 ] 

ASF subversion and git services commented on IMPALA-12927:
----------------------------------------------------------

Commit faf322dd414c7d93bf5d6bd8e1a83113c19e9c7d in impala's branch 
refs/heads/master from Eyizoha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=faf322dd4 ]

IMPALA-12927: Support specifying format for reading JSON BINARY columns

Currently, Impala always assumes that the data in the binary columns of
JSON tables is base64 encoded. However, before HIVE-21240, Hive wrote
binary data to JSON tables without base64 encoding it, instead writing
it as escaped strings. After HIVE-21240, Hive defaults to base64
encoding binary data when writing to JSON tables and introduces the
serde property 'json.binary.format' to indicate the encoding method of
binary data in JSON tables.

To stay consistent with Hive and avoid correctness issues caused by
reading data in the wrong format, this patch also introduces the serde
property 'json.binary.format' to specify how binary data in JSON tables
is read. The property currently supports the base64 and rawstring
formats, the same as Hive.
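
To illustrate why the format matters, here is a minimal Python sketch
(not Impala code) of decoding one JSON string value under each format.
The helper name decode_binary is hypothetical, and treating a rawstring
value as its UTF-8 bytes is an assumption made for the example; Hive's
actual escaping rules are not modeled.

```python
import base64
import binascii


def decode_binary(json_value: str, fmt: str) -> bytes:
    """Decode the string value of a JSON BINARY field (illustrative only)."""
    if fmt == "base64":
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(json_value, validate=True)
    if fmt == "rawstring":
        # Assumption: the unescaped string is taken as its UTF-8 bytes.
        return json_value.encode("utf-8")
    raise ValueError(f"unsupported format: {fmt}")


# A value that was written base64 encoded decodes cleanly.
decoded_base64 = decode_binary("YmluYXJ5MQ==", "base64")

# A value that was written as a raw string is not valid base64, so
# decoding it as base64 fails instead of returning the original bytes.
try:
    decode_binary("binary1", "base64")
    raw_as_base64_failed = False
except binascii.Error:
    raw_as_base64_failed = True

# The same value reads fine when the format is rawstring.
decoded_raw = decode_binary("binary1", "rawstring")
```

The failing case mirrors the 'binary1' value in the issue description
quoted below, which was written without base64 encoding.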

Additionally, this patch introduces a query option 'json_binary_format'
to achieve the same effect. The query option only takes effect for JSON
tables where the serde property 'json.binary.format' is not set. The
reading format of binary columns in JSON tables can be configured
globally through 'default_query_options'. Note that the default value of
'json_binary_format' is 'NONE'. Impala refuses to read binary columns of
a JSON table in two cases: 'json.binary.format' is not set and
'json_binary_format' is 'NONE', or 'json.binary.format' is set to an
invalid value. In both cases it returns an error message, so that an
incorrect format is never used without the user noticing.
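
The precedence described above can be sketched in Python as follows.
This is an illustration of the rules as stated in this message, not the
actual Impala implementation: the function name, the error texts, and
the case-insensitive comparison of values are all assumptions.

```python
from typing import Mapping

# Formats named in the commit message; 'NONE' is the query option default.
VALID_FORMATS = {"BASE64", "RAWSTRING"}


def resolve_json_binary_format(serde_props: Mapping[str, str],
                               query_option: str = "NONE") -> str:
    """Pick the format used to read BINARY columns of a JSON table."""
    table_value = serde_props.get("json.binary.format")
    if table_value is not None:
        # The serde property wins whenever it is set, even if invalid.
        fmt = table_value.upper()  # assumption: case-insensitive match
        if fmt not in VALID_FORMATS:
            raise ValueError(f"invalid json.binary.format: {table_value}")
        return fmt
    # Fall back to the query option only when the property is absent.
    fmt = query_option.upper()
    if fmt == "NONE":
        raise ValueError("no json.binary.format set and "
                         "json_binary_format is NONE")
    if fmt not in VALID_FORMATS:
        raise ValueError(f"invalid json_binary_format: {query_option}")
    return fmt
```

For example, a table with 'json.binary.format' set to 'rawstring' would
resolve to RAWSTRING regardless of the query option, while a table
without the property and with the default query option raises an error.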

Testing:
  - Enabled existing binary type E2E tests for JSON tables
  - Added new E2E test for 'json.binary.format'

Change-Id: Idf61fa3afc0f33caa63fbc05393e975733165e82
Reviewed-on: http://gerrit.cloudera.org:8080/22289
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins


> Support reading BINARY columns in JSON tables
> ---------------------------------------------
>
>                 Key: IMPALA-12927
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12927
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Csaba Ringhofer
>            Assignee: Zihao Ye
>            Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive 
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> +----+--------------+------------+
> | id | string_col   | binary_col |
> +----+--------------+------------+
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> +----+--------------+------------+
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, 
> type: STRING, data: 'binary1'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'binary2'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: 'árvíztűrőtükörfúró'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '你好hello'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '��'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, 
> data: '�D3"'
> Error parsing row: file: 
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0, before 
> offset: 481
> {code}
> The single file in the table looks like this:
> {code}
>  hdfs://localhost:20500/test-warehouse/binary_tbl_json/000000_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u0000�\u0000�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u0000"}
> {code}
>  


