[
https://issues.apache.org/jira/browse/HIVE-28632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko resolved HIVE-28632.
-----------------------------------
Fix Version/s: 4.1.0
Resolution: Fixed
> Fix issues in JSON SerDe implementations related to Boolean, Binary data types
> ------------------------------------------------------------------------------
>
> Key: HIVE-28632
> URL: https://issues.apache.org/jira/browse/HIVE-28632
> Project: Hive
> Issue Type: Bug
> Security Level: Public(Viewable by anyone)
> Reporter: Araika Singh
> Assignee: Araika Singh
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> *Boolean Handling:* Prior to HIVE-21240, the JSON deserializer converted
> boolean strings without considering case sensitivity, resulting in {{false}}
> for any non-standard strings. Now, this behavior is case-sensitive, resulting
> in {{false}} for all non-conforming strings. To follow SQL standards, any
> string other than "false" (case-insensitive) should be treated as
> {{{}true{}}}. There are also inconsistencies between outputs when using an
> INSERT statement (which follows SQL standards) and loading a JSON value
> directly. These inconsistencies need to be fixed to ensure uniform output.
> *Binary Handling:* Previously, Hive parsed JSON nodes as raw strings,
> allowing users to directly load JSON text to achieve desired results. Post
> HIVE-21240, Hive now mandates base64-encoded JSON texts, throwing errors for
> any other format. It's important to note that INSERT statements work
> seamlessly as the values are serialized to base64 and then deserialized. Hive
> should be capable of detecting if a JSON node is base64-encoded or not; if
> not, it should default to deserializing as a raw string.
> *+To enhance user experience and maintain SQL standard compliance, we need to
> address the following:+*
> # Ensure case-sensitive handling of boolean strings in JSON deserialization
> and resolve inconsistencies between INSERT statements and direct JSON loading.
> # Implement automatic detection and appropriate deserialization of JSON
> nodes, distinguishing between base64-encoded and raw strings.
> *+Steps to replicate boolean handling issue:+*
> {code:java}
> // values for json: "booleancaseinsensitive" : "TrUE", "booleanstring" :
> "true", "booleanboolean" : true, "stringfalse" : "FaLSE", "somestring" :
> "somestringhere", "booleannull" : null, "booleannumfalse" : 0,
> "booleannumtrue" : -1
> create table json_serde_bool_table (
> booleancaseinsensitive boolean,
> booleanstring boolean,
> booleanboolean boolean,
> stringfalse boolean,
> somestring boolean,
> booleannull boolean,
> booleannumfalse boolean,
> booleannumtrue boolean)
> row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe';
> LOAD DATA INPATH '/warehouse/files/jsonserde.txt' INTO TABLE
> json_serde_bool_table;
> select * from json_serde_bool_table;{code}
> +_Output for the above:_+
> {noformat}
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | booleancaseinsensitive | booleanstring | booleanboolean | stringfalse |
> somestring | booleannull | booleannumfalse | booleannumtrue |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | false | true | true | false |
> false | NULL | false | true |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+{noformat}
> {code:java}
> truncate table json_serde_bool_table;
> INSERT INTO TABLE json_serde_bool_table VALUES ("TrUE", "true", true,
> "FaLSE", "somestringhere", null, 0, -1);
> select * from json_serde_bool_table;{code}
> +_Output for the above:_+
> {noformat}
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | booleancaseinsensitive | booleanstring | booleanboolean | stringfalse |
> somestring | booleannull | booleannumfalse | booleannumtrue |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+
> | true | true | true | false |
> true | NULL | false | true |
> +-------------------------+----------------+-----------------+--------------+-------------+--------------+------------------+-----------------+{noformat}
> *+Steps to replicate binary handling issue:+*
> {code:java}
> create table json_serde_table (
> binarycolumn1 binary,
> binarycolumn2 binary,
> binarycolumn3 binary,
> binarycolumn4 binary,
> binarycolumn5 binary,
> binarycolumn6 binary
> )
> row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe';
> LOAD DATA INPATH '/path/to/jsonserde.txt' INTO TABLE json_serde_table; // the
> concerned json text show be normal text(not base64)
> select * from json_serde_table; {code}
> _+Error stack trace:+_
> {code:java}
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
> org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary
> type
> from record.
> java.io.IOException:
> org.apache.hadoop.hive.serde2.SerDeException:
> org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary
> type
> from record.
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:628)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:535)
> at
> org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194)
> at
> org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
> at
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336)
> at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:356)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> at
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> Caused
> by: org.apache.hadoop.hive.serde2.SerDeException:
> org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary
> type
> from record.
> at
> org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:191)
> at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:609)
> ... 18 more
> Caused
> by: org.apache.hadoop.hive.serde2.SerDeException: Error generating JSON binary
> type from record.
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:464)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitLeafNode(HiveJsonReader.java:418)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:224)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitStructNode(HiveJsonReader.java:335)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.visitNode(HiveJsonReader.java:229)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.parseStruct(HiveJsonReader.java:206)
> at
> org.apache.hadoop.hive.serde2.JsonSerDe.deserialize(JsonSerDe.java:187)
> ... 19 more
> Caused
> by: org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException:
> Cannot access contents of TextNode as binary due to broken Base64 encoding:
> Unexpected end of base64-encoded String: base64 variant 'MIME-NO-LINEFEEDS'
> expects padding (one or more '=' characters) at the end. This Base64Variant
> might have been incorrectly configured
> at
> org.apache.hive.com.fasterxml.jackson.databind.exc.InvalidFormatException.from(InvalidFormatException.java:67)
> at
> org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.getBinaryValue(TextNode.java:78)
> at
> org.apache.hive.com.fasterxml.jackson.databind.node.TextNode.binaryValue(TextNode.java:89)
> at
> org.apache.hadoop.hive.serde2.json.HiveJsonReader.getByteValue(HiveJsonReader.java:458)
> ... 25 more
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)