wypoon opened a new pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895
 
 
   ### What changes were proposed in this pull request?
   
   When a partitioned table stored with 
`org.apache.hive.hcatalog.data.JsonSerDe` is queried and more than one task 
runs concurrently in each executor, the following exception is thrown:
   
   `java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
org.apache.hive.hcatalog.data.HCatRecord`
   
   The exception occurs in `HadoopTableReader.fillObject`.
   
   `org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a 
`cachedObjectInspector` field by calling 
`HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not 
thread-safe; this `cachedObjectInspector` is returned by 
`JsonSerDe#getObjectInspector`.
   
   We protect against this Hive bug by synchronizing on a shared object 
whenever we call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` 
instances (which may be `JsonSerDe` instances). With this synchronization, the 
`Deserializer` for each partition of the JSON table and the table `SerDe` get 
the same cached `ObjectInspector`, and `HadoopTableReader.fillObject` then 
works correctly. (If the `ObjectInspector`s differ, a bug in 
`HCatRecordObjectInspector` causes an `ArrayList` to be created instead of an 
`HCatRecord`, which results in the `ClassCastException` shown above.)
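   
   The locking idea can be sketched in isolation as follows. This is a minimal illustration, not the code in this PR: `InspectorFactory`, `SketchDeserializer`, `DeserializerLock`, and `countDistinctInspectors` are hypothetical stand-ins that only mimic an unsynchronized check-then-put cache, and they do not use any Hive or Spark classes.
   
```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a factory whose cache is not thread-safe:
// the check-then-put below can race when called without external locking.
final class InspectorFactory {
    private static final Map<String, Object> CACHE = new HashMap<>();

    static Object getInspector(String typeInfo) {
        Object inspector = CACHE.get(typeInfo);
        if (inspector == null) {
            inspector = new Object();
            CACHE.put(typeInfo, inspector);
        }
        return inspector;
    }
}

// Hypothetical stand-in for a deserializer that caches its inspector in initialize.
final class SketchDeserializer {
    private Object cachedInspector;

    void initialize(String typeInfo) {
        cachedInspector = InspectorFactory.getInspector(typeInfo);
    }

    Object getObjectInspector() {
        return cachedInspector;
    }
}

final class DeserializerLock {
    // One process-wide lock, so that only one initialize call runs at a time.
    private static final Object LOCK = new Object();

    static void initializeSafely(SketchDeserializer deserializer, String typeInfo) {
        synchronized (LOCK) {
            deserializer.initialize(typeInfo);
        }
    }

    // Initializes one deserializer per concurrent task and returns how many
    // distinct inspector instances the tasks ended up with.
    static int countDistinctInspectors(int tasks) {
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all tasks at once to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                SketchDeserializer deserializer = new SketchDeserializer();
                initializeSafely(deserializer, "struct<a:int>");
                seen.add(deserializer.getObjectInspector());
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }
}
```
   
   Because every `initialize` call in the sketch goes through the same lock, all tasks observe a single cached inspector instance, which is the property the fix relies on.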
   
   ### Why are the changes needed?
   
   To work around a thread-safety bug in Hive.
   
   ### Does this PR introduce any user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Tested manually on a cluster by running a query against a partitioned JSON 
table with more than one core per executor. Before this change, the 
`ClassCastException` occurred consistently; with this change, it no longer occurs.
   
