[
https://issues.apache.org/jira/browse/HIVE-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-23729:
----------------------------------
Labels: pull-request-available (was: )
> LLAP text cache fails when using multiple tables/schemas on the same files
> --------------------------------------------------------------------------
>
> Key: HIVE-23729
> URL: https://issues.apache.org/jira/browse/HIVE-23729
> Project: Hive
> Issue Type: Bug
> Reporter: Ádám Szita
> Assignee: Ádám Szita
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When using the text-based cache we will hit exceptions in the following case:
> * Table A with 3 columns is defined on location X (where we have text-based data files)
> * Table B with 2 columns is defined on the same location X
> * The user runs a query on table A, thereby filling the LLAP cache.
> * If the next query goes against table B, which has a different schema, LLAP will throw an error:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
> 	at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getCacheDataForOneSlice(SerDeLowLevelCacheImpl.java:411)
> 	at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getFileData(SerDeLowLevelCacheImpl.java:389)
> 	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:819)
> 	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:720)
> 	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:274)
> 	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:271)
> {code}
> This is because the cache lookup is based on file ID, which in this case is the same for both tables. However, unlike with ORC files, the cached content and the file content are different, because the cached form depends on the schema defined by the user: the original text content is encoded into ORC in the cache.
> I think for the text cache case we will need to extend the cache key from the simple file ID to something that also tracks the schema. This will result in caching the *same* *file* *content* multiple times (when multiple such schemas exist), but as we can see, the *cached content itself can be quite different* (e.g. different streams with different encodings), and in turn we gain correctness.
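A minimal sketch of the proposed key extension, assuming a composite key of the existing file ID plus a serialized schema signature. The class and field names here are illustrative, not Hive's actual API; the point is only that equals/hashCode must incorporate the schema so that identical files read under different table schemas map to distinct cache entries:

```java
import java.util.Objects;

// Hypothetical composite cache key: file ID alone is no longer sufficient,
// because the cached (ORC-encoded) content depends on the reader's schema.
final class TextCacheKey {
    private final long fileId;       // synthetic file ID, as used by the cache today
    private final String schemaDesc; // e.g. serialized column names/types of the table

    TextCacheKey(long fileId, String schemaDesc) {
        this.fileId = fileId;
        this.schemaDesc = schemaDesc;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TextCacheKey)) {
            return false;
        }
        TextCacheKey other = (TextCacheKey) o;
        // Same file read under a different schema must NOT hit the same entry.
        return fileId == other.fileId && schemaDesc.equals(other.schemaDesc);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileId, schemaDesc);
    }
}
```

With a key like this, table A (3 columns) and table B (2 columns) over the same location would each populate their own cache entries, at the cost of caching the same underlying file twice.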
--
This message was sent by Atlassian Jira
(v8.3.4#803005)