Ádám Szita created HIVE-23729:
---------------------------------

             Summary: LLAP text cache fails when using multiple tables/schemas on the same files
                 Key: HIVE-23729
                 URL: https://issues.apache.org/jira/browse/HIVE-23729
             Project: Hive
          Issue Type: Bug
            Reporter: Ádám Szita
            Assignee: Ádám Szita


When using the text-based cache, we hit exceptions in the following case:
 * Table A with 3 columns is defined on location X (where we have text-based 
data files).
 * Table B with 2 columns is defined on the same location X.
 * A user runs a query on table A, thereby filling the LLAP cache.
 * If the next query goes against table B, which has a different schema, LLAP 
throws an error:

{code:java}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
 at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getCacheDataForOneSlice(SerDeLowLevelCacheImpl.java:411)
 at org.apache.hadoop.hive.llap.cache.SerDeLowLevelCacheImpl.getFileData(SerDeLowLevelCacheImpl.java:389)
 at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:819)
 at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:720)
 at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:274)
 at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:271)
 {code}
This is because the cache lookup is based on file ID, which in this case is the 
same for both tables. However, unlike with ORC files, the cached content 
differs from the file content, because it depends on the schema defined by the 
user: the original text content is encoded into ORC in the cache.
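
To illustrate why the cached form depends on the schema, here is a toy sketch 
(not the actual LLAP encoder). The class and method names are hypothetical, and 
it assumes a comma-delimited text row that is split into one stream per 
declared column:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaDependentEncoding {

    // Hypothetical stand-in for the text-to-columnar encoding step:
    // produces one "stream" (list of values) per column declared by the table.
    static List<List<String>> encode(List<String> rows, int columnCount) {
        List<List<String>> streams = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            streams.add(new ArrayList<>());
        }
        for (String row : rows) {
            String[] fields = row.split(",", -1);
            for (int c = 0; c < columnCount; c++) {
                // Fields beyond the declared columns are dropped; missing ones become null.
                streams.get(c).add(c < fields.length ? fields[c] : null);
            }
        }
        return streams;
    }

    public static void main(String[] args) {
        List<String> fileContent = List.of("1,a,2.5", "2,b,3.5");
        // Same bytes on disk, but a different number of streams per schema.
        System.out.println(encode(fileContent, 3).size());
        System.out.println(encode(fileContent, 2).size());
    }
}
```

In this toy model, a reader that expects one column count but receives streams 
encoded for another can index past the end of the stream list, which is at 
least consistent with the exception shown above.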

I think for the text cache case we will need to extend the cache key from just 
the file ID to something that tracks the schema too. This will result in 
caching the *same file content* multiple times (if there are multiple schemas 
like this). However, as we can see, the *cached content itself could be quite 
different* (e.g. different streams with different encodings), and in turn we 
gain correctness.
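
A minimal sketch of what such an extended key could look like follows. All 
names here are hypothetical and do not come from the Hive code base, and the 
choice of a column type list as the schema component is an assumption; a real 
fix might track the schema differently:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class SchemaAwareCacheDemo {

    /** Hypothetical cache key: file ID plus the column types of the reading table. */
    static final class CacheKey {
        private final Object fileId;
        private final List<String> columnTypes;

        CacheKey(Object fileId, List<String> columnTypes) {
            this.fileId = fileId;
            this.columnTypes = List.copyOf(columnTypes); // immutable defensive copy
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof CacheKey)) return false;
            CacheKey other = (CacheKey) o;
            return Objects.equals(fileId, other.fileId)
                && columnTypes.equals(other.columnTypes);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fileId, columnTypes);
        }
    }

    // Puts entries for two schemas over the same file and returns the entry count.
    static int distinctEntries() {
        Map<CacheKey, String> cache = new HashMap<>();
        Object fileId = 42L; // same file for both tables
        cache.put(new CacheKey(fileId, List.of("int", "string", "double")), "encoded for table A");
        cache.put(new CacheKey(fileId, List.of("int", "string")), "encoded for table B");
        // Same file and same schema: overwrites the existing table B entry.
        cache.put(new CacheKey(fileId, List.of("int", "string")), "encoded for table B again");
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctEntries());
    }
}
```

With a key like this, table A and table B get separate cache entries for the 
same file, at the cost of the duplicated caching described above.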



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
