[
https://issues.apache.org/jira/browse/HIVE-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692715#comment-14692715
]
Sergey Shelukhin commented on HIVE-11245:
-----------------------------------------
Most of the work was done in 3 sub-tasks.
1) Three groups of classes were added to the storage API.
a) DiskRange; ORC already depends on it, so it was an oversight on master that
it was not moved to storage-api. It has been moved on the llap branch.
b) EncodedColumnBatch and MemoryBuffer. This is the encoded-data equivalent of
moving VRB and *ColumnVector.
c) DataCache, Pool and Allocator APIs (the only import in any of them is
MemoryBuffer, so they are very generic). The right place to implement a
format-agnostic cache, allocator, and object pool is Hive, and input formats
can use these deep inside their core functionality, where Hive has no insight.
Therefore it makes sense to have connective interfaces.
2) The ....orc.encoded package was created with a fully separate path for the
"record reader", as discussed, although I don't think it was necessary. That
required making some things in RecordReaderUtils, etc. public because the Java
visibility model is stupid.
The package contains 9 files, most of which are very small.
* EncodedOrcFile - equivalent to OrcFile, static factory for Reader.
* Reader - interface, equivalent to orc.Reader, produces EncodedReader.
* EncodedReader - interface, equivalent in role to RecordReader (although the
method signatures differ), for reading encoded data.
* Consumer - interface used in EncodedReader call to return data asynchronously
(logically, a queue for returned data with "done" and "error" markers).
* OrcBatchKey, OrcCacheKey - simple data structures used as keys when passing
data and for the cache.
* ReaderImpl - equivalent to orc.ReaderImpl, the Reader interface
implementation.
* EncodedReaderImpl - equivalent in role to RecordReaderImpl (although the
method signatures differ), the main class that contains the code. It is
package-private, so it's not even visible outside the package.
* CacheChunk - part of EncodedReaderImpl that has to be visible for tests, so
it's in a separate file.
3) The remaining item is moving the TreeReader bits that depend on the
orc.encoded package into the encoded package. Either [~prasanth_j] or I can do
this.
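
To illustrate the Consumer contract described above (logically, a queue for
returned data with "done" and "error" markers), here is a minimal sketch. It is
not the code from the branch: the method names consumeData, setDone and
setError, and the QueueConsumer helper with its drain() method, are assumptions
made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical callback shape: zero or more data items, followed by
// exactly one terminal marker ("done" or "error").
interface Consumer<T> {
    void consumeData(T data);
    void setDone();
    void setError(Throwable t);
}

// Illustrative queue-backed implementation. A reader thread pushes items
// and a terminal marker; the caller drains them in arrival order.
final class QueueConsumer<T> implements Consumer<T> {
    private static final Object DONE = new Object();

    // Wraps a failure so it can travel through the same queue as the data.
    private static final class ErrorMarker {
        final Throwable cause;
        ErrorMarker(Throwable cause) { this.cause = cause; }
    }

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

    @Override public void consumeData(T data) { queue.add(data); }
    @Override public void setDone() { queue.add(DONE); }
    @Override public void setError(Throwable t) { queue.add(new ErrorMarker(t)); }

    // Blocks until a terminal marker arrives. Returns all data seen before
    // "done", or throws IllegalStateException if "error" arrives first.
    List<T> drain() {
        List<T> result = new ArrayList<>();
        try {
            while (true) {
                Object next = queue.take();
                if (next == DONE) {
                    return result;
                }
                if (next instanceof ErrorMarker) {
                    throw new IllegalStateException(
                        "read failed", ((ErrorMarker) next).cause);
                }
                @SuppressWarnings("unchecked")
                T item = (T) next;
                result.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

In this sketch the producer side (the encoded reader, in the terms used above)
would call consumeData for each batch and finish with setDone or setError,
while the caller blocks in drain() on another thread. Putting the markers on
the same queue as the data keeps them ordered relative to it.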
> LLAP: Fix the LLAP to ORC APIs
> ------------------------------
>
> Key: HIVE-11245
> URL: https://issues.apache.org/jira/browse/HIVE-11245
> Project: Hive
> Issue Type: Sub-task
> Reporter: Owen O'Malley
> Assignee: Sergey Shelukhin
> Priority: Blocker
>
> Currently the LLAP branch has refactored the ORC code to have different code
> paths depending on whether the data is coming from the cache or a FileSystem.
> We need to introduce a concept of a DataSource that is responsible for
> getting the necessary bytes regardless of whether they are coming from a
> FileSystem, in memory cache, or both.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)