[ https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-15147:
------------------------------------
    Attachment: HIVE-15147.WIP.noout.patch

Very early WIP patch. This adds a test with a huge out file; I am excluding the 
out file for now, since it is 2 MB and will change before commit.

This contains the requisite infrastructure and the basic pipeline, which seems 
to work. Main remaining items:
1) Wire up the actual cache instead of just the allocator, and re-enable refcount usage.
2) Unexpected problem - decide how to handle vectorizability. The problem is 
that right now we can vectorize the pipeline for an arbitrary InputFormat/serde 
only if we run it in LLAP with this feature; but we only decide to run in LLAP 
if the pipeline is vectorized. This creates a catch-22 - the vectorizer assumes 
we will run in LLAP, but if for some reason we don't, we need to go back and 
un-vectorize; or, if we decide on LLAP status first, we'd have to trust the 
vectorizer. Perhaps we can have an LLAP pre-decider that runs before 
vectorization. Alternatively, we can have a converter that applies the same 
logic as the LLAP IO change from the operators' vantage point - so even if we 
cannot use LLAP IO, we can still run the vectorized pipeline. IIRC [~mmccline] 
has a feature that allows vectorizing non-vectorizable inputs; we could just 
use that.
3) Decide how to split the file horizontally. Right now the entire text file 
is treated as a single giant RG. File offsets would need to be exposed somehow; 
then we can cache data keyed on those offsets and consult the cache before 
reading. For formats that support it (e.g. text), it's trivial to trick the 
source IF into reading only parts of the file - by faking the splits with the 
intended offsets. However, it's hard to get the offsets in the first place. We 
might need to override or C/P the text input format/RR for that.
4) Optionally, go beyond what (3) would require in terms of metadata cache.
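The fake-splits idea in (3) can be sketched as follows. This is a minimal 
illustration with a hypothetical Slice type standing in for Hadoop's FileSplit, 
not the actual patch code: given cache-aligned slice boundaries and the set of 
slices missing from cache, fabricate splits covering only the missing ranges, 
so the source InputFormat reads just those parts of the file.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: fabricate per-slice "splits" at cache-aligned offsets so the source
 *  InputFormat only reads the horizontal slices missing from cache.
 *  Slice is a hypothetical stand-in for org.apache.hadoop.mapred.FileSplit. */
public class FakeSplits {
  static final class Slice {
    final String path; final long start; final long length;
    Slice(String path, long start, long length) {
      this.path = path; this.start = start; this.length = length;
    }
    @Override public String toString() {
      return path + "[" + start + "," + (start + length) + ")";
    }
  }

  /** boundaries[i] is the start offset of slice i (these double as cache keys);
   *  missing[i] says whether slice i was evicted or never cached. */
  static List<Slice> makeSplitsForMissing(
      String path, long[] boundaries, long fileLen, boolean[] missing) {
    List<Slice> result = new ArrayList<>();
    for (int i = 0; i < boundaries.length; ++i) {
      if (!missing[i]) continue; // slice is fully cached, no split needed
      long start = boundaries[i];
      long end = (i + 1 < boundaries.length) ? boundaries[i + 1] : fileLen;
      result.add(new Slice(path, start, end - start));
    }
    return result;
  }

  public static void main(String[] args) {
    long[] boundaries = {0, 4096, 8192};
    boolean[] missing = {false, true, true}; // slices 1 and 2 evicted
    System.out.println(makeSplitsForMissing("f.txt", boundaries, 10000, missing));
    // prints [f.txt[4096,8192), f.txt[8192,10000)]
  }
}
```

As noted above, the hard part is not fabricating the splits but obtaining 
consistent boundaries in the first place, since the text IF/RR does not expose 
them.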

cc [~gopalv]

> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
>                 Key: HIVE-15147
>                 URL: https://issues.apache.org/jira/browse/HIVE-15147
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. Nothing would 
> prevent other formats from using the same path, in principle, although, as 
> was originally done with ORC, it may be better to have native caching support 
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have ORC-encoded 
> cache that is columnar due to ORC file structure, we will transform data into 
> columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was 
> compressed with some poor compression codec, such as csv. Using the original 
> IF and serde, as well as an ORC writer (with some heavyweight optimizations 
> disabled, potentially), we can "uncompress" the csv/whatever data into its 
> "original" ORC representation, then cache it efficiently, by column, and also 
> reuse a lot of the existing code.
> Various other points:
> 1) Caching granularity will have to be somehow determined (i.e. how do we 
> slice the file horizontally, to avoid caching entire columns). As with ORC 
> uncompressed files, the specific offsets don't really matter as long as they 
> are consistent between reads. The problem is that the file offsets will 
> actually need to be propagated to the new reader from the original 
> inputformat. Row counts are easier to use but there's a problem of how to 
> actually map them to missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column to be read has been 
> evicted or is otherwise missing, all the columns have to be read for the 
> corresponding slice in order to cache and read that one column. The vague plan 
> is to handle this implicitly, similarly to how the ORC reader handles CB-RG 
> overlaps - a missing column in the disk range list to retrieve will simply 
> expand the disk-range-to-read into the whole horizontal slice of the file.
> 3) Granularity/etc. won't work for gzipped text. If anything at all is 
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous 
> feature, so this is by design.
> 4) In the future, it would be possible to also build some form of 
> metadata/indexes for this cached data to do PPD, etc. This is out of scope 
> for now.
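Point (2) of the description - expanding a missing column into a read of the 
whole horizontal slice - can be sketched like this. The types here (Range, the 
BitSet-per-slice cache model) are hypothetical illustrations, not the actual 
reader code, which would fold this into its disk-range list:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Sketch: for a row-based source, any requested column missing from cache
 *  forces re-reading the whole horizontal slice, since rows are interleaved
 *  on disk and one column cannot be read alone. */
public class SliceExpansion {
  static final class Range {
    final long start, end;
    Range(long start, long end) { this.start = start; this.end = end; }
    @Override public String toString() { return "[" + start + "," + end + ")"; }
  }

  static BitSet bits(int... idx) {
    BitSet b = new BitSet();
    for (int i : idx) b.set(i);
    return b;
  }

  /** cachedCols[i] = columns present in cache for slice i; neededCols = columns
   *  the query reads. Returns byte ranges (whole slices) to re-read from disk. */
  static List<Range> rangesToRead(long[] sliceOffsets, long fileLen,
                                  BitSet[] cachedCols, BitSet neededCols) {
    List<Range> toRead = new ArrayList<>();
    for (int i = 0; i < sliceOffsets.length; ++i) {
      BitSet missing = (BitSet) neededCols.clone();
      missing.andNot(cachedCols[i]);
      if (!missing.isEmpty()) { // any missing column => read the whole slice
        long end = (i + 1 < sliceOffsets.length) ? sliceOffsets[i + 1] : fileLen;
        toRead.add(new Range(sliceOffsets[i], end));
      }
    }
    return toRead;
  }

  public static void main(String[] args) {
    long[] offsets = {0, 1000, 2000};
    // slice 1 lost column 1 to eviction; slices 0 and 2 are fully cached
    BitSet[] cached = {bits(0, 1, 2), bits(0, 2), bits(0, 1, 2)};
    System.out.println(rangesToRead(offsets, 3000, cached, bits(0, 1)));
    // prints [[1000,2000)]
  }
}
```

For gzipped text (point 3), the "slice" would degenerate to the entire file, 
which matches the description: evict anything, re-read everything.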



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)