[
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823555#comment-15823555
]
Gopal V edited comment on HIVE-15147 at 1/16/17 7:12 AM:
---------------------------------------------------------
LGTM - +1, tests pending.
The cache hit-rate makes a dramatic improvement to performance, but there's a
performance cliff whenever data gets evicted (or on the initial load).
Running TPC-H Q1 on 10 GB of data on 1 node shows ~38x gains between the 1st
and 2nd run.
{code}
1st run: Time taken: 102.598 seconds, Fetched: 1 row(s)
2nd run: Time taken: 2.674 seconds, Fetched: 1 row(s)
{code}
There are further improvements to target in later work:
Most of the time in the 1st run is spent compressing incompressible string
columns, followed by the class-inheritance check (the repnz scan) inside
WriterImpl::addRow(), and then the cache-misses in LazyStruct::parse()
(LazySimpleDeserializeRead should be used there instead).
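To make the addRow() cost concrete, here's a minimal sketch (illustrative types, not Hive's actual WriterImpl) of the per-row interface check vs. hoisting it out of the row loop - HotSpot compiles the secondary-supertype check into the repnz scan visible in perf:
{code}
import java.util.List;

interface Inspector {}

interface StructInspector extends Inspector {
  List<Object> getFields(Object row);
}

class RowWriter {
  private final Inspector inspector;
  private final StructInspector structInspector;  // resolved once

  RowWriter(Inspector inspector) {
    this.inspector = inspector;
    // One subtype check, up front (sketch assumes a struct schema).
    this.structInspector = (StructInspector) inspector;
  }

  // Slow path: instanceof + checkcast re-executed on every addRow() call.
  void addRowSlow(Object row) {
    if (inspector instanceof StructInspector) {
      encode(((StructInspector) inspector).getFields(row));
    }
  }

  // Fast path: the per-row subtype check is gone.
  void addRowFast(Object row) {
    encode(structInspector.getFields(row));
  }

  private void encode(List<Object> fields) {
    // write column values; omitted
  }
}
{code}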
!perf-top-cache.png!
!writerimpl-addrow.png!
I'm also attaching the [^pre-cache.svg] call-tree with weights.
> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
> Issue Type: New Feature
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch,
> HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. In principle,
> nothing would prevent other formats from using the same path, although, as was
> originally done with ORC, it may be better to have native caching support
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have an ORC-encoded
> cache that is columnar due to the ORC file structure, we will transform the
> data into columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was
> compressed with some poor compression codec, such as CSV. Using the original
> InputFormat and SerDe, as well as an ORC writer (potentially with some
> heavyweight optimizations disabled), we can "uncompress" the CSV/whatever data
> into its "original" ORC representation, then cache it efficiently, by column,
> and also reuse a lot of the existing code.
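> As a rough sketch of that flow (the column names, delimiter wiring, and output
> path below are illustrative, not the patch's actual code), the "uncompression"
> step amounts to decoding rows with the table's SerDe and re-encoding them
> through the ORC writer:
> {code}
> import java.util.Properties;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
> import org.apache.hadoop.hive.ql.io.orc.Writer;
> import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;
> import org.apache.hadoop.io.Text;
>
> public class TextToOrcSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>
>     // The "poor compression codec": delimited text, decoded lazily.
>     Properties tbl = new Properties();
>     tbl.setProperty("columns", "l_quantity,l_returnflag");
>     tbl.setProperty("columns.types", "double,string");
>     tbl.setProperty("field.delim", ",");
>     LazySimpleSerDe serDe = new LazySimpleSerDe();
>     serDe.initialize(conf, tbl);
>
>     // "Uncompress" into the original ORC representation, by column.
>     Writer writer = OrcFile.createWriter(new Path("/tmp/slice.orc"),
>         OrcFile.writerOptions(conf).inspector(serDe.getObjectInspector()));
>     for (String line : new String[] {"17.0,N", "36.0,R"}) {
>       writer.addRow(serDe.deserialize(new Text(line)));
>     }
>     writer.close();
>   }
> }
> {code}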
> Various other points:
> 1) Caching granularity will have to be determined somehow (i.e., how do we
> slice the file horizontally, to avoid caching entire columns), as sketched
> after this list. As with uncompressed ORC files, the specific offsets don't
> really matter as long as they are consistent between reads. The problem is
> that the file offsets will actually need to be propagated to the new reader
> from the original InputFormat. Row counts are easier to use, but then there's
> the problem of how to map them to the missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has
> been evicted or is otherwise missing, all the columns have to be read for the
> corresponding slice in order to cache and read that one column. The vague plan
> is to handle this implicitly, similarly to how the ORC reader handles CB-RG
> overlaps - a missing column in the disk-range list to retrieve will simply
> expand the disk range to read into the whole horizontal slice of the file (see
> the sketch after this list).
> 3) Granularity/etc. won't work for gzipped text. If anything at all is
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous
> feature, so this is by design.
> 4) In the future, it would also be possible to build some form of
> metadata/indexes for this cached data to do PPD, etc. This is out of scope for
> now.
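> A hypothetical sketch of how 1) and 2) could fit together (all names here are
> made up, not the patch's code): a cache keyed by horizontal slice, where a
> miss on any requested column falls back to re-reading the whole slice:
> {code}
> import java.util.BitSet;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Objects;
>
> // Hypothetical cache key: (file, horizontal slice). The offsets just
> // need to be consistent between reads, per point 1).
> class SliceKey {
>   final String file;
>   final long startOffset, endOffset;
>
>   SliceKey(String file, long startOffset, long endOffset) {
>     this.file = file;
>     this.startOffset = startOffset;
>     this.endOffset = endOffset;
>   }
>
>   @Override public boolean equals(Object o) {
>     if (!(o instanceof SliceKey)) return false;
>     SliceKey k = (SliceKey) o;
>     return file.equals(k.file)
>         && startOffset == k.startOffset && endOffset == k.endOffset;
>   }
>
>   @Override public int hashCode() {
>     return Objects.hash(file, startOffset, endOffset);
>   }
> }
>
> class TextSliceCache {
>   // Per slice: column id -> cached (ORC-encoded) column data.
>   private final Map<SliceKey, Map<Integer, byte[]>> cache = new HashMap<>();
>
>   // Returns null when any requested column is missing: for a row-based
>   // format that means re-reading the entire slice, per point 2).
>   Map<Integer, byte[]> get(SliceKey key, BitSet columns) {
>     Map<Integer, byte[]> slice = cache.get(key);
>     if (slice == null) return null;
>     for (int c = columns.nextSetBit(0); c >= 0; c = columns.nextSetBit(c + 1)) {
>       if (!slice.containsKey(c)) {
>         return null;  // expand the read to the whole horizontal slice
>       }
>     }
>     return slice;
>   }
>
>   void put(SliceKey key, Map<Integer, byte[]> columns) {
>     cache.put(key, columns);
>   }
> }
> {code}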