[
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823555#comment-15823555
]
Gopal V edited comment on HIVE-15147 at 1/16/17 7:12 AM:
---------------------------------------------------------
LGTM - +1, tests pending.
The cache hit-rate makes a dramatic improvement to performance, but there's a
performance cliff whenever data gets evicted (or on the initial load).
Running TPC-H Q1 on 10 GB of data on 1 node shows ~38x gains between the 1st
and 2nd run.
{code}
1st run: Time taken: 102.598 seconds, Fetched: 1 row(s)
2nd run: Time taken: 2.674 seconds, Fetched: 1 row(s)
{code}
There are further improvements to target in later work:
Most of the time in the 1st run is spent compressing incompressible string
columns, followed by the class-inheritance check (the repnz scan) inside
WriterImpl::addRow(), and then the cache-misses in LazyStruct::parse()
(LazySimpleDeserializeRead should be used there instead).
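To make the addRow() cost concrete, here's a minimal sketch (illustrative types, not Hive's actual WriterImpl) of the per-row interface check vs. hoisting it out of the row loop - HotSpot compiles the secondary-supertype check into the repnz scan visible in perf:
{code}
import java.util.List;

interface Inspector {}

interface StructInspector extends Inspector {
  List<Object> getFields(Object row);
}

class RowWriter {
  private final Inspector inspector;
  private final StructInspector structInspector;  // resolved once

  RowWriter(Inspector inspector) {
    this.inspector = inspector;
    // One subtype check, up front (sketch assumes a struct schema).
    this.structInspector = (StructInspector) inspector;
  }

  // Slow path: instanceof + checkcast re-executed on every addRow() call.
  void addRowSlow(Object row) {
    if (inspector instanceof StructInspector) {
      encode(((StructInspector) inspector).getFields(row));
    }
  }

  // Fast path: the per-row subtype check is gone.
  void addRowFast(Object row) {
    encode(structInspector.getFields(row));
  }

  private void encode(List<Object> fields) {
    // write column values; omitted
  }
}
{code}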
!perf-top-cache.png!
!writerimpl-addrow.png!
I'm also attaching the [^pre-cache.svg] call-tree with weights.
> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
> Issue Type: New Feature
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HIVE-15147.01.patch, HIVE-15147.patch,
> HIVE-15147.WIP.noout.patch
>
>
> The primary goal for the first pass is caching text files. In principle,
> nothing would prevent other formats from using the same path, although, as was
> originally done with ORC, it may be better to have native caching support
> optimized for each particular format.
> Given that caching pure text is not smart, and we already have an ORC-encoded
> cache that is columnar due to the ORC file structure, we will transform the
> data into columnar ORC.
> The general idea is to treat all the data in the world as merely ORC that was
> compressed with some poor compression codec, such as CSV. Using the original
> InputFormat and SerDe, as well as an ORC writer (potentially with some
> heavyweight optimizations disabled), we can "uncompress" the CSV/whatever data
> into its "original" ORC representation, then cache it efficiently, by column,
> and also reuse a lot of the existing code.
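> As a rough sketch of that flow (the column names, delimiter wiring, and output
> path below are illustrative, not the patch's actual code), the "uncompression"
> step amounts to decoding rows with the table's SerDe and re-encoding them
> through the ORC writer:
> {code}
> import java.util.Properties;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
> import org.apache.hadoop.hive.ql.io.orc.Writer;
> import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;
> import org.apache.hadoop.io.Text;
>
> public class TextToOrcSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>
>     // The "poor compression codec": delimited text, decoded lazily.
>     Properties tbl = new Properties();
>     tbl.setProperty("columns", "l_quantity,l_returnflag");
>     tbl.setProperty("columns.types", "double,string");
>     tbl.setProperty("field.delim", ",");
>     LazySimpleSerDe serDe = new LazySimpleSerDe();
>     serDe.initialize(conf, tbl);
>
>     // "Uncompress" into the original ORC representation, by column.
>     Writer writer = OrcFile.createWriter(new Path("/tmp/slice.orc"),
>         OrcFile.writerOptions(conf).inspector(serDe.getObjectInspector()));
>     for (String line : new String[] {"17.0,N", "36.0,R"}) {
>       writer.addRow(serDe.deserialize(new Text(line)));
>     }
>     writer.close();
>   }
> }
> {code}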
> Various other points:
> 1) Caching granularity will have to be determined somehow (i.e., how do we
> slice the file horizontally, to avoid caching entire columns), as sketched
> after this list. As with uncompressed ORC files, the specific offsets don't
> really matter as long as they are consistent between reads. The problem is
> that the file offsets will actually need to be propagated to the new reader
> from the original InputFormat. Row counts are easier to use, but then there's
> the problem of how to map them to the missing ranges to read from disk.
> 2) Obviously, for row-based formats, if any one column that is to be read has
> been evicted or is otherwise missing, all the columns have to be read for the
> corresponding slice in order to cache and read that one column. The vague plan
> is to handle this implicitly, similarly to how the ORC reader handles CB-RG
> overlaps - a missing column in the disk-range list to retrieve will simply
> expand the disk range to read into the whole horizontal slice of the file (see
> the sketch after this list).
> 3) Granularity/etc. won't work for gzipped text. If anything at all is
> evicted, the entire file has to be re-read. Gzipped text is a ridiculous
> feature, so this is by design.
> 4) In the future, it would also be possible to build some form of
> metadata/indexes for this cached data to do PPD, etc. This is out of scope for
> now.
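> A hypothetical sketch of how 1) and 2) could fit together (all names here are
> made up, not the patch's code): a cache keyed by horizontal slice, where a
> miss on any requested column falls back to re-reading the whole slice:
> {code}
> import java.util.BitSet;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Objects;
>
> // Hypothetical cache key: (file, horizontal slice). The offsets just
> // need to be consistent between reads, per point 1).
> class SliceKey {
>   final String file;
>   final long startOffset, endOffset;
>
>   SliceKey(String file, long startOffset, long endOffset) {
>     this.file = file;
>     this.startOffset = startOffset;
>     this.endOffset = endOffset;
>   }
>
>   @Override public boolean equals(Object o) {
>     if (!(o instanceof SliceKey)) return false;
>     SliceKey k = (SliceKey) o;
>     return file.equals(k.file)
>         && startOffset == k.startOffset && endOffset == k.endOffset;
>   }
>
>   @Override public int hashCode() {
>     return Objects.hash(file, startOffset, endOffset);
>   }
> }
>
> class TextSliceCache {
>   // Per slice: column id -> cached (ORC-encoded) column data.
>   private final Map<SliceKey, Map<Integer, byte[]>> cache = new HashMap<>();
>
>   // Returns null when any requested column is missing: for a row-based
>   // format that means re-reading the entire slice, per point 2).
>   Map<Integer, byte[]> get(SliceKey key, BitSet columns) {
>     Map<Integer, byte[]> slice = cache.get(key);
>     if (slice == null) return null;
>     for (int c = columns.nextSetBit(0); c >= 0; c = columns.nextSetBit(c + 1)) {
>       if (!slice.containsKey(c)) {
>         return null;  // expand the read to the whole horizontal slice
>       }
>     }
>     return slice;
>   }
>
>   void put(SliceKey key, Map<Integer, byte[]> columns) {
>     cache.put(key, columns);
>   }
> }
> {code}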