[
https://issues.apache.org/jira/browse/HIVE-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin updated HIVE-15147:
------------------------------------
Description:
The primary goal for the first pass is caching text formats. In principle,
nothing would prevent other formats from using the same path, although, as was
originally done for ORC, it may be better to have native caching support
optimized for each particular format.
Given that caching raw text is not smart, and we already have an ORC-encoded
cache that is columnar thanks to the ORC file structure, we will transform the
data into columnar ORC before caching it.
The general idea is to treat all the data in the world as merely ORC that was
compressed with some poor compression codec, such as CSV. Using the original
InputFormat and SerDe, as well as an ORC writer (potentially with some
heavyweight optimizations disabled), we can "uncompress" the CSV/whatever data
into its "original" ORC representation, then cache it efficiently, by column,
and also reuse a lot of the existing code.
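As a rough, self-contained illustration of this "uncompress into ORC" idea
(not the actual LLAP code path - the schema, file names, and the trivial line
parsing below are made up, and the real implementation would go through the
table's InputFormat and SerDe and write into the cache rather than to a file),
the public ORC writer API can already perform the conversion:
{code:java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

/** Hypothetical sketch: "uncompress" CSV rows into a columnar ORC file. */
public class CsvToOrcSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Assumed schema; in reality it would come from the table metadata.
    TypeDescription schema =
        TypeDescription.fromString("struct<id:bigint,name:string>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/slice.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector idCol = (LongColumnVector) batch.cols[0];
    BytesColumnVector nameCol = (BytesColumnVector) batch.cols[1];

    try (BufferedReader reader =
        new BufferedReader(new FileReader("/tmp/input.csv"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Trivial stand-in for the SerDe: decode the "poorly compressed" row.
        String[] fields = line.split(",");
        int row = batch.size++;
        idCol.vector[row] = Long.parseLong(fields[0]);
        byte[] name = fields[1].getBytes(StandardCharsets.UTF_8);
        nameCol.setRef(row, name, 0, name.length);
        if (batch.size == batch.getMaxSize()) {
          writer.addRowBatch(batch);
          batch.reset();
        }
      }
    }
    if (batch.size != 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}
{code}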
Various other points:
1) Caching granularity will have to be determined somehow (i.e., how do we
slice the file horizontally, to avoid caching entire columns). As with
uncompressed ORC files, the specific offsets don't really matter as long as
they are consistent between reads. The problem is that the file offsets will
actually need to be propagated to the new reader from the original InputFormat.
Row counts are easier to use, but then there's the problem of how to actually
map them to missing ranges to read from disk.
2) Obviously, for row-based formats, if any one column that is to be read has
been evicted or is otherwise missing, "all the columns" have to be read for the
corresponding slice in order to cache and read that one column. The vague plan
is to handle this implicitly, similarly to how the ORC reader handles CB-RG
(compression buffer vs. row group) overlaps: a missing column in the list of
disk ranges to retrieve will simply expand the disk range to read into the
whole horizontal slice of the file (see the sketch after this list, which
covers both this and the slicing in point 1).
3) Granularity/slicing won't work for gzipped text, since gzipped files cannot
be split; if anything at all is evicted, the entire file has to be re-read.
Gzipped text is a ridiculous feature, so this is by design.
4) In the future, it would also be possible to build some form of
metadata/indexes for this cached data to do PPD, etc. This is out of scope for
now.
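A toy model of points 1 and 2 (the types and names below are invented for
illustration, not the actual LLAP cache API): the file is split into horizontal
slices at fixed byte offsets, the cache tracks which columns of each slice are
present, and any slice missing a requested column has its entire byte range put
back on the disk-read list, mirroring how the ORC reader expands ranges on
CB-RG overlaps:
{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of planning disk reads for a row-oriented cached file. */
public class SliceCacheSketch {

  /** One horizontal slice of the file; offsets must be stable across reads. */
  static class Slice {
    final long startOffset; // inclusive file offset where the slice begins
    final long endOffset;   // exclusive file offset where the slice ends
    final Set<Integer> cachedColumns = new HashSet<>(); // column ids in cache

    Slice(long startOffset, long endOffset) {
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /**
   * Returns the byte ranges that must be read from disk for the requested
   * columns. Because the underlying format is row-based, one missing column
   * forces re-reading the whole slice, not just that column.
   */
  static List<long[]> planDiskReads(List<Slice> slices,
      Set<Integer> requestedColumns) {
    List<long[]> toRead = new ArrayList<>();
    for (Slice slice : slices) {
      if (!slice.cachedColumns.containsAll(requestedColumns)) {
        // Expand the read to the entire horizontal slice of the file.
        toRead.add(new long[] { slice.startOffset, slice.endOffset });
      }
    }
    return toRead;
  }
}
{code}
Adjacent ranges returned this way could then be merged before issuing the
actual reads, much as the existing ORC disk-range planning already does.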
> LLAP: use LLAP cache for non-columnar formats in a somewhat general way
> -----------------------------------------------------------------------
>
> Key: HIVE-15147
> URL: https://issues.apache.org/jira/browse/HIVE-15147
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)