[jira] [Resolved] (ORC-744) LazyIO of non-filter columns

Dongjoon Hyun (Jira) Tue, 31 Aug 2021 21:26:07 -0700


     [ 
https://issues.apache.org/jira/browse/ORC-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dongjoon Hyun resolved ORC-744.
-------------------------------
    Fix Version/s: 1.7.0
       Resolution: Fixed

I'm resolving this issue because all subtasks are completed.

> LazyIO of non-filter columns
> ----------------------------
>
>                 Key: ORC-744
>                 URL: https://issues.apache.org/jira/browse/ORC-744
>             Project: ORC
>          Issue Type: Improvement
>          Components: Reader
>    Affects Versions: 1.7.0
>            Reporter: Pavan Lanka
>            Assignee: Pavan Lanka
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 1.7.0
>
>         Attachments: image-2021-01-25-14-34-45-375.png
>
>
> h2. Background
> This feature request grew out of a large search job with the following 
> characteristics:
>  * The search fields are not part of partition, bucket or sort fields.
>  * The table is a very large table.
>  * The predicates result in very few rows compared to the scan size.
>  * The search columns are a significant subset of selection columns in the 
> query.
> Initial analysis showed a significant benefit from reading the non-search 
> columns lazily, only when there is a match. The design and some benchmarks 
> follow in the sections below.
> h2. Design
> This builds on ORC-577, which currently limits deserialization for only some 
> selected data types and does not reduce IO.
> At a high level, the design includes the following components:
> !image-2021-01-25-14-34-45-375.png!
>  * *SArg to Filter*: Converts Search Arguments passed down into filters for 
> efficient application during scans.
>  * *Read*: Performs the lazy read using the filters.
>  ** *Read Filter Columns*: Read the filter columns from the file.
>  ** *Apply Filter*: Apply the filter on the read filter columns.
>  ** *Read Select Columns*: If the filter selects at least one row, read the 
> remaining columns.
>  
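The three read steps above can be sketched in a few lines. This is a minimal illustration in Python, not the ORC reader code: `CountingSource`, `lazy_read`, and the dict-of-lists batch layout are hypothetical names invented for the example, and it assumes at least one filter column.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for one row batch of a columnar file:
# column name -> list of values. None of this is the ORC reader API.
Batch = Dict[str, List[int]]
Row = Dict[str, int]


class CountingSource:
    """Serves columns of one batch and counts how many columns were read."""

    def __init__(self, batch: Batch) -> None:
        self._batch = batch
        self.columns_read = 0

    def read(self, names: List[str]) -> Batch:
        self.columns_read += len(names)
        return {name: self._batch[name] for name in names}


def lazy_read(source: CountingSource,
              filter_cols: List[str],
              select_cols: List[str],
              predicate: Callable[[Row], bool]) -> List[Row]:
    # Step 1: read only the filter columns (assumes filter_cols is non-empty).
    filter_data = source.read(filter_cols)
    n_rows = len(filter_data[filter_cols[0]])

    # Step 2: apply the filter to the filter columns alone.
    selected = [
        i for i in range(n_rows)
        if predicate({c: filter_data[c][i] for c in filter_cols})
    ]

    # Step 3: skip the remaining columns entirely when nothing matched.
    if not selected:
        return []
    remaining = [c for c in select_cols if c not in filter_cols]
    data = {**filter_data, **source.read(remaining)}
    return [{c: data[c][i] for c in select_cols} for i in selected]
```

With a batch of three columns and a filter on one of them, a predicate that matches nothing touches a single column, while a predicate that matches any row touches all three. The real reader works on byte ranges within stripes rather than whole columns, so the counter here is only a rough proxy for IO.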
> This issue has the following tasks that provide further details on the 
> design of the respective components:
>  # ORC-741: Bug fix related to schema evolution of missing columns in the 
> presence of filters
>  # ORC-742: LazyIO of non-filter columns
>  # ORC-743: Conversion of SArg to Filter
>  
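As a rough illustration of the "SArg to Filter" component, the sketch below turns an OR of IN clauses into a function that returns the selected row indices of a batch. It is written in Python with hypothetical helpers (`in_filter`, `or_filter`); the actual conversion in ORC-743 operates on ORC's search arguments and row batches and may be structured quite differently.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical batch layout: column name -> list of values.
Batch = Dict[str, List[int]]
# A filter maps a batch to the sorted indices of the rows it selects.
BatchFilter = Callable[[Batch], List[int]]


def in_filter(column: str, values: Sequence[int]) -> BatchFilter:
    """Builds a filter for `column IN (values)`."""
    # Build the lookup set once, so each row costs one hash probe
    # regardless of how long the IN list is.
    lookup = frozenset(values)

    def apply(batch: Batch) -> List[int]:
        return [i for i, v in enumerate(batch[column]) if v in lookup]

    return apply


def or_filter(children: Sequence[BatchFilter]) -> BatchFilter:
    """Builds a filter selecting rows matched by any child filter."""

    def apply(batch: Batch) -> List[int]:
        selected = set()
        for child in children:
            selected.update(child(batch))
        return sorted(selected)

    return apply


# Usage: rows where a IN (1, 2) OR b IN (9).
row_filter = or_filter([in_filter("a", [1, 2]), in_filter("b", [9])])
matches = row_filter({"a": [1, 5, 2, 7], "b": [0, 9, 0, 0]})
```

Building the filter once and reusing it for every batch keeps per-row work small, which matters when the IN lists are large.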
> h2. Tests
> We evaluated this approach against a search job with the following stats:
>  * Table
>  ** Size: ~*420 TB*
>  ** Data fields: ~*120*
>  ** Partition fields: *3*
>  * Scan
>  ** Search fields: 3 data fields with large (~1000 values) IN clauses 
> combined with *OR*.
>  ** Select fields: 16 data fields (includes the 3 search fields), 1 partition 
> field
>  ** Search:
>  *** Size: ~*180 TB*
>  *** Records: *3.99 T*
>  ** Selected:
>  *** Size: ~*100 MB*
>  *** Records: *1 M*
> We observed the following reductions relative to runs without the 
> patch:
> ||Test||IO Reduction %||CPU Reduction %||
> |Select 16 columns|45|47|
> |SELECT *|70|87|
>  * The savings grow as the number of select columns increases relative to 
> the number of search columns.
>  * When the filter selects most of the data, we observed no significant 
> penalty from issuing 2 IOs instead of a single IO.
>  ** We do incur a penalty from applying the filter to the selected 
> records.



