[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

Prasanth J (JIRA) Tue, 28 Jan 2014 11:51:51 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Prasanth J updated HIVE-6287:
-----------------------------

    Attachment: HIVE-6287.3.patch

Patch number should be .3. Reuploading it.

> batchSize computation in Vectorized ORC reader can cause 
> BufferUnderFlowException when PPD is enabled
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6287
>                 URL: https://issues.apache.org/jira/browse/HIVE-6287
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, vectorization
>         Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, 
> HIVE-6287.WIP.patch
>
>
> The nextBatch() method, which computes the batchSize, is only aware of 
> stripe boundaries. This does not work when predicate pushdown (PPD) in ORC 
> is enabled, because PPD works at the row group level (a stripe contains 
> multiple row groups). By default, the row group stride is 10000 rows. When 
> PPD is enabled, some row groups may be eliminated. After row group 
> elimination, disk ranges are computed from the selected row groups. If the 
> batchSize computation is not aware of this, it reads beyond the disk range 
> and throws a BufferUnderFlowException. The following scenario illustrates this:
> {code}
> |--------------------------------- STRIPE 1 ------------------------------------|
> |-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
>                 |--------- diskrange 1 ---------|               |- diskrange 2 -|
>                                                 ^
>                                              (marker)
> {code}
> diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Because 
> nextBatch() was not aware of row groups, and hence of the disk ranges, it 
> tries to read 1024 values for the last batch of diskrange 1, where only 
> 20000 % 1024 = 544 values remain. This results in a BufferUnderFlowException.
> To fix this, a marker is placed at the end of each range and batchSize is 
> computed accordingly.
> {code}
> batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));
> {code}
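
The marker-based computation described above can be sketched as a small standalone program. This is an illustration only, not the code from the attached patch: the class name BatchSizeSketch and the helper batchSizes() are hypothetical, DEFAULT_SIZE is hard-coded to 1024 (the batch size quoted in the description) so that no Hive jars are needed, and rowInStripe is assumed to be a zero-based row offset within the stripe, which places diskrange 1 at rows 10000 to 30000.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSizeSketch {
    // Stand-in for VectorizedRowBatch.DEFAULT_SIZE (assumed 1024, as in the description).
    public static final int DEFAULT_SIZE = 1024;

    // Rows to read in the next batch: never more than remain before the marker.
    public static long batchSize(long rowInStripe, long markerPosition) {
        return Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    // Hypothetical helper: the sequence of batch sizes read from one disk range,
    // where markerPosition is the row offset at which the range ends.
    public static List<Long> batchSizes(long rangeStart, long markerPosition) {
        List<Long> sizes = new ArrayList<>();
        long rowInStripe = rangeStart;
        while (rowInStripe < markerPosition) {
            long n = batchSize(rowInStripe, markerPosition);
            sizes.add(n);
            rowInStripe += n;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // diskrange 1 covers row groups 2 and 3: rows [10000, 30000) of the stripe.
        List<Long> sizes = batchSizes(10000, 30000);
        // Prints "20 batches, last = 544": 19 full batches of 1024, then the remainder.
        System.out.println(sizes.size() + " batches, last = " + sizes.get(sizes.size() - 1));
    }
}
```

Without the marker, the last batch would request 1024 rows although only 544 remain in diskrange 1, which is the over-read described in the report.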



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
