[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-07-31 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Description: 
The nextBatch() method that computes the batchSize is only aware of stripe 
boundaries. This does not work when predicate pushdown (PPD) in ORC is enabled, 
because PPD works at the row group level (a stripe contains multiple row groups). 
By default, the row group stride is 10000. When PPD is enabled, some row groups 
may get eliminated. After row group elimination, disk ranges are computed based 
on the selected row groups. If the batchSize computation is not aware of this, 
it will lead to a BufferUnderflowException (reading beyond the disk range). The 
following scenario illustrates it more clearly:

{code}
|---------------------------------- STRIPE 1 -----------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
                |--------- diskrange 1 ---------|               |- diskrange 2 -|
                                                ^
                                            (marker)
{code}

diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Since 
nextBatch() was not aware of row groups, and hence of the disk ranges, it tries 
to read 1024 values at the end of diskrange 1, where it should only read 
20000 % 1024 = 544 values. This results in a BufferUnderflowException.

To fix this, a marker is placed at the end of each range and batchSize is 
computed accordingly. {code}batchSize = 
Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));{code}
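
The marker-based computation can be sketched as follows. This is a minimal, self-contained illustration rather than the code from the patch: the class `BatchSizeSketch`, its methods, and the local `DEFAULT_SIZE` constant (standing in for `VectorizedRowBatch.DEFAULT_SIZE`, taken as 1024 per the description) are hypothetical, and row positions are counted from the start of the disk range for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of marker-aware batch sizing; not the actual RecordReaderImpl code. */
final class BatchSizeSketch {
    // Assumed stand-in for VectorizedRowBatch.DEFAULT_SIZE (1024 in the description).
    static final int DEFAULT_SIZE = 1024;

    /** Batch size capped so that a read never goes past the marker. */
    static long nextBatchSize(long rowInStripe, long markerPosition) {
        return Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    /** Sizes of all batches read from rowInStripe up to (not past) markerPosition. */
    static List<Long> batchSizes(long rowInStripe, long markerPosition) {
        List<Long> sizes = new ArrayList<>();
        while (rowInStripe < markerPosition) {
            long size = nextBatchSize(rowInStripe, markerPosition);
            sizes.add(size);
            rowInStripe += size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Disk range 1 from the scenario: 20000 rows, marker at its end.
        List<Long> sizes = batchSizes(0, 20000);
        System.out.println(sizes.size() + " batches, last = " + sizes.get(sizes.size() - 1));
    }
}
```

Run against the 20000-row disk range from the scenario, this yields 19 full batches of 1024 rows followed by a final batch of 544 rows, so no read crosses the marker.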

Stack trace will look like:
{code}
Caused by: java.nio.BufferUnderflowException
    at java.nio.Buffer.nextGetIndex(Buffer.java:492)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135)
    at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:207)
    at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readFloat(SerializationUtils.java:70)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$FloatTreeReader.nextVector(RecordReaderImpl.java:673)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.nextVector(RecordReaderImpl.java:1615)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextBatch(RecordReaderImpl.java:2883)
    at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:94)
    ... 15 more
{code}

  was:
The nextBatch() method that computes the batchSize is only aware of stripe 
boundaries. This does not work when predicate pushdown (PPD) in ORC is enabled, 
because PPD works at the row group level (a stripe contains multiple row groups). 
By default, the row group stride is 10000. When PPD is enabled, some row groups 
may get eliminated. After row group elimination, disk ranges are computed based 
on the selected row groups. If the batchSize computation is not aware of this, 
it will lead to a BufferUnderflowException (reading beyond the disk range). The 
following scenario illustrates it more clearly:

{code}
|---------------------------------- STRIPE 1 -----------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
                |--------- diskrange 1 ---------|               |- diskrange 2 -|
                                                ^
                                            (marker)
{code}

diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Since 
nextBatch() was not aware of row groups, and hence of the disk ranges, it tries 
to read 1024 values at the end of diskrange 1, where it should only read 
20000 % 1024 = 544 values. This results in a BufferUnderflowException.

To fix this, a marker is placed at the end of each range and batchSize is 
computed accordingly. {code}batchSize = 
Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));{code}


 batchSize computation in Vectorized ORC reader can cause 
 BufferUnderFlowException when PPD is enabled
 -

 Key: HIVE-6287
 URL: https://issues.apache.org/jira/browse/HIVE-6287
 Project: Hive
  Issue Type: Bug
  Components: Vectorization
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
  Labels: orcfile, vectorization
 Fix For: 0.13.0

 Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, 
 HIVE-6287.3.patch, HIVE-6287.4.patch, HIVE-6287.WIP.patch


[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-03-23 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated HIVE-6287:
---

Fix Version/s: 0.13.0

--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-31 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated HIVE-6287:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks [~prasanth_j]!



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-30 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.4.patch

The test failure was related to rounding issues in sum(), which produces 
unpredictable results on different OSes. Fixed the failure by typecasting to 
int in this patch.
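
As a rough illustration of why a floating-point sum can be unstable in a test, the sketch below adds the same values in two different orders. The thread does not state the actual cause of the differing results, so summation order is only an assumed example here, and the class `SumRoundingSketch` is hypothetical.

```java
/** Hypothetical sketch: floating-point sums depend on the order of addition. */
final class SumRoundingSketch {
    /** Adds the values left to right, starting from 0.0. */
    static double sum(double... values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double a = sum(0.1, 0.2, 0.3);
        double b = sum(0.3, 0.2, 0.1);
        // The two doubles differ in their last bits.
        System.out.println("equal as double: " + (a == b));
        // Casting to int discards the fractional part, so the comparison is stable.
        System.out.println("equal as int:    " + ((int) a == (int) b));
    }
}
```

Comparing the truncated integer value sidesteps the last-bit differences, at the cost of ignoring the fractional part of the sum.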



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-29 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.3.patch

Reuploading patch again for precommit tests.



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-28 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.2.patch

Addressed [~gopalv]'s review comment. 



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-28 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.3.patch

Patch number should be .3. Reuploading it.



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-28 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: (was: HIVE-6287.2.patch)



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-24 Thread Eric Hanson (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Hanson updated HIVE-6287:
--

Description: 
The nextBatch() method that computes the batchSize is only aware of stripe 
boundaries. This does not work when predicate pushdown (PPD) in ORC is enabled, 
because PPD works at the row group level (a stripe contains multiple row groups). 
By default, the row group stride is 10000. When PPD is enabled, some row groups 
may get eliminated. After row group elimination, disk ranges are computed based 
on the selected row groups. If the batchSize computation is not aware of this, 
it will lead to a BufferUnderflowException (reading beyond the disk range). The 
following scenario illustrates it more clearly:

{code}
|---------------------------------- STRIPE 1 -----------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
                |--------- diskrange 1 ---------|               |- diskrange 2 -|
                                                ^
                                            (marker)
{code}

diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Since 
nextBatch() was not aware of row groups, and hence of the disk ranges, it tries 
to read 1024 values at the end of diskrange 1, where it should only read 
20000 % 1024 = 544 values. This results in a BufferUnderflowException.

To fix this, a marker is placed at the end of each range and batchSize is 
computed accordingly. {code}batchSize = 
Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));{code}

  was:
The nextBatch() method that computes the batchSize is only aware of stripe 
boundaries. This does not work when PPD in ORC is enabled, because PPD works at 
the row group level (a stripe contains multiple row groups). By default, the row 
group stride is 10000. When PPD is enabled, some row groups may get eliminated. 
After row group elimination, disk ranges are computed based on the selected row 
groups. If the batchSize computation is not aware of this, it will lead to a 
BufferUnderflowException (reading beyond the disk range). The following scenario 
illustrates it more clearly:

{code}
|---------------------------------- STRIPE 1 -----------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
                |--------- diskrange 1 ---------|               |- diskrange 2 -|
                                                ^
                                            (marker)
{code}

diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Since 
nextBatch() was not aware of row groups, and hence of the disk ranges, it tries 
to read 1024 values at the end of diskrange 1, where it should only read 
20000 % 1024 = 544 values. This results in a BufferUnderflowException.

To fix this, a marker is placed at the end of each range and batchSize is 
computed accordingly. {code}batchSize = 
Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));{code}



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-24 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.2.patch

Reuploading the same patch for HIVE QA to pick up.

Thanks [~ehans] for the update to description.



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-23 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.WIP.patch

Uploading WIP patch. Q file tests are yet to be added.



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-23 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Status: Patch Available  (was: Open)

Making it patch available for HIVE QA to pickup.



[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-23 Thread Prasanth J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6287:
-

Attachment: HIVE-6287.1.patch

Added q file tests.
