[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Description:

The nextBatch() method that computes batchSize is aware only of stripe boundaries. This does not work when predicate pushdown (PPD) in ORC is enabled, because PPD works at the row-group level (a stripe contains multiple row groups). By default, the row group stride is 1. When PPD is enabled, some row groups may be eliminated. After row group elimination, disk ranges are computed based on the selected row groups. If the batchSize computation is not aware of this, it reads beyond the disk range, which leads to BufferUnderflowException. The following scenario illustrates it:

{code}
|----------------------------------- STRIPE 1 ----------------------------------|
|-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
|- diskrange 1 -|               |- diskrange 2 -|
                ^ (marker)
{code}

diskrange 1 will have 2592 rows and diskrange 2 will have 1 row. Since nextBatch() was not aware of row groups, and hence of the disk ranges, it tries to read 1024 values past the end of diskrange 1 when it should read only 2592 % 1024 = 544 values. This results in BufferUnderflowException. To fix this, a marker is placed at the end of each range and batchSize is computed accordingly:

{code}
batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));
{code}

The stack trace looks like:

{code}
Caused by: java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:492)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:207)
        at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readFloat(SerializationUtils.java:70)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$FloatTreeReader.nextVector(RecordReaderImpl.java:673)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.nextVector(RecordReaderImpl.java:1615)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextBatch(RecordReaderImpl.java:2883)
        at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.next(VectorizedOrcInputFormat.java:94)
        ... 15 more
{code}

batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
-----------------------------------------------------------------------------------------------------

                 Key: HIVE-6287
                 URL: https://issues.apache.org/jira/browse/HIVE-6287
             Project: Hive
          Issue Type: Bug
          Components: Vectorization
    Affects Versions: 0.13.0
            Reporter: Prasanth J
            Assignee: Prasanth J
              Labels: orcfile, vectorization
             Fix For: 0.13.0
         Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, HIVE-6287.3.patch, HIVE-6287.4.patch, HIVE-6287.WIP.patch
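The marker-aware batch sizing described above can be sketched in a few lines. This is a minimal, self-contained simulation, not the actual RecordReaderImpl code; it assumes a disk range of 2592 rows (consistent with the 544-value remainder in the description) and mirrors the DEFAULT_SIZE, markerPosition, and rowInStripe names from the fix snippet.

```java
// Sketch of marker-aware batch sizing in the vectorized ORC reader.
// Hypothetical illustration; not the Hive implementation.
public class BatchSizeSketch {
    static final int DEFAULT_SIZE = 1024;   // VectorizedRowBatch.DEFAULT_SIZE

    // Rows to read in the next batch, never crossing the range marker.
    static int nextBatchSize(long rowInStripe, long markerPosition) {
        return (int) Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    public static void main(String[] args) {
        long rowInStripe = 0;
        long markerPosition = 2592;         // assumed end of diskrange 1
        while (rowInStripe < markerPosition) {
            int batchSize = nextBatchSize(rowInStripe, markerPosition);
            System.out.println(batchSize);  // prints 1024, 1024, 544
            rowInStripe += batchSize;
        }
    }
}
```

Without the marker, the last iteration would request a full 1024 values and read past the end of the disk range, which is exactly the BufferUnderflowException scenario.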
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated HIVE-6287:
-----------------------------------
    Fix Version/s: 0.13.0

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated HIVE-6287:
-------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks [~prasanth_j]!
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.4.patch

The test failure was related to rounding issues in sum(), which produces unpredictable results on different OSes. Fixed the failure by typecasting to int in this patch.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.3.patch

Reuploading the patch for precommit tests.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.2.patch

Addressed [~gopalv]'s review comment.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.3.patch

The patch number should be .3. Reuploading it.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: (was: HIVE-6287.2.patch)
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Hanson updated HIVE-6287:
------------------------------
    Description: (minor edit: spelled out "predicate pushdown (PPD)"; otherwise identical to the description above)
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.2.patch

Reuploading the same patch for HIVE QA to pick up. Thanks [~ehans] for the update to the description.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.WIP.patch

Uploading WIP patch. Q file tests are yet to be added.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Status: Patch Available  (was: Open)

Making it patch available for HIVE QA to pick up.
[jira] [Updated] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J updated HIVE-6287:
-----------------------------
    Attachment: HIVE-6287.1.patch

Added q file tests.