[ https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078464#comment-17078464 ]
Ivan Dyptan commented on ORC-435:
---------------------------------
{code:java}
java -Xmx8g -jar /tmp/orc-tools-1.5.10-SNAPSHOT-uber.jar meta /tmp/largestripe.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file /tmp/largestripe.orc [length: 2415919549]
Structure for /tmp/largestripe.orc
File Version: 0.12 with ORC_135
Rows: 18
Compression: NONE
Calendar: Julian/Gregorian
Type: struct<value:array<string>>
Stripe Statistics:
Stripe 1:
Column 0: count: 18 hasNull: false
Column 1: count: 18 hasNull: false bytesOnDisk: 4
Column 2: count: 36 hasNull: false bytesOnDisk: 2415919277 min: 1 max: 9 sum: 2415919131
File Statistics:
Column 0: count: 18 hasNull: false
Column 1: count: 18 hasNull: false bytesOnDisk: 4
Column 2: count: 36 hasNull: false bytesOnDisk: 2415919277 min: 1 max: 9 sum: 2415919131
Stripes:
Stripe: offset: 3 data: 2415919281 rows: 18 tail: 65 index: 42
Stream: column 0 section ROW_INDEX start: 3 length 8
Stream: column 1 section ROW_INDEX start: 11 length 12
Stream: column 2 section ROW_INDEX start: 23 length 22
Stream: column 1 section LENGTH start: 45 length 4
Stream: column 2 section DATA start: 49 length 2415919131
Stream: column 2 section LENGTH start: 2415919180 length 146
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DIRECT_V2
File length: 2415919549 bytes
Padding length: 0 bytes{code}
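The meta output above shows a single DATA stream of 2,415,919,131 bytes, which is larger than Integer.MAX_VALUE (2,147,483,647). The following sketch (an illustration only, not the actual ORC reader code) shows why reading such a stream into one buffer fails: narrowing the long byte length to an int wraps it negative, and the array allocation throws NegativeArraySizeException.

```java
// Illustration only (not the actual ORC reader code): narrowing a >2GB long
// length to int for a single byte[] allocation wraps the value negative.
public class StripeOverflowDemo {
    public static void main(String[] args) {
        long dataStreamLength = 2415919131L; // DATA stream length from the meta output above
        int narrowed = (int) dataStreamLength; // wraps to -1879048165
        try {
            byte[] buffer = new byte[narrowed]; // array size is negative
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException for size " + narrowed);
        }
    }
}
```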
> Ability to read stripes that are greater than 2GB
> -------------------------------------------------
>
> Key: ORC-435
> URL: https://issues.apache.org/jira/browse/ORC-435
> Project: ORC
> Issue Type: Bug
> Components: Reader
> Affects Versions: 1.3.4, 1.4.4, 1.5.3, 1.6.0
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Priority: Major
> Fix For: 1.5.4, 1.6.0
>
>
> The ORC reader fails with a NegativeArraySizeException if the stripe size is >2GB.
> Even though the default stripe size is 64MB, there are cases where the stripe
> size reaches >2GB before the memory manager can kick in to check memory usage.
> For example, if we are inserting 500KB (mostly unique) strings, by the time we
> reach 5000 rows the stripe size is already over 2GB. In such cases the reader
> will have to chunk the disk range reads instead of reading the stripe as one
> whole blob.
> Exception thrown when reading such files:
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NegativeArraySizeException
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085){code}
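The chunked-read approach described in the issue can be sketched as follows. This is a simplified illustration under my own assumptions, not the actual ORC-435 patch; the class and method names here are hypothetical:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: split one large disk-range read into chunks so that no
// single byte[] allocation exceeds the JVM's array size limit.
public class ChunkedRangeReader {
    // Stay safely below Integer.MAX_VALUE (some JVMs reserve a few header words).
    static final int MAX_CHUNK = Integer.MAX_VALUE - 1024;

    // Number of chunks needed to cover `length` bytes.
    static int chunkCount(long length) {
        return (int) ((length + MAX_CHUNK - 1) / MAX_CHUNK);
    }

    // Size in bytes of chunk `index` when covering `length` bytes.
    static int chunkSize(int index, long length) {
        return (int) Math.min(MAX_CHUNK, length - (long) index * MAX_CHUNK);
    }

    // Read [offset, offset + length) as a sequence of sub-2GB buffers instead
    // of one oversized allocation.
    static byte[][] readRange(RandomAccessFile file, long offset, long length)
            throws IOException {
        byte[][] buffers = new byte[chunkCount(length)][];
        file.seek(offset);
        for (int i = 0; i < buffers.length; i++) {
            buffers[i] = new byte[chunkSize(i, length)];
            file.readFully(buffers[i]);
        }
        return buffers;
    }
}
```

For the 2,415,919,281-byte stripe in the comment above, this splits the read into two chunks rather than attempting one impossible 2.4GB allocation.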
--
This message was sent by Atlassian Jira
(v8.3.4#803005)