[
https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078445#comment-17078445
]
Ivan Dyptan commented on ORC-435:
---------------------------------
The file was generated with Spark 2.4.4 (which includes ORC 1.5.5) in the
following manner:
{code:java}
#!/bin/bash
MBFILE=/tmp/1mb.txt
OUTFILE=/tmp/2gbplus.txt
truncate --size 0 "$MBFILE"
truncate --size 0 "$OUTFILE"
# Build a 1 MB file: 16384 repetitions of a 64-character string.
for i in $(seq 1 16384); do
  echo -n "1234567890123456789012345678901234567890123456789012345678901234" >> "$MBFILE"
done
# Append 18 groups, each a row id and a comma followed by 128 copies of the 1 MB file.
for r in $(seq 1 18); do
  echo -n "$r," >> "$OUTFILE"
  for i in $(seq 1 128); do
    cat "$MBFILE" >> "$OUTFILE"
  done
done
{code}
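As a sanity check on the sizes involved, the arithmetic behind the script can be sketched in Java. This is not part of the original reproduction: the class and method names are hypothetical, and it only computes the raw size of the text input, not the size of the resulting ORC stripe. The last line shows how a length above Integer.MAX_VALUE turns negative when narrowed to an int. That fits the NegativeArraySizeException in the issue, though the exact cause inside the reader is an assumption here.

```java
public class StripeSizeCheck {
    // Size of the 1 MB helper file: 16384 repetitions of a 64-character string.
    static long mbFileBytes() {
        return 16384L * 64L;
    }

    // Size of the output file: 18 groups, each an "r," prefix
    // followed by 128 copies of the 1 MB helper file.
    static long outFileBytes() {
        long total = 0;
        for (int r = 1; r <= 18; r++) {
            total += String.valueOf(r).length() + 1; // digits of r plus the comma
            total += 128L * mbFileBytes();
        }
        return total;
    }

    public static void main(String[] args) {
        long total = outFileBytes();
        System.out.println("helper file bytes: " + mbFileBytes());
        System.out.println("output file bytes: " + total);
        System.out.println("exceeds Integer.MAX_VALUE: " + (total > Integer.MAX_VALUE));
        // Narrowing a long above Integer.MAX_VALUE to int wraps around to a negative value.
        System.out.println("narrowed to int: " + (int) total);
    }
}
```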
The code from Spark:
{code:java}
val inputFile = sc.textFile("/tmp/2gbplus.txt")
val fileData = inputFile.map(x => x.split(","))
val dataSet = fileData.toDS()
dataSet.coalesce(1).write.option("hive.exec.orc.dictionary.key.size.threshold", 0.0).option("orc.compress", "NONE").format("orc").save("/tmp/largestripe.orc")
{code}
> Ability to read stripes that are greater than 2GB
> -------------------------------------------------
>
> Key: ORC-435
> URL: https://issues.apache.org/jira/browse/ORC-435
> Project: ORC
> Issue Type: Bug
> Components: Reader
> Affects Versions: 1.3.4, 1.4.4, 1.5.3, 1.6.0
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Priority: Major
> Fix For: 1.5.4, 1.6.0
>
>
> The ORC reader fails with a NegativeArraySizeException if the stripe size is
> greater than 2GB. Even though the default stripe size is 64MB, a stripe can
> exceed 2GB before the memory manager kicks in to check memory size. For
> example, when inserting 500KB strings (mostly unique), the stripe size is
> already over 2GB by the time we reach 5000 rows. In such cases the reader has
> to chunk the disk range reads instead of reading the stripe as a whole blob.
> Exception thrown when reading such files:
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NegativeArraySizeException
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085)
> {code}
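The issue description says the reader has to chunk disk range reads for stripes over 2GB. A minimal Java sketch of that idea follows. It is not the actual ORC fix: RangeChunker, Chunk, and split are hypothetical names, and the 512 MB cap is an arbitrary choice for illustration. It splits a byte range whose length is a long into pieces that each fit in an int, so no single buffer allocation needs a size above Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeChunker {
    /** Hypothetical value type: the byte range [offset, offset + length). */
    static final class Chunk {
        final long offset;
        final int length;

        Chunk(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits [offset, offset + length) into consecutive chunks of at most maxChunk bytes. */
    static List<Chunk> split(long offset, long length, int maxChunk) {
        if (length < 0 || maxChunk <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and maxChunk > 0");
        }
        List<Chunk> chunks = new ArrayList<>();
        long pos = offset;
        long remaining = length;
        while (remaining > 0) {
            // The minimum is bounded by maxChunk, so the cast to int cannot overflow.
            int len = (int) Math.min(remaining, (long) maxChunk);
            chunks.add(new Chunk(pos, len));
            pos += len;
            remaining -= len;
        }
        return chunks;
    }

    public static void main(String[] args) {
        long stripeLength = 18L * 128L * 1024L * 1024L; // larger than Integer.MAX_VALUE
        for (Chunk c : split(0L, stripeLength, 512 * 1024 * 1024)) {
            System.out.println("offset=" + c.offset + " length=" + c.length);
        }
    }
}
```

Each chunk can then be read into its own buffer, which keeps every allocation size within the int range.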