[jira] [Commented] (ORC-435) Ability to read stripes that are greater than 2GB

ASF GitHub Bot (JIRA) Mon, 19 Nov 2018 15:28:14 -0800


    [ 
https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692422#comment-16692422
 ]


ASF GitHub Bot commented on ORC-435:
------------------------------------

xndai commented on a change in pull request #338: ORC-435: Ability to read 
stripes that are greater than 2GB
URL: https://github.com/apache/orc/pull/338#discussion_r234820639
 
 

 ##########
 File path: java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java
 ##########
 @@ -531,39 +534,51 @@ static DiskRangeList readDiskRanges(FSDataInputStream file,
         range = range.next;
         continue;
       }
-      int len = (int) (range.getEnd() - range.getOffset());
+      boolean hasReplaced = false;
+      long len = range.getEnd() - range.getOffset();
       long off = range.getOffset();
-      if (zcr != null) {
-        file.seek(base + off);
-        boolean hasReplaced = false;
-        while (len > 0) {
-          ByteBuffer partial = zcr.readBuffer(len, false);
-          BufferChunk bc = new BufferChunk(partial, off);
-          if (!hasReplaced) {
-            range.replaceSelfWith(bc);
-            hasReplaced = true;
+      while (len > 0) {
+        BufferChunk bc;
+
+        // Stripe could be too large to read fully into a single buffer and will need to be chunked
 
 Review comment:
   Just curious, under what condition will the reader read a stripe fully? At 
least I don't think there's such behavior on the C++ side.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 


> Ability to read stripes that are greater than 2GB
> -------------------------------------------------
>
>                 Key: ORC-435
>                 URL: https://issues.apache.org/jira/browse/ORC-435
>             Project: ORC
>          Issue Type: Bug
>          Components: Reader
>    Affects Versions: 1.3.4, 1.4.4, 1.6.0, 1.5.3
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>             Fix For: 1.5.4, 1.6.0
>
>
> The ORC reader fails with a NegativeArraySizeException if the stripe size is 
> greater than 2GB. Even though the default stripe size is 64MB, a stripe can 
> exceed 2GB before the memory manager kicks in to check the memory size. 
> For example, when inserting 500KB strings (mostly unique), the stripe is 
> already over 2GB by the time we reach 5000 rows. In such cases the reader 
> has to chunk the disk range reads instead of reading the stripe as one 
> whole blob. 
> Exception thrown when reading such files:
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NegativeArraySizeException
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
>         at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085){code}
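
To illustrate the failure mode described above, here is a minimal, standalone sketch. It is not the ORC implementation: the class name ChunkedRangeSketch, the helpers narrowedLength and splitRange, and the 1 GiB chunk cap are all assumptions made for this example. It shows how narrowing a range length above Integer.MAX_VALUE to int yields a negative value (which is what makes a later array allocation throw NegativeArraySizeException), and how the same range could instead be split into pieces that each fit in an int.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the ORC code base.
final class ChunkedRangeSketch {

    // Mirrors the pattern of the removed line "(int) (range.getEnd() - range.getOffset())":
    // a long length above Integer.MAX_VALUE wraps around when narrowed to int.
    static int narrowedLength(long offset, long end) {
        return (int) (end - offset);
    }

    // Splits [offset, end) into {offset, length} pairs of at most maxChunk bytes each,
    // keeping the running length in a long so it cannot overflow.
    static List<long[]> splitRange(long offset, long end, int maxChunk) {
        if (maxChunk <= 0) {
            throw new IllegalArgumentException("maxChunk must be positive");
        }
        List<long[]> chunks = new ArrayList<>();
        long off = offset;
        long len = end - offset;
        while (len > 0) {
            // Safe narrowing: the minimum is never larger than maxChunk, which is an int.
            int thisLen = (int) Math.min(len, (long) maxChunk);
            chunks.add(new long[] {off, thisLen});
            off += thisLen;
            len -= thisLen;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 5000 rows of 500KB strings, as in the issue description (taking 1KB = 1024 bytes).
        long stripeLength = 5000L * 500L * 1024L;
        System.out.println("long length: " + stripeLength);
        System.out.println("narrowed to int: " + narrowedLength(0L, stripeLength));

        // Assumed chunk cap of 1 GiB, chosen only for this example.
        for (long[] chunk : splitRange(0L, stripeLength, 1 << 30)) {
            System.out.println("offset=" + chunk[0] + " length=" + chunk[1]);
        }
    }
}
```

In this sketch each piece's length is guaranteed to fit in an int, so a per-piece buffer allocation cannot be asked for a negative size. How the actual patch sizes and reads its chunks is determined by the pull request itself, not by this example.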



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
