steveloughran commented on code in PR #1276:
URL: https://github.com/apache/orc/pull/1276#discussion_r1199185617


##########
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java:
##########
@@ -488,6 +504,65 @@ private static BufferChunk findSingleRead(BufferChunk 
first, long minSeekSize) {
     return last;
   }
 
+  /**
+   * Convert from DiskRangeList to FileRange
+   * @param range
+   * @return list of fileRange
+   */
+  static List<FileRange> getFileRangeFrom(BufferChunkList range) {
+    List<FileRange> fRange = new ArrayList<>();
+    BufferChunk cr = range.get();
+    while (cr.next != null) {
+      long len = cr.getLength();
+      long off = cr.getOffset();
+      FileRange currentRange = FileRange.createFileRange(off, (int) len);
+      fRange.add(currentRange);
+      cr = (BufferChunk) cr.next;
+    }
+    return fRange;
+  }
+
+  /**
+   * Read the list of ranges from the file.
+   * @param range the disk ranges within the stripe to read
+   * @return the bytes read for each disk range, which is the same length as
+   *    ranges
+   * @throws IOException
+   */
+  static void readDiskRangesVectored(FSDataInputStream fileInputStream,
+      BufferChunkList range, boolean doForceDirect) throws IOException {
+    if (range == null)
+      return;
+    BufferChunkList rootRange = range;
+    //Convert DiskRange to FileRange here
+    List<FileRange> fRanges = getFileRangeFrom(range);
+
+    try {
+      IntFunction<ByteBuffer> allocate = doForceDirect ? 
ByteBuffer::allocateDirect : ByteBuffer::allocate;
+
+      fileInputStream.readVectored(fRanges, allocate);
+
+      int index = 0;
+      range = rootRange;
+      BufferChunk current = range.get();
+      while (current != null) {
+        if (current.hasData()) {
+          current = (BufferChunk) current.next;
+          index ++;
+        }
+        ByteBuffer data = fRanges.get(index).getData().get();
+        current.setChunk(data);
+        //replace the range data with the buffer chunk returned after reading
+        current = (BufferChunk) current.next;
+      }
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+      throw new IOException(e);
+    } catch (ExecutionException e) {

Review Comment:
   org.apache.hadoop.util.functional.FutureIO has some methods to help extract 
useful IOEs from ExecutionExceptions — see unwrapInnerException().



##########
java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java:
##########
@@ -488,6 +504,65 @@ private static BufferChunk findSingleRead(BufferChunk 
first, long minSeekSize) {
     return last;
   }
 
+  /**
+   * Convert from DiskRangeList to FileRange
+   * @param range
+   * @return list of fileRange
+   */
+  static List<FileRange> getFileRangeFrom(BufferChunkList range) {
+    List<FileRange> fRange = new ArrayList<>();
+    BufferChunk cr = range.get();
+    while (cr.next != null) {
+      long len = cr.getLength();
+      long off = cr.getOffset();
+      FileRange currentRange = FileRange.createFileRange(off, (int) len);
+      fRange.add(currentRange);
+      cr = (BufferChunk) cr.next;
+    }
+    return fRange;
+  }
+
+  /**
+   * Read the list of ranges from the file.
+   * @param range the disk ranges within the stripe to read
+   * @return the bytes read for each disk range, which is the same length as
+   *    ranges
+   * @throws IOException
+   */
+  static void readDiskRangesVectored(FSDataInputStream fileInputStream,
+      BufferChunkList range, boolean doForceDirect) throws IOException {
+    if (range == null)
+      return;
+    BufferChunkList rootRange = range;
+    //Convert DiskRange to FileRange here
+    List<FileRange> fRanges = getFileRangeFrom(range);
+
+    try {
+      IntFunction<ByteBuffer> allocate = doForceDirect ? 
ByteBuffer::allocateDirect : ByteBuffer::allocate;
+
+      fileInputStream.readVectored(fRanges, allocate);
+
+      int index = 0;
+      range = rootRange;
+      BufferChunk current = range.get();
+      while (current != null) {

Review Comment:
   Because FileRange has a reference field, you can put a reference back to the 
specific BufferChunk instance for each range. Then waiting for results becomes 
a matter of:
   1. waiting for the ranged reads to complete
   2. resolving the reference for each range
   3. setting each chunk's ByteBuffer to the range that was read.



##########
java/pom.xml:
##########
@@ -70,8 +70,8 @@
     <test.tmp.dir>${project.build.directory}/testing-tmp</test.tmp.dir>
     <example.dir>${project.basedir}/../../examples</example.dir>
 
-    <min.hadoop.version>2.7.3</min.hadoop.version>
-    <hadoop.version>2.7.3</hadoop.version>
+    <min.hadoop.version>3.3.9-SNAPSHOT</min.hadoop.version>

Review Comment:
   Set this to 3.3.5 (a released version) rather than a SNAPSHOT.



##########
java/shims/pom.xml:
##########
@@ -59,6 +59,12 @@
       <artifactId>junit-jupiter-api</artifactId>
       <scope>test</scope>
     </dependency>
+      <dependency>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-hdfs-client</artifactId>
+          <version>3.3.9-SNAPSHOT</version>

Review Comment:
   Is this really needed? If it is, it's a failure in the API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to