[GitHub] [orc] pavibhai commented on a change in pull request #668: ORC-742: Added Lazy IO for non filter columns

GitBox Fri, 09 Apr 2021 11:36:48 -0700


pavibhai commented on a change in pull request #668:
URL: https://github.com/apache/orc/pull/668#discussion_r610833399




##########
File path: java/core/src/java/org/apache/orc/impl/reader/tree/TypeReader.java
##########
@@ -24,22 +25,51 @@
 import org.apache.orc.impl.reader.StripePlanner;
 
 import java.io.IOException;
+import java.util.EnumSet;
 
 public interface TypeReader {
   void checkEncoding(OrcProto.ColumnEncoding encoding) throws IOException;
 
-  void startStripe(StripePlanner planner) throws IOException;
+  void startStripe(StripePlanner planner, EnumSet<ReadLevel> readLevel) throws IOException;
 
-  void seek(PositionProvider[] index) throws IOException;
+  void seek(PositionProvider[] index, EnumSet<ReadLevel> readLevel) throws IOException;
 
-  void seek(PositionProvider index) throws IOException;
+  void seek(PositionProvider index, EnumSet<ReadLevel> readLevel) throws IOException;
 
-  void skipRows(long rows) throws IOException;
+  void skipRows(long rows, EnumSet<ReadLevel> readLevel) throws IOException;
 
   void nextVector(ColumnVector previous,
                   boolean[] isNull,
                   int batchSize,
-                  FilterContext filterContext) throws IOException;
+                  FilterContext filterContext,
+                  EnumSet<ReadLevel> readLevel) throws IOException;
 
   int getColumnId();
+
+  ReadLevel getReadLevel();
+
+  /**
+   * Determines if the child of the parent should be allowed based on the read level. The child
+   * is allowed based on the read level or if the child is a LEAD_PARENT, this allows the handling
+   * of FOLLOW children on the LEAD_PARENT
+   * @param reader the child reader that is being evaluated
+   * @param readLevel the requested read level
+   * @return true if allowed by read level or if it is a LEAD_PARENT otherwise false
+   */
+  static boolean allowChild(TypeReader reader, EnumSet<ReadLevel> readLevel) {
+    return readLevel.contains(reader.getReadLevel())
+           || reader.getReadLevel() == ReadLevel.LEAD_PARENT;
+  }
+
+  enum ReadLevel {
+    LEAD_CHILD,    // Read only the elementary filter columns

Review comment:
       Sure.
   
   Consider the following schema and classification
   * f1: Long
   * f2: String
   * f3: Struct
     * a: String
     * b: Long
   
   In the above schema, if the filter is based on f2 and f3.a, the classification is as follows:
   * f1: Long: FOLLOW
   * f2: String: LEAD_CHILD
   * f3: Struct: LEAD_PARENT
     * a: String: LEAD_CHILD
     * b: Long: FOLLOW
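
The classification above can be sketched against the `allowChild` rule from the diff. This is a minimal, hypothetical sketch: the class name `ReadLevelSketch`, the `classify` helper, and the three-value enum are assumptions made for illustration (only `LEAD_CHILD` is visible in the quoted diff), and `allowChild` here takes the child's level directly instead of a `TypeReader`.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReadLevelSketch {
  // Assumption: three levels, named as in the review comment.
  public enum ReadLevel { LEAD_CHILD, LEAD_PARENT, FOLLOW }

  // Same rule as TypeReader.allowChild in the diff, taking the child's level directly.
  public static boolean allowChild(ReadLevel childLevel, EnumSet<ReadLevel> readLevel) {
    return readLevel.contains(childLevel) || childLevel == ReadLevel.LEAD_PARENT;
  }

  // Classification from the example: the filter uses f2 and f3.a.
  public static Map<String, ReadLevel> classify() {
    Map<String, ReadLevel> levels = new LinkedHashMap<>();
    levels.put("f1", ReadLevel.FOLLOW);
    levels.put("f2", ReadLevel.LEAD_CHILD);
    levels.put("f3", ReadLevel.LEAD_PARENT);
    levels.put("f3.a", ReadLevel.LEAD_CHILD);
    levels.put("f3.b", ReadLevel.FOLLOW);
    return levels;
  }

  public static void main(String[] args) {
    // A FOLLOW-only request: which columns would take part?
    EnumSet<ReadLevel> followPass = EnumSet.of(ReadLevel.FOLLOW);
    classify().forEach((column, level) ->
        System.out.println(column + " " + level + " allowed=" + allowChild(level, followPass)));
  }
}
```

With a FOLLOW-only request, f1, f3 and f3.b take part while f2 and f3.a do not: f3 is admitted only because it is a LEAD_PARENT, which is what lets its FOLLOW child f3.b be handled.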
   
   LEAD_PARENT differs when it comes to seeks: a seek on the FOLLOW readers also involves seeking the LEAD_PARENTs, in order to determine the non-null rows in the reader.
   
   This is documented in 
[prepareFollowReaders](https://github.com/apache/orc/blob/cc48b40e245f717ffc79c170f94f750da4fccf32/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L1305)
   
   This comes into play when, say, the lead read has already reached row 8123 while the previous follow read stopped at row 1000. To skip the follow readers ahead to row 8123, we need to determine the number of non-null values on the parent (f3 in our example) in order to position f3.b.
   
   This is achieved in the following steps:
   1. Seek LEAD_PARENT to the follow position, i.e. 1000.
   2. Seek LEAD_PARENT + FOLLOW to the desired position, 8123. This ensures that the non-null count is determined correctly and the FOLLOW readers are positioned.
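
The two steps can be sketched with toy readers that only track a row position. This is an illustrative sketch, not the code in the PR: `FollowSeekSketch`, `ToyReader`, and `catchUp` are hypothetical names, the read-level sets passed at each step are an assumption, and the lead readers are assumed to already sit at row 8123 as in the scenario above.

```java
import java.util.EnumSet;

public class FollowSeekSketch {
  // Assumption: three levels, named as in the review comment.
  public enum ReadLevel { LEAD_CHILD, LEAD_PARENT, FOLLOW }

  /** Toy reader that only remembers the row it is positioned on. */
  public static class ToyReader {
    public final String name;
    public final ReadLevel level;
    public long row;

    public ToyReader(String name, ReadLevel level, long row) {
      this.name = name;
      this.level = level;
      this.row = row;
    }

    // Mirrors the allowChild rule from the diff: a LEAD_PARENT always takes part.
    public void seek(long target, EnumSet<ReadLevel> readLevel) {
      if (readLevel.contains(level) || level == ReadLevel.LEAD_PARENT) {
        row = target;
      }
    }
  }

  /** Returns the final rows of f2, f3 and f3.b after the follow readers catch up. */
  public static long[] catchUp(long leadRow, long followRow) {
    ToyReader f2 = new ToyReader("f2", ReadLevel.LEAD_CHILD, leadRow);
    ToyReader f3 = new ToyReader("f3", ReadLevel.LEAD_PARENT, leadRow);
    ToyReader f3b = new ToyReader("f3.b", ReadLevel.FOLLOW, followRow);
    ToyReader[] readers = {f2, f3, f3b};

    // Step 1: bring the LEAD_PARENT back to the follow position.
    for (ToyReader r : readers) {
      r.seek(followRow, EnumSet.of(ReadLevel.LEAD_PARENT));
    }
    // Step 2: move LEAD_PARENT and FOLLOW together to the desired position.
    for (ToyReader r : readers) {
      r.seek(leadRow, EnumSet.of(ReadLevel.LEAD_PARENT, ReadLevel.FOLLOW));
    }
    return new long[] {f2.row, f3.row, f3b.row};
  }

  public static void main(String[] args) {
    long[] rows = catchUp(8123, 1000);
    System.out.println("f2=" + rows[0] + " f3=" + rows[1] + " f3.b=" + rows[2]);
  }
}
```

In the real readers the parent seek is what yields the non-null count needed to position f3.b; the toy version only shows which readers move at each step.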
   
   LEAD_CHILD readers are unaffected and remain at 8123 throughout the above process; at the end all readers are at 8123. Hope this is clear. If this is better handled via a call, I am open to it.




