pavibhai commented on a change in pull request #668:
URL: https://github.com/apache/orc/pull/668#discussion_r610833399
##########
File path: java/core/src/java/org/apache/orc/impl/reader/tree/TypeReader.java
##########
@@ -24,22 +25,51 @@
import org.apache.orc.impl.reader.StripePlanner;
import java.io.IOException;
+import java.util.EnumSet;
public interface TypeReader {
void checkEncoding(OrcProto.ColumnEncoding encoding) throws IOException;
- void startStripe(StripePlanner planner) throws IOException;
+ void startStripe(StripePlanner planner, EnumSet<ReadLevel> readLevel) throws IOException;
- void seek(PositionProvider[] index) throws IOException;
+ void seek(PositionProvider[] index, EnumSet<ReadLevel> readLevel) throws IOException;
- void seek(PositionProvider index) throws IOException;
+ void seek(PositionProvider index, EnumSet<ReadLevel> readLevel) throws IOException;
- void skipRows(long rows) throws IOException;
+ void skipRows(long rows, EnumSet<ReadLevel> readLevel) throws IOException;
void nextVector(ColumnVector previous,
boolean[] isNull,
int batchSize,
- FilterContext filterContext) throws IOException;
+ FilterContext filterContext,
+ EnumSet<ReadLevel> readLevel) throws IOException;
int getColumnId();
+
+ ReadLevel getReadLevel();
+
+ /**
+ * Determines if the child of the parent should be allowed based on the read level. The child
+ * is allowed based on the read level or if the child is a LEAD_PARENT, this allows the handling
+ * of FOLLOW children on the LEAD_PARENT
+ * @param reader the child reader that is being evaluated
+ * @param readLevel the requested read level
+ * @return true if allowed by read level or if it is a LEAD_PARENT otherwise false
+ */
+ static boolean allowChild(TypeReader reader, EnumSet<ReadLevel> readLevel) {
+ return readLevel.contains(reader.getReadLevel())
+ || reader.getReadLevel() == ReadLevel.LEAD_PARENT;
+ }
+
+ enum ReadLevel {
+ LEAD_CHILD, // Read only the elementary filter columns
Review comment:
Sure.
Consider the following schema and its classification:
* f1: Long
* f2: String
* f3: Struct
  * a: String
  * b: Long
If the filter is based on f2 and f3.a, the columns are classified as follows:
* f1: Long: FOLLOW
* f2: String: LEAD_CHILD
* f3: Struct: LEAD_PARENT
  * a: String: LEAD_CHILD
  * b: Long: FOLLOW
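
The classification above can be sketched roughly as follows. This is only an illustration, not the code in this pull request: the `classify` and `isAncestorOfFilter` helpers, the dotted column paths, and the class name `ReadLevelClassification` are all hypothetical; only the three `ReadLevel` names come from the change under review.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReadLevelClassification {
  // Level names as used in the pull request; everything else here is hypothetical.
  enum ReadLevel { LEAD_CHILD, LEAD_PARENT, FOLLOW }

  // Classifies each dotted column path given the set of filter column paths.
  static Map<String, ReadLevel> classify(List<String> columns, Set<String> filterColumns) {
    Map<String, ReadLevel> result = new LinkedHashMap<>();
    for (String column : columns) {
      if (filterColumns.contains(column)) {
        // The column itself is used by the filter.
        result.put(column, ReadLevel.LEAD_CHILD);
      } else if (isAncestorOfFilter(column, filterColumns)) {
        // The column is a parent of at least one filter column.
        result.put(column, ReadLevel.LEAD_PARENT);
      } else {
        // The column is only read for rows that pass the filter.
        result.put(column, ReadLevel.FOLLOW);
      }
    }
    return result;
  }

  // True if some filter column path lies underneath the given column path.
  static boolean isAncestorOfFilter(String column, Set<String> filterColumns) {
    for (String filterColumn : filterColumns) {
      if (filterColumn.startsWith(column + ".")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> columns = List.of("f1", "f2", "f3", "f3.a", "f3.b");
    Map<String, ReadLevel> levels = classify(columns, Set.of("f2", "f3.a"));
    levels.forEach((column, level) -> System.out.println(column + ": " + level));
  }
}
```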
LEAD_PARENT matters from the standpoint of seeks: a seek on the FOLLOW readers
must also seek the LEAD_PARENT readers, so that the non-null rows in each
reader can be determined.
This is documented in
[prepareFollowReaders](https://github.com/apache/orc/blob/cc48b40e245f717ffc79c170f94f750da4fccf32/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L1305)
This comes into play when, say, the lead read has already advanced to row 8123
while the previous follow read stopped at row 1000. To skip the follow readers
to row 8123, we must determine the number of non-null values on the parent; in
our example this is needed for f3.b.
This is achieved in the following steps:
1. Seek the LEAD_PARENT readers to the follow position, i.e. 1000.
2. Seek the LEAD_PARENT and FOLLOW readers to the desired position, 8123. This
ensures that the non-null count is determined correctly and that the follow
readers are positioned.
The LEAD_CHILD readers are unaffected and remain at 8123 throughout the above
process; at the end, all readers are on 8123. Hope this is clear. If this is
better handled via a call, I am open to it.
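
The two seek steps can be pictured with a small toy model. This is a sketch under stated assumptions, not the `RecordReaderImpl` logic: `ToyReader`, `seek`, and `catchUpFollowReaders` are hypothetical names, each reader only tracks a row position, and the model does not count non-null values. It only shows which readers move in each step when the lead readers are already at the lead row.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

public class FollowSeekSketch {
  // Level names as used in the pull request; everything else here is hypothetical.
  enum ReadLevel { LEAD_CHILD, LEAD_PARENT, FOLLOW }

  // A toy reader that only remembers which row it is positioned on.
  static final class ToyReader {
    final String name;
    final ReadLevel level;
    long row;

    ToyReader(String name, ReadLevel level, long row) {
      this.name = name;
      this.level = level;
      this.row = row;
    }
  }

  // Moves only the readers whose level is in the requested set.
  static void seek(List<ToyReader> readers, long row, EnumSet<ReadLevel> levels) {
    for (ToyReader reader : readers) {
      if (levels.contains(reader.level)) {
        reader.row = row;
      }
    }
  }

  // Brings the follow readers from followRow up to leadRow in two steps.
  static List<ToyReader> catchUpFollowReaders(long leadRow, long followRow) {
    List<ToyReader> readers = new ArrayList<>();
    readers.add(new ToyReader("f1", ReadLevel.FOLLOW, followRow));
    readers.add(new ToyReader("f2", ReadLevel.LEAD_CHILD, leadRow));
    readers.add(new ToyReader("f3", ReadLevel.LEAD_PARENT, leadRow));
    readers.add(new ToyReader("f3.a", ReadLevel.LEAD_CHILD, leadRow));
    readers.add(new ToyReader("f3.b", ReadLevel.FOLLOW, followRow));

    // Step 1: bring the LEAD_PARENT readers back to the follow position.
    seek(readers, followRow, EnumSet.of(ReadLevel.LEAD_PARENT));
    // Step 2: move LEAD_PARENT and FOLLOW readers together to the lead position.
    seek(readers, leadRow, EnumSet.of(ReadLevel.LEAD_PARENT, ReadLevel.FOLLOW));
    return readers;
  }

  public static void main(String[] args) {
    for (ToyReader reader : catchUpFollowReaders(8123L, 1000L)) {
      System.out.println(reader.name + " (" + reader.level + ") at row " + reader.row);
    }
  }
}
```

The LEAD_CHILD readers are never passed to `seek`, so in this toy model they stay on the lead row while the other readers catch up to it.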