[GitHub] [drill] paul-rogers commented on a change in pull request #2419: DRILL-8085: EVF V2 support in the "Easy" format plugin

GitBox Mon, 17 Jan 2022 15:06:26 -0800


paul-rogers commented on a change in pull request #2419:
URL: https://github.com/apache/drill/pull/2419#discussion_r786323997




##########
File path: 
contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogFormatPlugin.java
##########
@@ -40,18 +40,16 @@
   private static class HttpLogReaderFactory extends FileReaderFactory {
 
     private final HttpdLogFormatConfig config;
-    private final int maxRecords;
     private final EasySubScan scan;
 
-    private HttpLogReaderFactory(HttpdLogFormatConfig config, int maxRecords, 
EasySubScan scan) {
+    private HttpLogReaderFactory(HttpdLogFormatConfig config, EasySubScan 
scan) {
       this.config = config;
-      this.maxRecords = maxRecords;
       this.scan = scan;
     }
 
     @Override
-    public ManagedReader<? extends FileScanFramework.FileSchemaNegotiator> 
newReader() {
-      return new HttpdLogBatchReader(config, maxRecords, scan);
+    public ManagedReader newReader(FileSchemaNegotiator negotiator) throws 
EarlyEofException {

Review comment:
       To see this in action, take a look at 
`TestScanBasics.testEOFOnFirstOpen()`. If the exception is thrown, the scan 
framework skips this reader and moves to the next. This exception reports that 
"Hey, I have no data and no schema; please ignore me." Doesn't happen very 
often, but this is a safety-valve when it does happen.
   
   Suppose that the file (or result set) is empty an the constructor does not 
throw an error. In that case, the scan framework calls `next()` which gathers 
no rows, and returns `false`, which indicates EOF. If there is no schema also, 
then this case is the same as if the `EarlyEofException` was thrown.
   
   On the other hand, for or things that can provide a schema, even without 
rows, then the first `next()` can build that schema, so we can return that 
downstream, even if there are no rows.
   
   With all of that, we can handle the various use cases:
   
   * Reader that has no data at all, and can't even get its act together enough 
to service a `next()` call: throw `EarlyEofException` from the constructor, and 
the reader is skipped. Example: a CSV file that existed at plan time, but is 
now gone.
   * Reader that has no data, but can provide a fixed schema. The constructor 
builds the schema and returns. The first call to `next()` returns `false`, with 
no data.
   * Reader that has no data, and can provide a schema without it, but only by 
retrieving results from somewhere else, such as a JDBC connection that can 
return a schema even if there are no rows. The constructor does nothing with 
schema. The first `read()` builds the schema, but provides no rows. That 
`next()` returns `false` so we send the schema, but no data, downstream.
   * Normal case: the reader either has a fixed schema in the constructor, or 
discovers the schema on the first `next()`, and also reads data until EOF, 
loading the data into batches, one per `next()` call.
   
   Sorry this is so complex! But, this is is all essential to fully support all 
the many crazy readers and the "schema-on-read, but only sometimes" world in 
which Drill operates.

##########
File path: 
contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogBatchReader.java
##########
@@ -40,36 +41,29 @@
 import java.io.InputStream;
 import java.io.InputStreamReader;
 
-public class HttpdLogBatchReader implements 
ManagedReader<FileSchemaNegotiator> {
+public class HttpdLogBatchReader implements ManagedReader {
 
   private static final Logger logger = 
LoggerFactory.getLogger(HttpdLogBatchReader.class);
   public static final String RAW_LINE_COL_NAME = "_raw";
   public static final String MATCHED_COL_NAME = "_matched";
   private final HttpdLogFormatConfig formatConfig;
-  private final int maxRecords;
-  private final EasySubScan scan;
-  private HttpdParser parser;
-  private FileSplit split;
+  private final HttpdParser parser;
+  private final FileDescrip file;
   private InputStream fsStream;
-  private RowSetLoader rowWriter;
+  private final RowSetLoader rowWriter;
   private BufferedReader reader;
   private int lineNumber;
-  private CustomErrorContext errorContext;
-  private ScalarWriter rawLineWriter;
-  private ScalarWriter matchedWriter;
+  private final CustomErrorContext errorContext;
+  private final ScalarWriter rawLineWriter;
+  private final ScalarWriter matchedWriter;
   private int errorCount;
 
-
-  public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, int 
maxRecords, EasySubScan scan) {
+  public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, EasySubScan 
scan, FileSchemaNegotiator negotiator) {
     this.formatConfig = formatConfig;
-    this.maxRecords = maxRecords;
-    this.scan = scan;
-  }
 
-  @Override
-  public boolean open(FileSchemaNegotiator negotiator) {

Review comment:
       There is a more fundamental consideration. Drill is distributed: we want 
to hold resources as short a time as possible. In the previous design, 
developers had to understand to not open files or obtain resources in the 
constructor. They had to remember to do that in `open()`. A side effect is that 
many variables where thus non-final.
   
   Now, the readers follow the Java `Stream`, `Reader` and `Writer` convention: 
the constructor opens any external files, connections or other resources. 
`close()` releases them. The framework ensures that these operations span the 
shortest possible amount of time.
   
   For background, in the pre-EVF (so-called "scan V1") code, it was common for 
people to open files in the constructor, but we'd create all the readers up 
front: so we'd have lots of open files, lots of unused buffers, etc. Now, the 
developer simply can't do things wrong: the constructor is called as late as 
possible, and it is the only place available to obtain resources. The scan 
calls `close()` as early as possible (and on an error) so resources can't leak 
or sit idle.
   
   Again, sorry the scan is so complex. At least, with EVF, all the logic is in 
one place rather than copy-pasted across all readers (where it was often done 
wrongly) as in "scan V1."

##########
File path: 
contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogFormatPlugin.java
##########
@@ -75,24 +73,15 @@ private static EasyFormatConfig easyConfig(Configuration 
fsConf, HttpdLogFormatC
         .fsConf(fsConf)
         .defaultName(DEFAULT_NAME)
         .readerOperatorType(OPERATOR_TYPE)
-        .useEnhancedScan(true)
+        .scanVersion(ScanFrameworkVersion.EVF_V2)
         .supportsLimitPushdown(true)
         .build();
   }
 
   @Override
-  public ManagedReader<? extends FileSchemaNegotiator> newBatchReader(
-    EasySubScan scan, OptionManager options) {
-    return new HttpdLogBatchReader(formatConfig, scan.getMaxRecords(), scan);
-  }
-
-  @Override
-  protected FileScanFramework.FileScanBuilder frameworkBuilder(OptionManager 
options, EasySubScan scan) {

Review comment:
       Good question. As it turns out, this is done in the base mechanism. We 
get here (actually, there, in the other column, in the new code) from the 
`EasyFormatPlugin`:
   
   ```java
     private CloseableRecordBatch buildScanV3(FragmentContext context,
         EasySubScan scan) throws ExecutionSetupException {
       EasyFileScanBuilder builder = new EasyFileScanBuilder(context, scan, 
this);
       configureScan(builder, scan);
       return builder.buildScanOperator(context, scan);
     }
   ```
   
   The base framework, in `EasyFileScanBuilder`, passes the options and a bunch 
of other general state into the builder so that we don't have redundant code in 
each reader. The reader only does the bits unique to that reader.
   
   Note, the `OptionsManager` should be used only by code that updates options. 
Code that reads options, such as operators, should use `OptionSet` instead, 
which is what the new EVF V3 code uses.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [drill] paul-rogers commented on a change in pull request #2419: DRILL-8085: EVF V2 support in the "Easy" format plugin

Reply via email to