paul-rogers commented on a change in pull request #2419:
URL: https://github.com/apache/drill/pull/2419#discussion_r786323997
##########
File path:
contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogFormatPlugin.java
##########
@@ -40,18 +40,16 @@
private static class HttpLogReaderFactory extends FileReaderFactory {
private final HttpdLogFormatConfig config;
- private final int maxRecords;
private final EasySubScan scan;
- private HttpLogReaderFactory(HttpdLogFormatConfig config, int maxRecords,
EasySubScan scan) {
+ private HttpLogReaderFactory(HttpdLogFormatConfig config, EasySubScan
scan) {
this.config = config;
- this.maxRecords = maxRecords;
this.scan = scan;
}
@Override
- public ManagedReader<? extends FileScanFramework.FileSchemaNegotiator>
newReader() {
- return new HttpdLogBatchReader(config, maxRecords, scan);
+ public ManagedReader newReader(FileSchemaNegotiator negotiator) throws
EarlyEofException {
Review comment:
To see this in action, take a look at
`TestScanBasics.testEOFOnFirstOpen()`. If the exception is thrown, the scan
framework skips this reader and moves to the next. This exception reports that
"Hey, I have no data and no schema; please ignore me." Doesn't happen very
often, but this is a safety-valve when it does happen.
Suppose that the file (or result set) is empty an the constructor does not
throw an error. In that case, the scan framework calls `next()` which gathers
no rows, and returns `false`, which indicates EOF. If there is no schema also,
then this case is the same as if the `EarlyEofException` was thrown.
On the other hand, for or things that can provide a schema, even without
rows, then the first `next()` can build that schema, so we can return that
downstream, even if there are no rows.
With all of that, we can handle the various use cases:
* Reader that has no data at all, and can't even get its act together enough
to service a `next()` call: throw `EarlyEofException` from the constructor, and
the reader is skipped. Example: a CSV file that existed at plan time, but is
now gone.
* Reader that has no data, but can provide a fixed schema. The constructor
builds the schema and returns. The first call to `next()` returns `false`, with
no data.
* Reader that has no data, and can provide a schema without it, but only by
retrieving results from somewhere else, such as a JDBC connection that can
return a schema even if there are no rows. The constructor does nothing with
schema. The first `read()` builds the schema, but provides no rows. That
`next()` returns `false` so we send the schema, but no data, downstream.
* Normal case: the reader either has a fixed schema in the constructor, or
discovers the schema on the first `next()`, and also reads data until EOF,
loading the data into batches, one per `next()` call.
Sorry this is so complex! But, this is is all essential to fully support all
the many crazy readers and the "schema-on-read, but only sometimes" world in
which Drill operates.
##########
File path:
contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogBatchReader.java
##########
@@ -40,36 +41,29 @@
import java.io.InputStream;
import java.io.InputStreamReader;
-public class HttpdLogBatchReader implements
ManagedReader<FileSchemaNegotiator> {
+public class HttpdLogBatchReader implements ManagedReader {
private static final Logger logger =
LoggerFactory.getLogger(HttpdLogBatchReader.class);
public static final String RAW_LINE_COL_NAME = "_raw";
public static final String MATCHED_COL_NAME = "_matched";
private final HttpdLogFormatConfig formatConfig;
- private final int maxRecords;
- private final EasySubScan scan;
- private HttpdParser parser;
- private FileSplit split;
+ private final HttpdParser parser;
+ private final FileDescrip file;
private InputStream fsStream;
- private RowSetLoader rowWriter;
+ private final RowSetLoader rowWriter;
private BufferedReader reader;
private int lineNumber;
- private CustomErrorContext errorContext;
- private ScalarWriter rawLineWriter;
- private ScalarWriter matchedWriter;
+ private final CustomErrorContext errorContext;
+ private final ScalarWriter rawLineWriter;
+ private final ScalarWriter matchedWriter;
private int errorCount;
-
- public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, int
maxRecords, EasySubScan scan) {
+ public HttpdLogBatchReader(HttpdLogFormatConfig formatConfig, EasySubScan
scan, FileSchemaNegotiator negotiator) {
this.formatConfig = formatConfig;
- this.maxRecords = maxRecords;
- this.scan = scan;
- }
- @Override
- public boolean open(FileSchemaNegotiator negotiator) {
Review comment:
There is a more fundamental consideration. Drill is distributed: we want
to hold resources as short a time as possible. In the previous design,
developers had to understand to not open files or obtain resources in the
constructor. They had to remember to do that in `open()`. A side effect is that
many variables where thus non-final.
Now, the readers follow the Java `Stream`, `Reader` and `Writer` convention:
the constructor opens any external files, connections or other resources.
`close()` releases them. The framework ensures that these operations span the
shortest possible amount of time.
For background, in the pre-EVF (so-called "scan V1") code, it was common for
people to open files in the constructor, but we'd create all the readers up
front: so we'd have lots of open files, lots of unused buffers, etc. Now, the
developer simply can't do things wrong: the constructor is called as late as
possible, and it is the only place available to obtain resources. The scan
calls `close()` as early as possible (and on an error) so resources can't leak
or sit idle.
Again, sorry the scan is so complex. At least, with EVF, all the logic is in
one place rather than copy-pasted across all readers (where it was often done
wrongly) as in "scan V1."
##########
File path:
contrib/format-httpd/src/main/java/org/apache/drill/exec/store/httpd/HttpdLogFormatPlugin.java
##########
@@ -75,24 +73,15 @@ private static EasyFormatConfig easyConfig(Configuration
fsConf, HttpdLogFormatC
.fsConf(fsConf)
.defaultName(DEFAULT_NAME)
.readerOperatorType(OPERATOR_TYPE)
- .useEnhancedScan(true)
+ .scanVersion(ScanFrameworkVersion.EVF_V2)
.supportsLimitPushdown(true)
.build();
}
@Override
- public ManagedReader<? extends FileSchemaNegotiator> newBatchReader(
- EasySubScan scan, OptionManager options) {
- return new HttpdLogBatchReader(formatConfig, scan.getMaxRecords(), scan);
- }
-
- @Override
- protected FileScanFramework.FileScanBuilder frameworkBuilder(OptionManager
options, EasySubScan scan) {
Review comment:
Good question. As it turns out, this is done in the base mechanism. We
get here (actually, there, in the other column, in the new code) from the
`EasyFormatPlugin`:
```java
private CloseableRecordBatch buildScanV3(FragmentContext context,
EasySubScan scan) throws ExecutionSetupException {
EasyFileScanBuilder builder = new EasyFileScanBuilder(context, scan,
this);
configureScan(builder, scan);
return builder.buildScanOperator(context, scan);
}
```
The base framework, in `EasyFileScanBuilder`, passes the options and a bunch
of other general state into the builder so that we don't have redundant code in
each reader. The reader only does the bits unique to that reader.
Note, the `OptionsManager` should be used only by code that updates options.
Code that reads options, such as operators, should use `OptionSet` instead,
which is what the new EVF V3 code uses.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]