[https://issues.apache.org/jira/browse/DRILL-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467742#comment-17467742]
ASF GitHub Bot commented on DRILL-8085:
---------------------------------------
paul-rogers opened a new pull request #2419:
URL: https://github.com/apache/drill/pull/2419
# [DRILL-8085](https://issues.apache.org/jira/browse/DRILL-8085): EVF V2
support in the "Easy" format plugin
## Description
Provides an easy integration between the "Easy" format plugin and the "EVF
V2" structure, similar to what already exists for EVF V1. Revises the
limit pushdown approach. Fixes the JSON writer to handle `TIMESTAMP`
consistently with the reader (so that certain tests pass on a non-UTC build
machine). Includes numerous other bits of cleanup.
## Design Overview
Prior work added the "EVF" version 2 framework. Briefly, the original
"extended vector framework" (EVF) started with the vector accessor layer to
control vector sizes, then added the "result set loader" to provide an easy
interface for readers to write to vectors using the accessors. This still left
a large amount of plumbing to work out the "schema-on-read" set of columns to
use. "EVF V1" fumbled its way to a solution, but the result was complex and
limited. EVF V2 reworks the mechanism to cleanly support schema-on-read, a
provided schema, an up-front schema defined by the data source, and the set of
columns requested in the query. Figuring that out is amazingly complex (which
should tell us something about the wisdom of Drill's current schema-on-read
approach...)
The EVF V1 work split the scan operation into layers, which are
preserved in V2:
* The Volcano operator (AKA the poorly-named "record batch"):
`OperatorRecordBatch`
* The scan operator (`ScanOperatorExec`), which iterates over the readers
within a scan, and issues "events" to a listener: `ScanOperatorEvents`.
* The scan and reader event listeners for EVF V2. (The old listeners, for EVF
V1, also still exist.)
* The format-specific plugins and readers.
* The `ResultSetLoader`, `ColumnWriter` and `ValueVector` abstractions.
The first two layers are meant to point the way to restructuring Drill's
overly complex record batches by dividing up responsibilities. So far, only the
scan has been "modernized" this way.
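The sketch below is a highly simplified, illustrative view of that layering;
Drill's real `ScanOperatorExec` and `ScanOperatorEvents` interfaces carry much
more, and the method names here are assumptions for exposition only.

```java
// Illustrative only: a stripped-down view of the scan layering above.
// The real ScanOperatorExec / ScanOperatorEvents interfaces are richer;
// these method names are assumptions, not Drill's actual API.
interface ReaderEvents {
  boolean open();             // set up one format-specific reader
  boolean next();             // fill one batch; false when the reader is done
  void close();               // release the reader's resources
}

interface ScanEvents {
  ReaderEvents nextReader();  // the scan operator iterates these in turn
}
```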
The gist of this change is to modify the `EasyFormatPlugin` to add a third
way to build the readers. EVF V2 is simpler than V1, and far simpler than the
"classic" do-it-all-yourself approach. A setting in the `EasyFormatConfig`
class says which to use.
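As a rough illustration, a plugin might opt in like the sketch below; the
builder methods and the `ScanFrameworkVersion` enum names are assumptions for
exposition, not necessarily the exact API in this change.

```java
// Illustrative sketch: selecting the scan framework in EasyFormatConfig.
// The builder methods and ScanFrameworkVersion enum are assumed names;
// consult the PR for the actual API. (Drill imports omitted.)
EasyFormatConfig config = EasyFormatConfig.builder()
    .readable(true)
    .writable(false)
    .extensions(Collections.singletonList("httpd"))
    .scanVersion(ScanFrameworkVersion.EVF_V2)  // vs. CLASSIC or EVF_V1
    .build();
```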
## Limit Pushdown
[DRILL-7763](https://issues.apache.org/jira/browse/DRILL-7763) added `LIMIT`
push-down to the file format readers. However, this work was limited in three
ways:
* The limit work was added to each reader by counting records in the
read-each-record loop.
* The limit was not applied across multiple readers in the same scan. If the
limit is 50, and the scan has two readers, R1 with 40 rows and R2 with 100
rows, then we want to limit R2 to 50 - 40 = 10 rows.
* The code used `maxRecords=0` to mean "no limit." However, `LIMIT 0` is
commonly used by BI tools to obtain just a schema for a query. (Much work was
done back in 2015 to enable this feature throughout Drill.)
As it turns out, the `ResultSetLoader` already does most of this work when
enforcing its various batch-size limits; the cross-reader portion belongs in
the scan framework. This PR adds that functionality: the limit is passed
from the plan into the EVF V2 mechanism, which automagically computes and
enforces the per-reader limit. The `RowReader` simply reports "OK, stop
reading, the batch is full" when the limit is reached. The scan framework
detects that the batch was full because of the limit, and won't call your
reader for another batch. `LIMIT 0` now works properly as well.
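As a concept demo (outside Drill; the names here are illustrative, and in the
PR this logic lives inside the EVF V2 scan framework), the cross-reader
bookkeeping amounts to:

```java
import java.util.List;

// Conceptual sketch only: the cross-reader limit bookkeeping described
// above, modeled outside Drill with plain row counts.
public class LimitDemo {
  static int runScan(List<Integer> readerRowCounts, int limit) {
    int remaining = limit;
    int total = 0;
    for (int available : readerRowCounts) {
      if (remaining == 0) {
        break;                                 // skip remaining readers entirely
      }
      int read = Math.min(available, remaining);  // reader stops at its cap
      total += read;
      remaining -= read;
    }
    return total;
  }

  public static void main(String[] args) {
    // R1 has 40 rows, R2 has 100; LIMIT 50 reads 40 + 10 = 50 rows.
    System.out.println(runScan(List.of(40, 100), 50));  // prints 50
  }
}
```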
As a result, to get full limit support, all a reader has to do is use EVF
V2; nothing more is needed. Readers that rolled their own limit support should
rip out that code (that is, revert the DRILL-7763 changes). See the converted
`HttpdLogFormatPlugin` (below) for what to do.
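For reference, the usual EVF reader batch loop looks like the sketch below
(based on the documented EVF pattern; the `loadRow()` helper is a placeholder
for format-specific parsing). Note that there is no row counting: the writer
reports "full" when either the batch size or the pushed-down limit is reached.

```java
// Typical EVF reader batch loop (per the documented EVF pattern);
// loadRow() is a placeholder for format-specific parsing. No row counting
// is needed: isFull() covers both batch-size and limit enforcement.
@Override
public boolean next() {
  RowSetLoader rowWriter = loader.writer();
  while (!rowWriter.isFull()) {   // stops at batch size or pushed-down LIMIT
    if (!loadRow(rowWriter)) {
      return false;               // end of input: no more batches
    }
  }
  return true;                    // batch done; the scan decides what's next
}
```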
As a side effect, the JSON field for the limit was renamed from
`maxRecords` to `limit` to make its meaning clear.
## Examples
This PR includes conversion of the "compliant text" (AKA "CSV") reader to
use EVF V2. However, that reader is quite complex and is not a good example
for others to follow. So, this PR also converts the `HttpdLogFormatPlugin`,
which is a much better model.
## Next Steps
Once this work is merged, I encourage the team to migrate the other plugins
that use EVF V1 to use V2. The code will be simpler and you get `LIMIT` support
for free. The readers will also automagically get support for a provided schema.
Once all plugins are converted, the next step will be to rip out the EVF V1
code, since it is so complex that even I (as the author) need significant time
to understand it. Good riddance. "Build one to throw away," as Brooks said...
## Also includes
* [DRILL-8100](https://issues.apache.org/jira/browse/DRILL-8100): JSON
record writer does not convert Drill local timestamp to UTC.
* [DRILL-8086](https://issues.apache.org/jira/browse/DRILL-8086): Convert
the CSV (AKA "compliant text") reader to EVF V2.
* [DRILL-8084](https://issues.apache.org/jira/browse/DRILL-8084): Scan LIMIT
pushdown fails across files.
* Clean-up of several flaky tests.
`TestHttpPlugin` chose port 8091, which collides with other software (in this
case, Druid). Ad hoc uses should pick higher port numbers, so the port was
changed to 44332. (Ideally, it would be randomly selected on server start.)
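A minimal sketch of that "randomly selected" idea (illustrative only, not what
this PR implements): bind to port 0 and let the OS pick a free ephemeral port.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrative only: let the OS choose a free ephemeral port for a test
// server, instead of hard-coding one that may collide with other software.
public class FreePort {
  public static int pickFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {  // 0 = OS-assigned
      return socket.getLocalPort();
    }
  }
}
```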
## `TIMESTAMP`
I struggled with
[DRILL-8101](https://issues.apache.org/jira/browse/DRILL-8101): Resolve the
TIMESTAMP madness. Drill treats the `TIMESTAMP` type as both a `DATETIME` (no
time zone) and a Unix-like timestamp (UTC time zone). The tests play
whack-a-mole as these conflicting interpretations collide: some tests only work
in the UTC time zone (but my machine is in the Pacific zone).
This PR fixes the JSON writer so it converts from Drill's local-time
`TIMESTAMP` to UTC, to "undo" the UTC-to-local conversion on read. However,
this just caused other tests to fail, since Parquet assumes that `TIMESTAMP` is
either UTC or an unspecified time zone (so Parquet can assume UTC if it wants).
As a result, certain CTAS tests with both Parquet and JSON fail unless they are
run in UTC. Since the problem is too big to fix in this PR, I disabled those
tests. (The PR contains some clean-up from my attempt to find and understand
the problem.)
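A minimal sketch of the "undo" involved, using Joda-Time (which Drill uses
internally); these helpers are illustrative, not the PR's actual code:

```java
import org.joda.time.DateTimeZone;

// Illustrative only: the read/write shift described above. On read, a UTC
// instant is shifted into the local zone; the fixed JSON writer shifts it
// back to UTC before serializing.
public class TimestampShift {
  static long utcToLocal(long utcMillis) {    // what the reader effectively does
    return DateTimeZone.getDefault().convertUTCToLocal(utcMillis);
  }

  static long localToUtc(long localMillis) {  // what the writer must do to undo it
    return DateTimeZone.getDefault().convertLocalToUTC(localMillis, false);
  }
}
```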
## Documentation
No user-visible changes.
## Testing
Added unit tests for new functionality. Revised existing tests where needed,
or where they were flaky on my machine.
> EVF V2 support in the "Easy" format plugin
> ------------------------------------------
>
> Key: DRILL-8085
> URL: https://issues.apache.org/jira/browse/DRILL-8085
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.19.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Add support for EVF V2 to the {{EasyFormatPlugin}} similar to how EVF V1
> support already exists. Provide examples for others to follow.