[
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878899#comment-16878899
]
ASF GitHub Bot commented on DRILL-7306:
---------------------------------------
paul-rogers commented on issue #1813: DRILL-7306: Disable schema-only batch for
new scan framework
URL: https://github.com/apache/drill/pull/1813#issuecomment-508583884
Thanks, @arina-ielchiieva, for pointing me to the Parquet data sources. As
it turns out, I don't think that is the correct set of files used by the test.
If I manually count the matches for the "union03" query, I get three rows out
of a total of 1500 rows in the customer table. The expected results shown in
your earlier post show customer IDs beyond 1500, suggesting that the failed
query ran against a larger file than the one in the directory you suggested.
Unfortunately, I can't check the contents of the customer.parquet file
because I can't get Parquet tools to work after several hours of fighting one
thing after another. I seem to recall we discussed bundling that tool with
Drill. Would sure be handy.
Looking closer, it seems that the files in the test framework are for scale
factor (SF) 0.1, but the tests use files for SF 1. So I suspect I'm testing
against files 1/10 the size of those used in the tests that failed. I'm
guessing the test framework generates the SF 1 files during its setup phase
(which seems to require MFS to run).
Further, I'm completely mystified as to how my changes could impact Parquet,
since the only changed source files are for the "new" scan, which Parquet does
not use. Oddly, none of the text-file queries fail, even though that is the
area I *did* change.
So, net status is that I'm stuck: can't reproduce the issue, can't inspect
the data files, can't get access to the SF1 files, can't run the functional
tests.
Just to make sure I'm tracking down the correct issue: does the master
branch pass these same tests, using the same data files (that is, the same
cluster, without rebuilding the functional tests)?
Were the Parquet files used in the tests rebuilt recently? Might there be a
problem with the data itself?
I can't tell what the framework is doing. Does it run a CSV query
against the "golden" file to compare results? That said, the error seems to say
that the Parquet query returned zero rows, rather than that the Parquet results
didn't match the "golden" CSV expected results.
Any suggestions for how to proceed?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Disable "fast schema" batch for new scan framework
> --------------------------------------------------
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.16.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
> Fix For: 1.17.0
>
>
> The EVF framework is set up to return a "fast schema" empty batch with only
> schema as its first batch because, when the code was written, it seemed
> that's how we wanted operators to work. However, DRILL-7305 notes that many
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast
> schema" batch, this ticket asks to disable the feature in the new scan
> framework. The feature is disabled with a config option; it can be re-enabled
> if ever it is needed.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)