[GitHub] [drill] paul-rogers edited a comment on issue #1813: DRILL-7306: Disable schema-only batch for new scan framework

GitBox Thu, 04 Jul 2019 15:19:56 -0700

paul-rogers edited a comment on issue #1813: DRILL-7306: Disable schema-only 
batch for new scan framework
URL: https://github.com/apache/drill/pull/1813#issuecomment-508583884
 
 
   Thanks, @arina-ielchiieva, for pointing me to the Parquet data sources. As 
it turns out, I don't think that is the correct set of files used by the test. 
If I manually count the matches for the "union03" query, I get three rows out 
of a total of 1500 rows in the customer table. The expected results shown in 
your earlier post show customer IDs beyond 1500, suggesting that the failed 
query ran against a larger file than the one in the directory you suggested.
   
   Unfortunately, I can't check the contents of the customer.parquet file 
because I can't get Parquet tools to work, even after several hours of 
fighting one thing after another. I seem to recall we discussed bundling that 
tool with Drill. That would sure be handy.
   
   Looking closer, it seems that the files in the test framework are for scale 
factor (SF) 0.1, but the tests use files for SF1. So I suspect I'm testing 
against files 1/10 the size of those used in the tests that failed. I'm 
guessing the test framework generates the SF1 files during its setup phase 
(which seems to require MFS to run).
   
   Further, I'm completely mystified as to how my changes could impact 
Parquet, since the only changed source files are for the "new" scan, which 
Parquet does not use. Oddly, none of the text file queries fail, even though 
that is the area I *did* change.
   
   So, net status is that I'm stuck: can't reproduce the issue, can't inspect 
the data files, can't get access to the SF1 files, can't run the functional 
tests.
   
   Just to make sure I'm tracking down the correct issue: does the master 
branch pass these same tests using the same data files (that is, using the 
same cluster without rebuilding the functional tests)?
   
   Were the Parquet files used in the tests rebuilt recently? Might there be a 
problem with the data itself?
   
   I can't tell what the framework is doing. Does it run a CSV query against 
the "golden" file to compare results? Even so, the error seems to say that the 
Parquet query returned zero rows, not that the Parquet results failed to match 
the "golden" CSV expected results.
   
   Any suggestions for how to proceed?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
