What sequence of RecordBatch.IterOutcome
<https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>
(cf. master
<https://github.com/dsbos/incubator-drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L41>)
values should ScanBatch's next() return for a reader (file, etc.) that has zero rows of
data, and what does that sequence depend on (e.g., whether there's still a non-empty schema
even though there are no rows, or whether there are other files in the scan)?  [See other
questions at bottom.]


I'm trying to resolve this question to fix DRILL-2288
<https://issues.apache.org/jira/browse/DRILL-2288>.  Its initial symptom was that
INFORMATION_SCHEMA queries that returned zero rows because of pushed-down filtering yielded
results with zero columns instead of the expected columns.  An additional symptom was that
"SELECT A, B, *" from an empty JSON file yielded zero columns instead of the expected
columns A and B (with zero rows).

The immediate cause of the problem (the missing schema information) was how 
ScanBatch.next() handled readers that returned no rows:

If a reader has no rows at all, then the first call to its next() method (from
ScanBatch.next()) returns zero, indicating that there are no more rows (and, in
this case, that there were no rows at all), and ScanBatch.next()'s call to the reader's
mutator's isNewSchema() returns true, indicating that the reader has a schema
that ScanBatch has not yet processed (e.g., has not yet notified its caller about).

Because of the way ScanBatch.next()'s code checked those conditions, when the last reader
had no rows at all, ScanBatch.next() returned IterOutcome.NONE.
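To make that concrete, here is a heavily simplified, hypothetical model of that pre-fix flow (the Reader and drain() names are illustrative only, not Drill's actual classes; a reader is reduced to its per-call row counts):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, heavily simplified model of the pre-fix ScanBatch.next() flow. */
public class ScanBatchModel {
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    /** A reader modeled as the sequence of row counts its next() calls return. */
    static class Reader {
        final int[] rowCounts;        // e.g. {0} models a zero-row file
        int call = 0;
        boolean schemaSeen = false;   // models the mutator's isNewSchema()
        Reader(int... rowCounts) { this.rowCounts = rowCounts; }
        int next() { return call < rowCounts.length ? rowCounts[call++] : 0; }
        boolean isNewSchema() {       // true exactly once, on first check
            if (schemaSeen) return false;
            schemaSeen = true;
            return true;
        }
    }

    /** Pre-fix behavior: a zero-row reader that is also the last reader yields
     *  NONE immediately, even though its schema was never announced. */
    static List<IterOutcome> drain(List<Reader> readers) {
        List<IterOutcome> outcomes = new ArrayList<>();
        Iterator<Reader> it = readers.iterator();
        Reader current = it.next();
        while (true) {
            int recordCount = current.next();
            if (recordCount == 0) {
                if (!it.hasNext()) {               // last reader: return NONE now
                    outcomes.add(IterOutcome.NONE);
                    return outcomes;
                }
                current = it.next();               // otherwise move to next reader
                continue;
            }
            outcomes.add(current.isNewSchema() ? IterOutcome.OK_NEW_SCHEMA
                                               : IterOutcome.OK);
        }
    }

    public static void main(String[] args) {
        // A single empty file: the caller sees NONE with no OK_NEW_SCHEMA first.
        System.out.println(drain(List.of(new Reader(0))));  // [NONE]
    }
}
```

In this model, a sole zero-row reader produces the outcome sequence [NONE], which is exactly the protocol problem described below.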

However, when that /last/ reader was the /only/ reader, that returning of
IterOutcome.NONE for a no-rows reader by ScanBatch.next() meant that next()
never returned IterOutcome.OK_NEW_SCHEMA for that ScanBatch.

That immediate return of NONE in turn meant that the downstream operator _never
received a return value of OK_NEW_SCHEMA to trigger its schema processing_.
(For example, in the DRILL-2288 JSON case, the project operator never
constructed its own schema containing columns A and B plus whatever columns
(none) came from the empty JSON file; in DRILL-2288's other case, the caller
never propagated the statically known columns from the INFORMATION_SCHEMA
table.)

That returning of NONE without ever returning OK_NEW_SCHEMA also violates the 
(apparent) intended call/return protocol (sequence of IterOutcome values) for 
RecordBatch.next(). (See the draft Javadoc comments currently at 
RecordBatch.IterOutcome 
<https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>.)


Therefore, it seems that ScanBatch.next() _must_ return OK_NEW_SCHEMA before
returning NONE, instead of immediately returning NONE, for readers/files with
zero rows, in at least _some_ cases.  (It must both notify the downstream
caller that there is a schema /and/ give the caller a chance to read the schema
(which is allowed after OK_NEW_SCHEMA is returned but not after NONE).)
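For illustration, here is a sketch of a hypothetical downstream consumer (not Drill's actual operator code; names are made up) that keys its schema setup off OK_NEW_SCHEMA, showing why a stream that ends in NONE without one leaves the operator with no schema at all:

```java
import java.util.List;

/** Hypothetical downstream operator showing why OK_NEW_SCHEMA must precede NONE. */
public class DownstreamModel {
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    /** Returns "no-schema" if the stream ends before any OK_NEW_SCHEMA arrives,
     *  i.e., the operator never got a chance to set up its output schema. */
    static String consume(List<IterOutcome> outcomes) {
        boolean schemaSetUp = false;
        for (IterOutcome o : outcomes) {
            switch (o) {
                case OK_NEW_SCHEMA:
                    schemaSetUp = true;   // (re)build the output schema here
                    break;
                case OK:
                    break;                // process rows under the current schema
                case NONE:
                    // The DRILL-2288 symptom: NONE with no prior OK_NEW_SCHEMA.
                    return schemaSetUp ? "done" : "no-schema";
            }
        }
        return schemaSetUp ? "done" : "no-schema";
    }

    public static void main(String[] args) {
        System.out.println(consume(List.of(IterOutcome.NONE)));                       // no-schema
        System.out.println(consume(List.of(IterOutcome.OK_NEW_SCHEMA,
                                           IterOutcome.NONE)));                       // done
    }
}
```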

However, it is not clear exactly what that set of cases is.  (It does not seem 
to be _all_ zero-row cases--returning OK_NEW_SCHEMA before returning NONE in 
all zero-row cases causes lots of errors about schema changes.)

At a higher level, the question is how zero-row files/etc. should interact with
sibling files/etc. (i.e., when they do and don't cause a schema change).  Note
that some kinds of files/sources still have a schema even when they have zero
rows of data (e.g., Parquet files, right?), while other kinds of files/sources
can't define (imply) any schema unless they have at least one row (e.g., JSON
files).


In my in-progress fix
<https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659> for
DRILL-2288, I have currently changed ScanBatch.next() so that when the last reader has
zero rows and next() would have returned NONE, next() now checks whether it has
already returned OK_NEW_SCHEMA (for any earlier files/readers); if it has not, it now
returns OK_NEW_SCHEMA (returning NONE on the following call), and if it has, it still
returns NONE.  (Note that, currently, that is regardless of whether the reader has no
schema (as from an empty JSON file) or has a schema.)
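That changed behavior can be sketched as follows (again a simplified, hypothetical model rather than the actual patch; each reader is reduced to a single total row count, and all non-empty readers are assumed to share one schema):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the in-progress fix: before the final NONE, emit
 *  OK_NEW_SCHEMA once if it was never returned for any earlier reader. */
public class FixedScanModel {
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    /** Each int is one reader's total row count; 0 models an empty file. */
    static List<IterOutcome> drain(int... readerRowCounts) {
        List<IterOutcome> outcomes = new ArrayList<>();
        boolean returnedNewSchema = false;
        for (int rows : readerRowCounts) {
            if (rows == 0) continue;            // zero-row reader: nothing to emit yet
            if (!returnedNewSchema) {
                outcomes.add(IterOutcome.OK_NEW_SCHEMA);
                returnedNewSchema = true;
            } else {
                outcomes.add(IterOutcome.OK);   // assumes the same schema across readers
            }
        }
        if (!returnedNewSchema) {
            // Changed behavior: announce the (possibly empty) schema before NONE,
            // so downstream operators get their OK_NEW_SCHEMA notification.
            outcomes.add(IterOutcome.OK_NEW_SCHEMA);
        }
        outcomes.add(IterOutcome.NONE);
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println(drain(0));     // [OK_NEW_SCHEMA, NONE]
        System.out.println(drain(3, 0));  // [OK_NEW_SCHEMA, OK? no -- just NONE next]
    }
}
```

Note that in this model a sole empty reader now yields [OK_NEW_SCHEMA, NONE] instead of the bare [NONE] that caused the DRILL-2288 symptoms.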

That change fixed the DRILL-2288 symptoms (apparently by giving 
downstream/calling operators notification that they didn't get before).

The change initially caused problems in UnionAllRecordBatch, because its code 
checked for NONE vs. OK_NEW_SCHEMA to try to detect empty inputs rather than 
checking directly. UnionAllRecordBatch has been fixed (in the in-progress fix 
for DRILL-2288).

However, that change still causes other schema-change problems.  The additional 
returns of OK_NEW_SCHEMA are causing some code to perceive unprocessable schema 
changes.  It is not yet clear whether the code should be checking the number of 
rows too, or OK_NEW_SCHEMA shouldn't be returned in as many subcases of the 
no-rows last-reader/file case.


So, some open and potential questions seem to be:

1. Is it the case that (a) any batch's next() should return OK_NEW_SCHEMA before
it returns NONE, and callers/downstream batches should be able to count on
getting OK_NEW_SCHEMA (e.g., to trigger setting up their downstream schemas),
or (b) empty files can cause next() to return NONE without ever returning
OK_NEW_SCHEMA, and therefore all downstream batch classes must handle getting
NONE before they have set up their schemas?
2. For a file/source kind that has a schema even when there are no rows, should
getting an empty file constitute a schema change?  (On one hand, there are no
actual /rows/ (following the new schema) conflicting with any previous schema
(and maybe rows), but on the other hand there is a non-empty /schema/ that can
conflict when that's enough to matter.)
3. For a file/source kind that implies a schema only when there are rows (e.g.,
JSON), when should or shouldn't that be considered a schema change?  If
ScanBatch reads non-empty JSON file A, reads empty JSON file B, and then reads
non-empty JSON file C implying the same schema as A did, should that be
considered a schema change or not?  (When reading no-schema/empty-schema B,
should ScanBatch keep the schema from A and check against that when it gets
to C, effectively ignoring the existence of B completely?)
4. In ScanBatch.next(), when the last reader had no rows at all, when should
next() return OK_NEW_SCHEMA?  Always?  /Iff/ the reader has a non-empty schema?
Just enough to never return NONE before returning OK_NEW_SCHEMA (which means it
acts differently for otherwise-identical empty files, depending on what
happened with previous readers)?  As in that last case, except only if the
reader has a non-empty schema?
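For question 3, the candidate policy of treating an empty, schema-less file as transparent can be sketched as follows (a hypothetical helper with made-up names, using strings to stand in for schemas; null models a file that implies no schema, like an empty JSON file):

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical policy for question 3: an empty, schema-less file is transparent. */
public class SchemaContinuity {
    /** Returns true iff no schema change would be reported across the file
     *  sequence, treating null (no implied schema, e.g. an empty JSON file)
     *  as "keep the current schema". */
    static boolean noSchemaChange(List<String> impliedSchemas) {
        String current = null;
        for (String s : impliedSchemas) {
            if (s == null) continue;          // empty file implies nothing; ignore it
            if (current != null && !Objects.equals(current, s)) return false;
            current = s;                      // first (or matching) schema becomes current
        }
        return true;
    }

    public static void main(String[] args) {
        // Non-empty A, empty B, C with the same schema as A: no change under this policy.
        System.out.println(noSchemaChange(java.util.Arrays.asList("(a,b)", null, "(a,b)")));  // true
    }
}
```

Under this policy, the A/empty-B/C sequence above is not a schema change, and B's existence is effectively ignored; whether that is the right semantics is exactly what the question asks.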

Thanks,
Daniel

--
Daniel Barclay
MapR Technologies
