I think the following makes sense (from your email) null -> scan-all non-empty with star -> scan-all non-empty w/o star -> regular / scan-some empty -> count(*) / skip-all
The reason I identified it in the DrillScanRel is because the planner actually returns an empty set for the count(*) case as that actually gets rewritten as count(0) or count(1) if I recall correctly. As such, the list of columns needed to satisfy are the empty set. Agreed that most (all?) readers don't support currently. We should address this separately and make sure the readers handle what they are given (e.g. if they don't support skip-all, they should convert to something they do support) rather than removing the capability in the Rel layer. On Sun, Feb 8, 2015 at 2:44 PM, Hanifi Gunes <[email protected]> wrote: > - We should treat an empty list different from a null list different from > an all list. > Agreed. What would each mean to DrillScanRel though? > null -> scan-all or problem? > non-empty with star -> scan-all > non-empty w/o star -> regular / scan-some > empty -> count(*) / skip-all > > > I have no concrete understanding of what planner outputs as the list of > columns in case of count(*) vs select * queries. It would be nice to > differentiate these cases without use of "null" to mean anything special > though. Earlier, planner -at least in DrillScanRule- used to invoke a c'tor > that used to pass "null" to mean "all" which led some ambiguity: is it that > code is broken or callee asks me to scan everything? This was one of the > reasons for disallowing nulls. Another one was to wipe out all null checks > on columns from all of the readers. > > > The other part of the consideration is readers. Readers don't seem to > support skip-all queries particularly for count. Looks like the current > JsonReader implementation(so does mongo reader) after major refactoring > also relies on a non-empty list of columns while ensuring at least one > column is scanned for count queries, which I doubt works as intended in > case an empty list is used to mean skip-all. Part of the code change to > make count(*) more efficient is to make every reader understand skip-all > for count queries. > > > Earlier we had some discussions on how to make count(*) more efficient as > well. A good starting point could be to resolve ambiguity at the planner > side with the help of planning experts perhaps. If reader knows for sure > that she executing a count(*) query, she should be able to output a cheap > count vector instead of scanning everything without enforcing schema check > on the count column for readers that is schema-ed. > > How does this sound? > > > Thanks. > -Hanifi > > > On Sat, Feb 7, 2015 at 6:47 PM, Jacques Nadeau <[email protected]> wrote: > > > Hey Guys, > > > > When reviewing our code I ran across this code: > > > > > > > https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java#L70 > > > > The problem with this code is it converts no columns into every column. > > This means we waste a bunch of time readers where we don't need to > project > > any columns. An example is select count(*) which turns into select > > count(0). In that case, we should avoid projecting any columns other > than > > a count column. However, because we translate no columns into every > > column, we read way more than we should. I remember someone (maybe > Hanifi) > > working through some rules around not allowing a blank list of columns > > anywhere. What was the thinking and how do we correct. We should treat > an > > empty list different from a null list different from an all list. > > > > thoughts? > > >
