HashJoin needs to first sniff the first data holding batch from the probe side 
in order to compute stats for the memory calculations used to spill. If the 
probe side is empty we will hit NONE. This is unavoidable and this is where the 
problem begins.

In the case of a right join HashJoin still builds HashTables, and the 
HashTables require build and probe batches to be updated simultaneously before 
being built. When the build and probe sides are updated the Hashtable expects a 
vector container with valid columns. If the columns are not present an IOB is 
thrown.

Then later the probing logic is called in HashJoinProbeTemplate, which expects 
the upstream probeBatch to return a valid record count.

There are a few ways to fix this problem:

  1. We can refactor the HashTable to update probe and build sides 
independently. This will have some corner cases though, and the HashTable has 
no unit tests, so it's risky. Or we can remove the code generation from the 
HashTable, which will also be somewhat time consuming. Or we could change the 
join logic to not build the HashTable in this case, which is another major 
change. Then we can also refactor HashJoinProbeTemplate to handle an empty 
probe side gracefully, but again HashJoinProbeTemplate is not unit tested so 
this will be time consuming.

  2. We leave the HashTable and HashJoinProbeTemplate as is, but use the last 
schema we got from the probe side to build dummy empty vector containers for 
building the HashTable, and for probing in HashJoinProbeTemplate. IMO this is a 
hack. Also there are some corner cases because some Drill operators violate the 
contract that they must first send OK_NEW_SCHEMA and then NONE. I've observed 
in some unit tests some readers directly skip to sending NONE without 
OK_NEW_SCHEMA if there is no data.

 3. We define the state of a RecordBatch after it returns NONE to have a record 
count of zero, the last valid schema, and an empty VectorContainer. This would 
be defined should a downstream operator choose to query it. This resolves the 
issue in HashJoin without doing hacks and risky refactoring in a rush. It also 
defines state that was previously undefined, which IMO is a good thing. I would 
also do follow up PRs to make all operators obey this behavior, since it will 
be trivial to do.

 -  Of the three options, option 1 is the strictly correct thing to do if you 
don't consider time costs. However, due to the technical debt we've accumulated 
it will be time consuming to do.
 - Option 2 is a hack IMO and just makes things more confusing for the future 
generation of Drillers.
 - Option 3 is a win win. We get a clean resolution to the issue, and we define 
the state of operators after they return NONE. Defining state that was 
previously undefined is always a good thing, especially when it simplifies the 
logic of the system. And I can go through the operators and make sure they do 
this, it should be a trivial change involving zeroing VectorContainers before 
returning NONE and updating record counts to be 0.
  

[ Full content available at: https://github.com/apache/drill/pull/1472 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to