Hi All,

Yesterday, in a conversation, Salim mentioned it would be handy to be able to 
capture and replay in-flight batches in a Drill query in order to diagnose 
problems. As it turns out, we have most of the pieces readily available; we 
just need someone to assemble them.

First, we have the IteratorValidatorBatchIterator class, which sits on top of 
each operator and validates that operator’s state. We extended it a while back 
to validate vector internals to catch a few cases of offset vector corruption. 
This class could be extended to capture in-flight batches for selected 
operators.

Second, we have the VectorAccessibleSerializable class (and the recently added 
VectorSerializer wrapper class), which writes batches to, and reads batches 
from, disk. This class is the foundation of our spilling support.

Third, we have the EasyFormatPlugin class that lets us easily create a new 
disk-based reader.

Combine them and we get capture and replay: the validator writes the batches 
for the selected operators to disk using the vector serializer, and a new easy 
format plugin reads those files back using the same serializer.
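
To make the shape of the idea concrete, here is a rough, self-contained sketch 
in plain Java. It does not use any Drill classes: Batch, BatchSource, 
CapturingSource and ReplaySource are hypothetical stand-ins I made up for a 
record batch, an operator, the validator-style wrapper, and the reader a format 
plugin would provide. The on-disk layout (a count followed by int values) is 
likewise just an assumption for illustration, not the format the vector 
serializer actually writes.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class BatchCaptureSketch {

  /** Hypothetical stand-in for a record batch: a single int column. */
  static final class Batch {
    final int[] values;
    Batch(int... values) { this.values = values; }
  }

  /** Hypothetical stand-in for an operator; next() returns null when done. */
  interface BatchSource {
    Batch next() throws IOException;
  }

  /** Sits on top of an operator and copies each batch to a file as it passes. */
  static final class CapturingSource implements BatchSource, Closeable {
    private final BatchSource upstream;
    private final DataOutputStream out;

    CapturingSource(BatchSource upstream, Path file) throws IOException {
      this.upstream = upstream;
      this.out = new DataOutputStream(
          new BufferedOutputStream(Files.newOutputStream(file)));
    }

    @Override public Batch next() throws IOException {
      Batch batch = upstream.next();
      if (batch != null) {
        out.writeInt(batch.values.length);        // row count header
        for (int v : batch.values) out.writeInt(v);
      }
      return batch;                                // downstream sees it unchanged
    }

    @Override public void close() throws IOException { out.close(); }
  }

  /** Reads captured batches back, playing the role of the new reader. */
  static final class ReplaySource implements BatchSource, Closeable {
    private final DataInputStream in;

    ReplaySource(Path file) throws IOException {
      this.in = new DataInputStream(
          new BufferedInputStream(Files.newInputStream(file)));
    }

    @Override public Batch next() throws IOException {
      int count;
      try {
        count = in.readInt();
      } catch (EOFException e) {
        return null;                               // no more captured batches
      }
      int[] values = new int[count];
      for (int i = 0; i < count; i++) values[i] = in.readInt();
      return new Batch(values);
    }

    @Override public void close() throws IOException { in.close(); }
  }

  /** Test helper: an in-memory "operator" backed by a list of batches. */
  static BatchSource fromList(List<Batch> batches) {
    Iterator<Batch> it = batches.iterator();
    return () -> it.hasNext() ? it.next() : null;
  }

  /** Pulls every batch from a source, as a downstream operator would. */
  static List<Batch> drain(BatchSource source) throws IOException {
    List<Batch> result = new ArrayList<>();
    for (Batch b = source.next(); b != null; b = source.next()) result.add(b);
    return result;
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("capture", ".batches");
    List<Batch> input = Arrays.asList(new Batch(1, 2, 3), new Batch(4, 5));
    try (CapturingSource capture = new CapturingSource(fromList(input), file)) {
      drain(capture);                              // run the "query", capturing
    }
    try (ReplaySource replay = new ReplaySource(file)) {
      for (Batch b : drain(replay)) System.out.println(Arrays.toString(b.values));
    }
    Files.delete(file);
  }
}
```

In Drill itself the capture half would live in the validator and the replay 
half behind a format plugin, with the vector serializer doing the real reading 
and writing; the sketch only shows how the pieces relate.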

The good news is that most of these classes have been around since the early 
days, so any technique built using them should work for any older version of 
Drill we need to debug. (Though, of course, we’d have to rebuild that old 
version to include the batch intercept code…)

Thanks,

- Paul
