Hello Drillers,

I am currently writing documentation to describe the interface and implementation patterns used in RecordBatch and its subclasses. These classes currently contain the implementations of all of our physical operators; subclasses include FilterRecordBatch, HashAggBatch, etc.
This naming convention has been a point of confusion for many developers as they get up to speed on Drill and begin to piece together the control flow of a query. The name "RecordBatch" implies that the class is logically a data structure that holds a batch of records. During execution, each downstream operator (which implements the RecordBatch interface) is able to access all of the data in the current batches (the actual data structure) from the operator(s) immediately preceding it. In this sense, calling this class a RecordBatch is not entirely inaccurate, as it provides a reference into the current data.

Where it gets confusing is that it does not just hold data. Each RecordBatch has a next() method, which is used to retrieve the next batch of records (the data structure). This is possible because the data is shared with consumers of the interface in the form of a vector container object, which wraps value vectors. A call to next() swaps out the data in the vector containers with new data.

I was talking with a few members of the dev team about this problem, and we all agreed that the interface and its implementations should be renamed. We talked further about the overall model and decided that some refactoring/encapsulation may come along with this renaming as we clarify these concepts.

I would like to start this discussion by proposing our candidates for the new name of the interface. The three that stood out for us were BatchIterator, BatchStream, and BatchCursor. Each of these represents a logical wrapper around data that a consumer accesses over time, in discrete chunks at some level. Each name also carries existing conventions that define it, and some might be more appropriate than others for the current implementation used by Drill.

Please share your thoughts on the best possible new name for RecordBatch.

Thanks,
Jason
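
P.S. To make the dual role described above concrete, here is a minimal sketch in Java. It is heavily simplified and is not the actual Drill code: the names SketchBatch, SketchContainer, SketchScan, SketchFilter, and BatchSketch are made up for illustration, plain int arrays stand in for value vectors, and next() is assumed to return a boolean. The only point is to show one object acting both as a reference into the current data and as the thing you call next() on to swap that data out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a vector container: holds the values of the current batch.
final class SketchContainer {
    private int[] values = new int[0];

    int[] values() { return values; }

    // Swap in the data of a new batch; consumers keep the same container reference.
    void load(int[] newValues) { values = newValues; }
}

// Hypothetical, simplified version of the interface under discussion.
interface SketchBatch {
    SketchContainer container(); // reference into the current data
    boolean next();              // advance to the next batch; false when exhausted
}

// Leaf operator that serves pre-built batches.
final class SketchScan implements SketchBatch {
    private final SketchContainer container = new SketchContainer();
    private final Iterator<int[]> source;

    SketchScan(List<int[]> batches) { this.source = batches.iterator(); }

    public SketchContainer container() { return container; }

    public boolean next() {
        if (!source.hasNext()) return false;
        container.load(source.next());
        return true;
    }
}

// Downstream operator: keeps only the even values of each upstream batch.
final class SketchFilter implements SketchBatch {
    private final SketchBatch upstream;
    private final SketchContainer container = new SketchContainer();

    SketchFilter(SketchBatch upstream) { this.upstream = upstream; }

    public SketchContainer container() { return container; }

    public boolean next() {
        if (!upstream.next()) return false;
        int[] kept = Arrays.stream(upstream.container().values())
                           .filter(v -> v % 2 == 0)
                           .toArray();
        container.load(kept);
        return true;
    }
}

final class BatchSketch {
    // Drain an operator, copying each batch out of the shared container.
    static List<List<Integer>> drain(SketchBatch op) {
        List<List<Integer>> out = new ArrayList<>();
        SketchContainer c = op.container(); // obtained once, before any call to next()
        while (op.next()) {
            List<Integer> batch = new ArrayList<>();
            for (int v : c.values()) batch.add(v);
            out.add(batch);
        }
        return out;
    }

    public static void main(String[] args) {
        SketchBatch plan = new SketchFilter(
            new SketchScan(List.of(new int[] {1, 2, 3, 4}, new int[] {5, 6})));
        System.out.println(drain(plan)); // prints [[2, 4], [6]]
    }
}
```

Note that drain() fetches the container once and then only calls next(); the contents it sees change underneath it on every call. That behaviour is what a name like BatchIterator or BatchCursor would signal more clearly than RecordBatch does.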
