Hello all,

I was trying to document the interfaces we expect developers to interact with when working with Drill, and ran into a refactoring we might want to do for UDFs. The setup method of a UDF, which is designed to run once before the function is evaluated on any of the input data, currently takes an instance of RecordBatch. (See the other discussion thread about renaming this class; it is our base class for operators, not a data structure.) This input is rarely used, and I think we should consider removing it.

The only current use of this parameter is to look up the current time and timezone from the fragment context of the record batch. We already have a mechanism for giving UDFs a hook into the wider execution context: the @Inject annotation. It is currently implemented for a single type, DrillBuf, our primary storage buffer type for all data in Drill. An injected DrillBuf gives the UDF a reusable temporary buffer that can be reallocated as needed, which is useful when a UDF produces variable-length data and needs somewhere to store its intermediate work. So that these buffers can be accounted for, they must be connected to the fragment's memory allocator; this happens when they are created and injected into the runtime-generated expression evaluation code.

I believe we should do something similar and inject a wrapper object for the current time and timezone information, which is currently gathered through the direct reference to the RecordBatch passed to the setup method. I had tabled this work because it was not a bug but a clarification of an API. We should expose a limited set of fragment/query context to UDF writers and be explicit about what it is.

This has re-emerged because I have been trying to allow more advanced filters against our generated partition columns, with at least constant expression evaluation, when determining which folder or partition to read. The use case I am trying to enable is finding a 'most recent' folder by calling the now() function and formatting the date to match a folder naming pattern for dates. To do this I have been looking at the interpreted expression evaluation code that was added to the codebase but has not yet been hooked up to partition pruning.
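
To make the idea of a narrow, injected date/time context concrete, here is a rough sketch in plain Java. Every name in it (FolderNameSketch, QueryDateTimeInfo, MostRecentFolderFunc, mostRecentFolder) is hypothetical and is not an existing Drill class. The injection step is simulated with a plain field assignment rather than the real @Inject machinery, and the yyyy-MM-dd folder pattern is only an assumed example.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical names throughout; none of these are existing Drill classes.
final class FolderNameSketch {

  /** Read-only view of the query context that a UDF would be allowed to see. */
  static final class QueryDateTimeInfo {
    private final Instant queryStart;
    private final ZoneId zone;

    QueryDateTimeInfo(Instant queryStart, ZoneId zone) {
      this.queryStart = queryStart;
      this.zone = zone;
    }

    /** Query start time expressed in the configured time zone. */
    ZonedDateTime queryStartDateTime() {
      return queryStart.atZone(zone);
    }
  }

  /** Stand-in for a UDF whose setup() no longer takes a RecordBatch. */
  static final class MostRecentFolderFunc {
    // In a real UDF this field would be populated by the framework.
    QueryDateTimeInfo dateTimeInfo;
    private String folder;

    void setup() {
      // Assumed folder naming pattern, chosen only for illustration.
      DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd");
      folder = pattern.format(dateTimeInfo.queryStartDateTime());
    }

    String eval() {
      return folder;
    }
  }

  /** Runs the function with a context built by hand; no operator is needed. */
  static String mostRecentFolder(Instant queryStart, ZoneId zone) {
    MostRecentFolderFunc func = new MostRecentFolderFunc();
    func.dateTimeInfo = new QueryDateTimeInfo(queryStart, zone); // simulated injection
    func.setup();
    return func.eval();
  }

  public static void main(String[] args) {
    System.out.println(mostRecentFolder(Instant.now(), ZoneId.systemDefault()));
  }
}
```

Because the function depends only on the small context object, anything that knows the query start time and timezone can construct that object and evaluate the function, whether or not an operator is running.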

The interpreted expression evaluator currently passes a record batch into the evaluator to satisfy the UDF interface, but a primary place where we were planning to use interpreted expression evaluation is at planning time, as is the case with partition pruning. At planning time we do not have a RecordBatch available to pass into the evaluator, and creating a mock implementation of the interface seems like a hack, to say the least.

Let me know your thoughts on how best to modify the interface.

Thanks,
Jason
