Alex,

At the sync-up, you asked why we didn't add dictionary predicate evaluation to the filter API. For the most part it was driven by how Presto does predicate pushdown: they prefer to do it via their TupleDomain implementation, which is quite efficient for normal files and also works well with the vectorized read path we're using. We also have custom logic in the Parquet read path to make things fast for Presto, so this made sense.
But I'm also interested in getting this into Pig/Hive/Spark, because we have a lot of ETL and other use cases where we can sort our data to take advantage of it, so I thought I'd reach out with a few of the issues I saw when looking into how the filter API currently works.

filter2 has the right level of support for skipping an entire row group, which is good, but what's missing is enough context to actually read the dictionary for a column. From the API I have the column chunk metadata, which includes the necessary offsets, but no reference to either the path of the file being processed or the file reader to access the stream.

In our Presto implementation, we simply read the dictionary directly after reading the footer, which does incur the cost of a second read when a match is located, but we want the dictionary before we load the column data into memory. This could get somewhat complicated with the column projection logic if we only want a single read.

Just wanted to see if you had any thoughts before I dive into the necessary changes to the API to support this.

Thanks,
Dan
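P.S. For concreteness, here's a rough sketch of the kind of hook I have in mind. Nothing here is in filter2 today; the names (`DictionaryPredicate`, `canDrop`) are made up for illustration. The idea is just: given the decoded dictionary values for a column, decide whether the whole row group can be dropped before any column data is loaded.

```java
import java.util.Set;

// Hypothetical hook, not part of the filter2 API: given a column chunk's
// decoded dictionary values, decide whether the whole row group can be
// dropped without reading the column data.
interface DictionaryPredicate<T> {
    boolean canDrop(Set<T> dictionaryValues);
}

public class DictionaryFilterSketch {
    // Equality predicate: if the sought value is absent from the dictionary,
    // no row in the chunk can possibly match, so the row group is skippable.
    static <T> DictionaryPredicate<T> eq(T value) {
        return dict -> !dict.contains(value);
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("a", "b", "c");
        System.out.println(eq("z").canDrop(dict)); // true: "z" not in dictionary
        System.out.println(eq("b").canDrop(dict)); // false: "b" might match
    }
}
```

The open question from the email is exactly where the `dictionaryValues` set would come from: evaluating this needs a file path or reader handle alongside the column chunk metadata, which the current callback doesn't carry.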
