Alex,

At the sync-up, you asked why we didn't add dictionary predicate
evaluation to the filter API, and for the most part it was driven by how
Presto does predicate pushdown (PPD).  Presto prefers to do it via its
TupleDomain implementation, which is pretty efficient for normal files and
also works well with the vectorized read path we're using.  We also have
custom logic in the Parquet read path to make things fast for Presto, so
this made sense.

But I'm also interested in getting this into Pig/Hive/Spark, because we
have a lot of ETL and other use cases where we can sort our data to take
advantage of it.

I thought I'd reach out to you with a few of the issues I saw while
looking into how the filter API currently works.

The filter2 API has the right level of support for skipping an entire row
group, which is good, but what's missing is enough context to actually
read the dictionary for a column.  From the API I have the column chunk
metadata, which includes the necessary offsets, but no reference to either
the path of the file being processed or the file reader to access the
stream.
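To make the gap concrete, here's a minimal sketch of the kind of drop
check I have in mind.  The types and names here are stubs I made up for
illustration, not the real filter2 API: the point is only that the
predicate evaluation needs per-column access to the decoded dictionary,
which is exactly the context the current API doesn't hand you.

```java
import java.util.Set;

public class DictionaryFilterSketch {

    // Stand-in for the per-column context the filter would need:
    // a way to get the decoded dictionary values for one column chunk.
    // (Hypothetical interface, not part of parquet's filter2.)
    interface DictionaryProvider {
        // Returns null when the chunk is not dictionary-encoded.
        Set<String> readDictionary(String columnPath);
    }

    // For a predicate like `col = literal`: drop the row group when the
    // literal is provably absent from the column's dictionary.
    static boolean canDropEq(DictionaryProvider dicts,
                             String columnPath, String literal) {
        Set<String> dict = dicts.readDictionary(columnPath);
        if (dict == null) {
            // No dictionary: can't decide here, must read the chunk.
            return false;
        }
        return !dict.contains(literal);
    }
}
```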

In our Presto implementation, we simply read the dictionary directly
after reading the footer.  That does incur the cost of a second read when
a match is found, but we want the dictionary before we load the column
data into memory.  Collapsing this into a single read could get somewhat
complicated with the column projection logic.
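Roughly, the read order looks like the sketch below.  Again these are
hypothetical stub types (not our actual Presto code): a small seek to the
dictionary page first, and the second, larger read of the column chunk
only when the dictionary can't rule the chunk out.

```java
import java.util.Set;

public class TwoPassReadSketch {

    // Stub for whatever gives us positioned reads into the file.
    interface FileReader {
        // First seek: decode just the dictionary page at this offset.
        Set<String> readDictionaryAt(long dictOffset);
        // Second seek: read the full column chunk at this offset.
        byte[] readColumnChunkAt(long dataOffset);
    }

    // Returns the chunk bytes, or null when the dictionary proves the
    // equality predicate can't match, so the chunk is skipped entirely
    // and only the small dictionary read is paid.
    static byte[] readIfDictionaryMatches(FileReader reader,
                                          long dictOffset, long dataOffset,
                                          String literal) {
        Set<String> dict = reader.readDictionaryAt(dictOffset);
        if (dict != null && !dict.contains(literal)) {
            return null; // provably no match: skip the column data
        }
        return reader.readColumnChunkAt(dataOffset); // the second read
    }
}
```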

Just wanted to see if you had any thoughts about this before I dive into
the necessary changes to the api to support this.

Thanks,
Dan
