Re: Missing sync-up this week; Bloom filter status

俊杰陈 Wed, 16 Aug 2017 06:56:48 -0700

We have run a 100 GB scale benchmark comparing the bloom filter against the
dictionary filter. Please see the data here:
https://docs.google.com/spreadsheets/d/1OIB920l9U_aCGXVeVIUUDL2chjxss7CrSLy6lUHfPaE/edit?usp=sharing


During the benchmark we found that the dictionary filter only works in very
limited cases, because:
1.  When cardinality is large, the dictionary encoding falls back to plain
encoding, so the dictionary filter cannot be applied.
2.  When cardinality is small, the values probably exist in most row groups,
so only a few row groups can be filtered.
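
To make the two cases concrete, here is a minimal sketch in Python. It is not
Parquet code: the function names and the max_dict_entries limit are
hypothetical stand-ins for the writer's dictionary size limit, and each row
group is modelled as a plain list of values.

```python
def dictionary_can_skip(row_group_values, target, max_dict_entries):
    """Return True if a dictionary-based filter could skip this row group."""
    distinct = set(row_group_values)
    if len(distinct) > max_dict_entries:
        # Assumption: too many distinct values means the writer fell back to
        # plain encoding, so the dictionary cannot be used for filtering.
        return False
    # A usable dictionary lists every value in the row group, so the row
    # group can be skipped only when the target is absent from it.
    return target not in distinct


def count_skippable(row_groups, target, max_dict_entries):
    """Count the row groups that the dictionary filter could skip."""
    return sum(
        dictionary_can_skip(rg, target, max_dict_entries) for rg in row_groups
    )


# Case 1: large cardinality, every row group exceeds the dictionary limit.
high_cardinality = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]

# Case 2: small cardinality, the same few values appear in every row group.
low_cardinality = [[0, 1, 2] * 100 for _ in range(4)]

# Favourable case: few distinct values, clustered by row group.
clustered = [[i] * 300 for i in range(4)]

for name, groups in [
    ("high cardinality", high_cardinality),
    ("low cardinality", low_cardinality),
    ("clustered", clustered),
]:
    print(name, count_skippable(groups, target=1, max_dict_entries=100))
```

Under these assumptions, only the clustered layout lets the dictionary filter
skip any row groups; in the other two layouts every row group must be read.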


Regarding enabling the dictionary filter (PARQUET-1061): I found it is not a
bug on the Parquet side. Hive uses a deprecated API for its PPD (predicate
pushdown), which causes the problem. I filed HIVE-17261 and submitted a patch.


2017-08-15 11:19 GMT+08:00 Jim Apple <[email protected]>:

> I'll be missing the Google Meet sync-up this week, so I wanted to
> share briefly where I think we are on PARQUET-41:
>
> I think we have agreement on the form that the filters will take,
> including the hash functions, but I believe we still don't have a
> benchmark against dictionary filtering. Last I heard, that was blocked
> by https://issues.apache.org/jira/browse/PARQUET-1061, though that
> ticket is now marked Resolved.
>



-- 
Thanks & Best Regards
