[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015449#comment-16015449 ]
Junjie Chen commented on PARQUET-41:
------------------------------------

Hi [~rdblue]

We have a real use case from a telecom company that has to store very detailed call records for its subscribers. The record schema mainly includes caller Tel number, receiver Tel number, phone IMEI, user id, location, ip, tunnel, ring_time, client_mac, server_mac, receiver phone type, imei, etc., 90+ columns in total. Every row represents a call record, and about 10 billion rows arrive every day. The records in a row group (row group size is set to 256MB) are mostly distinct.

They have a very common requirement to retrieve records within seconds for simple conditional queries, such as "select * from table where use_num='13xxxxxxxxx'". I think a Bloom filter should be more suitable than a dictionary-based filter here, considering the extra data space and the data cardinality. What do you think?

Many banks have similar scenarios: they capture transaction events, store them in a database, and execute conditional queries afterward.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
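
To make the row-group filtering idea concrete, here is a minimal sketch of how a per-row-group Bloom filter could answer an equality predicate on a high-cardinality column. It is an illustration only, not the parquet-mr implementation or the design in the pull request: the class and function names, the SHA-256 based double hashing, and the sample phone numbers are all assumptions made for this example. The sizing uses the standard formulas, which give roughly 9.6 bits per distinct value at a 1% false-positive rate.

```python
import hashlib
import math


def bloom_parameters(n_values, fpp):
    """Return (bits, hash_count) from the standard Bloom filter sizing formulas."""
    bits = math.ceil(-n_values * math.log(fpp) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n_values * math.log(2)))
    return bits, hash_count


class BloomFilter:
    """Hypothetical per-row-group filter for one column (illustrative only)."""

    def __init__(self, n_values, fpp):
        self.bits, self.hash_count = bloom_parameters(n_values, fpp)
        self.bitset = bytearray((self.bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from two 64-bit halves
        # of a single SHA-256 digest (an assumption, chosen for simplicity).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hash_count)]

    def add(self, value):
        for pos in self._positions(value):
            self.bitset[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bitset[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def row_groups_to_read(filters, value):
    """Indexes of row groups that cannot be skipped for `column = value`."""
    return [i for i, f in enumerate(filters) if f.might_contain(value)]


if __name__ == "__main__":
    # Two made-up row groups of mostly distinct phone numbers.
    row_groups = [
        [f"1380000{i:04d}" for i in range(1000)],
        [f"1390000{i:04d}" for i in range(1000)],
    ]
    filters = []
    for values in row_groups:
        bf = BloomFilter(len(values), fpp=0.01)
        for v in values:
            bf.add(v)
        filters.append(bf)

    print("row groups to read:", row_groups_to_read(filters, "13800000042"))

    bits, k = bloom_parameters(1_000_000, 0.01)
    print(f"1M distinct values at 1% fpp: {bits / 8 / 1024:.0f} KiB, {k} hashes")
```

A false positive only costs an unnecessary row-group read; a value that is actually present is never filtered out. That is why the filter can stay small even when nearly every value in the row group is distinct, which is the case where a dictionary would grow toward the size of the column itself.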