[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015449#comment-16015449
 ] 

Junjie Chen commented on PARQUET-41:
------------------------------------

Hi [~rdblue]
We have a real use case from a telecom company that has to store very detailed 
call records for its subscribers. The record schema mainly includes caller Tel 
number, receiver Tel number, phone IMEI, user id, location, ip, tunnel, 
ring_time, client_mac, server_mac, receiver phone type, imei, etc., 90+ columns 
in total. Every row represents one call record, and about 10 billion rows 
arrive every day. The values within a row group (row group size is set to 
256MB) are mostly distinct. 

They have a very common requirement to retrieve records within seconds for 
simple conditional queries, such as "select * from table where 
use_num='13xxxxxxxxx'". I think a Bloom filter would be more suitable than a 
dictionary-based filter, considering the extra data space and the data 
cardinality. What do you think? 
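
To make the space argument a bit more concrete, here is a rough sketch using 
the standard Bloom filter sizing formulas (m = -n ln p / (ln 2)^2 bits and 
k = (m / n) ln 2 hash functions). The distinct count and false-positive rate 
are made-up illustrative numbers, and the class and method names are 
hypothetical; this is not the parquet-mr implementation.

```java
// Sketch only: textbook Bloom filter sizing, not Parquet's actual code.
class BloomSizing {
    // Bits needed for n distinct values at target false-positive probability p:
    // m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Number of hash functions that minimizes false positives: k = (m / n) * ln 2
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long distinctValues = 1_000_000L; // assumed distinct values in one column chunk
        double fpp = 0.01;                // assumed target false-positive rate
        long bits = optimalBits(distinctValues, fpp);
        int hashes = optimalHashes(distinctValues, bits);
        System.out.printf("bits=%d (~%.2f MB), hashes=%d%n",
                bits, bits / 8.0 / 1024 / 1024, hashes);
    }
}
```

At a 1% false-positive rate this comes to a little under 10 bits per distinct 
value, independent of the value's length, whereas a dictionary has to store 
every distinct value itself (an 11-digit number kept as text needs at least 11 
bytes).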

Many banks have similar scenarios: they capture transaction events, store them 
in a database, and run conditional queries afterward.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



