[ https://issues.apache.org/jira/browse/PIG-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132212#comment-13132212 ]
Alan Gates commented on PIG-2328:
---------------------------------
bq. Correct me if I am wrong, but this doesn't work if you use 2 different
bloom filters in a single task.
Glad you caught that. I'd meant to fix it and forgot.
bq. Why "contains" test for jenkins and murmur?
The Hadoop names for these are Hash.JENKINS_HASH and Hash.MURMUR_HASH. I
assumed people might copy some or all of those strings from the Hadoop docs and
use them, and I wanted it to work whether they used "jenkins", "jenkins_hash",
or "Hash.JENKINS_HASH".
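
To illustrate the idea, here is a minimal sketch of a "contains"-style lookup.
It is not the code from the patch: the class name HashTypeParser, the method
parse, and the case-insensitive comparison are assumptions made for the
example, and the two constants are local stand-ins so the snippet compiles
without Hadoop on the classpath.

```java
import java.util.Locale;

// Hypothetical helper, not part of PIG-bloom.patch.
final class HashTypeParser {
    // Local stand-ins for Hadoop's Hash.JENKINS_HASH and Hash.MURMUR_HASH.
    static final int JENKINS_HASH = 0;
    static final int MURMUR_HASH = 1;

    // Map a user-supplied name to a hash type by substring match, so that
    // "jenkins", "jenkins_hash", and "Hash.JENKINS_HASH" all resolve alike.
    static int parse(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.contains("jenkins")) {
            return JENKINS_HASH;
        }
        if (lower.contains("murmur")) {
            return MURMUR_HASH;
        }
        throw new IllegalArgumentException("Unknown hash type: " + name);
    }
}
```

The trade-off of a substring test is that it also accepts strings nobody
intended, such as "notjenkins"; an exact match against a fixed list of
spellings would be stricter but less forgiving of copied text.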
On defining the filter by number of elements and desired accuracy: I agree that
would be nice. I may put that in a follow-on patch, though; we'll see if I can
finish it in the next few days.
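
For reference, sizing a filter from an element count and a target false
positive rate usually relies on the standard bloom filter formulas
m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the
number of hash functions. The sketch below shows only that arithmetic; the
class BloomSizing and its method names are hypothetical and say nothing about
how the follow-on patch would expose this.

```java
// Hypothetical helper showing the standard sizing formulas only.
final class BloomSizing {
    // Bits needed for n expected elements at false positive rate p (0 < p < 1).
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Number of hash functions for m bits and n expected elements, at least 1.
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}
```

A caller would compute the bit count first and pass it to optimalHashes, for
example with n = 1000 and p = 0.01.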
The same goes for operating directly on a relation. I'll see if I can get it
working soon; if not, I may do it in a follow-on patch.
> Add builtin UDFs for building and using bloom filters
> -----------------------------------------------------
>
> Key: PIG-2328
> URL: https://issues.apache.org/jira/browse/PIG-2328
> Project: Pig
> Issue Type: New Feature
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.10
>
> Attachments: PIG-bloom.patch
>
>
> Bloom filters are a common way to select a limited set of records before
> moving data for a join or other heavyweight operation. Pig should add UDFs
> to support building and using bloom filters.