Slightly digressing and possibly rambling - feel free to ignore!
Consider the general problem where both lists are 'large' (too large to
fit into memory).
A general solution for this, when the list of blacklisted emails is also
too large for memory, is an interesting problem. It is probably something
which might benefit from the Accumulator interface (not sure if it is
relevant here though).
Off the top of my head, I am thinking of something like:
cross, add a match_check field, group by email, check if any match is
found in the bag.
Note, the cross is too damn expensive anyway ... but at least it is
distributable.
This last operation will be on bags as large as the spam list - and since
the contention is that it is too large, pig can't 'normally' solve it (it
will run out of mem). An Accumulator interface based filter (if possible
here) should help us with it ... (you return true as soon as any match is
found).
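To make the early-exit idea concrete, here is a minimal sketch in plain Python. It is not Pig's Accumulator API: the class name, method names and the batching helper are hypothetical stand-ins, and the match rule assumes the suffix semantics of the LIKE condition from the original question. The point is only that the bag is consumed one batch at a time and no pattern is tested once a match is known.

```python
def suffix_match(email, pattern):
    # Same idea as `email LIKE concat('%', pattern)` in the original SQL
    # (assumption: case-sensitive comparison).
    return email.endswith(pattern)


class AnyMatchAccumulator:
    """Hypothetical stand-in for an accumulator-style filter UDF.

    The grouped bag is fed in batches, so it never has to fit in memory.
    """

    def __init__(self, email):
        self.email = email
        self.found = False
        self.patterns_checked = 0

    def accumulate(self, batch):
        if self.found:
            return  # already decided; skip the rest of the bag
        for pattern in batch:
            self.patterns_checked += 1
            if suffix_match(self.email, pattern):
                self.found = True
                return

    def get_value(self):
        return self.found


def batches(items, size):
    # Hypothetical helper that mimics a framework handing over the bag in chunks.
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    acc = AnyMatchAccumulator("bob@domain.de")
    bag = ["@spam.example", "@domain.de", "x@y.example", "@other.example"]
    for batch in batches(bag, 2):
        acc.accumulate(batch)
    print(acc.get_value(), acc.patterns_checked)
```

Only the state (one boolean) is kept between batches, which is what keeps the memory footprint flat regardless of the bag size.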
Another possible alternative would be to write an algebraic udf which
returns 1 as soon as it sees a match, else 0, as its intermediate output
(reducing the final output size), and follow the foreach with a filter on
this field.
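A rough sketch of the algebraic variant, again in plain Python rather than against Pig's Algebraic interface. The three function names mirror the usual initial / intermediate / final split but are otherwise hypothetical, and the suffix match is an assumption taken from the LIKE condition in the original question. Each partial result is a single 0 or 1, so very little data moves between stages.

```python
def initial(email, pattern):
    # Map side: one crossed (email, pattern) row becomes a 0/1 flag.
    return 1 if email.endswith(pattern) else 0


def intermediate(partials):
    # Combiner: any() stops at the first 1 it sees.
    return 1 if any(p == 1 for p in partials) else 0


def final(partials):
    # Reduce side: same rule applied to the combined partial flags.
    return 1 if any(p == 1 for p in partials) else 0


if __name__ == "__main__":
    email = "bob@domain.de"
    # Crossed rows for one email, split across two hypothetical map tasks.
    task_a = ["@spam.example", "@other.example"]
    task_b = ["x@y.example", "@domain.de"]
    partial_a = intermediate(initial(email, p) for p in task_a)
    partial_b = intermediate(initial(email, p) for p in task_b)
    match_check = final([partial_a, partial_b])
    print(match_check)  # the later filter keeps rows where this is 0
```

This does nothing about the cost of the cross itself; it only shrinks what has to be shipped and held per email after it.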
But in any case, the CROSS is a killer - and loading everything into
memory is not always practical ... I am wondering how to solve this as a
general problem (there are specific solutions though).
Regards,
Mridul
On Friday 26 February 2010 12:44 AM, Dmitriy Ryaboy wrote:
Bill, that doesn't work if he's trying to do a join to a table of
blacklisted patterns.
Jan, because of the fundamental way Map-Reduce works, Joins work on equality
operators. If your blacklist is not huge (just a few megs perhaps?) you can
just put the file containing your blacklist in HDFS, use the cache directive
to make sure your worker nodes are prepped to use it efficiently, and then
write a UDF that will take one of your strings and run it through the
blacklist to check if any entries match. You could then filter by this UDF.
This could be done reasonably efficiently. Check out LookupInFiles (in the
piggybank) for something similar.
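For illustration, the lookup such a UDF would perform can be sketched in plain Python. This is not the LookupInFiles implementation and not Pig's UDF API: the function names are hypothetical, the blacklist is read from an in-memory stand-in for the cached file, and the sketch assumes that entries starting with "@" are whole-domain wildcards while everything else is an exact address.

```python
import io


def load_blacklist(lines):
    """Split blacklist entries into exact addresses and '@domain' wildcards."""
    exact, domains = set(), set()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if entry.startswith("@"):
            domains.add(entry)
        else:
            exact.add(entry)
    return exact, domains


def is_blacklisted(email, exact, domains):
    # Two set lookups per email instead of a scan over every pattern.
    if email in exact:
        return True
    at = email.rfind("@")
    return at != -1 and email[at:] in domains


if __name__ == "__main__":
    # Stand-in for the blacklist file shipped to each worker node.
    blacklist_file = io.StringIO("spammer@mail.example\n@domain.de\n")
    exact, domains = load_blacklist(blacklist_file)
    customers = [(1, "bob@domain.de"), (2, "amy@ok.example"),
                 (3, "spammer@mail.example")]
    kept = [c for c in customers if not is_blacklisted(c[1], exact, domains)]
    print(kept)
```

Loading the file once per task and keeping it in sets is what makes the per-row check cheap; it only works while the blacklist fits comfortably in memory.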
-D
On Thu, Feb 25, 2010 at 11:03 AM, Bill Graham wrote:
You could specify a condition using the RegexMatch or RegexExtract UDF
in piggybank:
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java
On Thu, Feb 25, 2010 at 10:17 AM, Jan Zimmek wrote:
hi,
i recently found pig, really like it and want to use it for one of our
actual projects.
getting the basics running was easy, but now i am struggling with a
problem.
i am trying to get customers whose email is not blacklisted.
blacklist entries can be specified as:
[email protected]
or wildcarded
@domain.de
in sql i would solve this by:
----
select
*
from
customer c
left join blacklist b
on
c.email like concat("%",b.email)
where
b.email is null
----
this is the structure of my input files:
raw_customer = LOAD 'customer.csv' USING PigStorage('\t') AS (id: long, email: chararray);
raw_blacklist = LOAD 'blacklist.csv' USING PigStorage('\t') AS (email: chararray);
how would i solve this using pig ? - especially handling the "like %"
condition.
i already looked into udf, but need some advice how to implement this.
any help would be really appreciated.
regards,
jan