Re: filter/join by sql like "%pattern" condition

Mridul Muralidharan Sun, 28 Feb 2010 13:15:46 -0800


Slightly digressing and possibly rambling - feel free to ignore !

Consider the general problem where both lists are 'large' (too large to fit into memory).

A general solution for this, when the list of blacklisted emails is also large, is an interesting problem. It is probably something which might benefit from the Accumulator interface (not sure if it is relevant here though).


Off the top of my head, I am thinking of something like:

cross, add match_check field, group by email, check if any match found in bag.

Note, the cross is too damn expensive anyway ... but at least it is distributable.
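To make the plan above concrete, here is a minimal single-machine sketch in plain Python rather than Pig. The sample data and the names `customers`, `blacklist` and `crossed` are made up for illustration; in Pig each step would be a distributed CROSS / FOREACH / GROUP instead.

```python
from itertools import groupby, product

# Hypothetical sample data, standing in for the two large inputs.
customers = [(1, "alice@example.com"), (2, "bob@domain.de"), (3, "carol@spam.org")]
blacklist = ["carol@spam.org", "@domain.de"]

# cross: pair every customer with every blacklist entry, then add a
# match_check field (the SQL `like concat("%", b.email)` test).
crossed = [
    (cid, email, email.endswith(pattern))
    for (cid, email), pattern in product(customers, blacklist)
]

# group by customer: sort first, because groupby only merges adjacent rows.
crossed.sort(key=lambda row: (row[0], row[1]))
clean = []
for key, rows in groupby(crossed, key=lambda row: (row[0], row[1])):
    # keep the customer only if no row in its bag matched
    if not any(matched for _, _, matched in rows):
        clean.append(key)

print(clean)
```

The cross still produces one row per (customer, pattern) pair, which is exactly the cost complained about above.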

This last operation will be on bags as large as the spam list, and since the contention is that the list is too large, Pig can't 'normally' solve it (it will run out of memory). An Accumulator-interface-based filter (if possible here) should help us with it ... (you return true as soon as any match is found).
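The early-exit idea can be sketched in plain Python. This is not Pig's Accumulator API, just the control flow it would allow; `any_match`, `tracked` and the sample bag are hypothetical. The bag is consumed as a stream and reading stops at the first match, so it never has to be held in memory.

```python
def any_match(email, patterns):
    """Return True at the first pattern that `email` ends with.

    `patterns` may be any iterable, including a lazy one, so the whole
    bag is never materialised.
    """
    for pattern in patterns:
        if email.endswith(pattern):
            return True  # stop consuming the bag here
    return False


def tracked(patterns, seen):
    """Yield patterns lazily and record which ones were actually read."""
    for pattern in patterns:
        seen.append(pattern)
        yield pattern


seen = []
bag = tracked(["@spam.org", "@domain.de", "@never-read.example"], seen)
result = any_match("bob@domain.de", bag)
print(result, len(seen))
```

Here the third pattern is never pulled from the stream, because the second one already matched.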

Another possible alternative would be to write an algebraic UDF which returns 1 as soon as it sees a match, else 0 (reducing the intermediate output and the final output size), and follow the foreach with a filter on this field.
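A rough illustration of why such a UDF can be algebraic, in plain Python rather than against Pig's Algebraic interface (the function names and the chunking are assumptions for illustration): each chunk of the bag is reduced to a single 0/1 flag, and flags combine with max, so only one small value per chunk travels between stages.

```python
def initial(email, pattern):
    """Per-row step: 1 if this single pattern matches, else 0."""
    return 1 if email.endswith(pattern) else 0


def intermediate(flags):
    """Combine partial flags; max is associative, so any chunking works."""
    return max(flags, default=0)


def final(flags):
    """Final step: same combination, yielding the field to filter on."""
    return max(flags, default=0)


email = "bob@domain.de"
# Hypothetical chunks of the blacklist, as different tasks might see them.
chunks = [["@spam.org", "carol@spam.org"], ["@domain.de"], []]

partials = [intermediate(initial(email, p) for p in chunk) for chunk in chunks]
matched = final(partials)
print(partials, matched)
```

A filter on `matched == 0` would then keep only the customers with no blacklist hit.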

But in any case, the CROSS is the killer, and loading everything into memory is not always practical ... wondering how to solve this as a general problem (there are specific solutions though).



Regards,
Mridul

On Friday 26 February 2010 12:44 AM, Dmitriy Ryaboy wrote:

Bill, that doesn't work if he's trying to do a join to a table of
blacklisted patterns.

Jan, because of the fundamental way Map-Reduce works, Joins work on equality
operators. If your blacklist is not huge (just a few megs perhaps?) you can
just put the file containing your blacklist in HDFS, use the cache directive
to make sure your worker nodes are prepped to use it efficiently, and then
write a UDF that will take one of your strings and run it through the
blacklist to check if any entries match. You could then filter by this UDF.
This could be done reasonably efficiently. Check out LookupInFiles (in the
piggybank) for something similar.
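For what it's worth, the matching logic such a UDF would need is small. Here is a sketch in plain Python (a real Pig UDF would be written against Pig's UDF API; `load_blacklist`, `is_blacklisted` and the sample entries are hypothetical):

```python
import io


def load_blacklist(lines):
    """Read one pattern per line, skipping blanks, into a tuple."""
    return tuple(line.strip() for line in lines if line.strip())


def is_blacklisted(email, patterns):
    """True if `email` ends with any pattern, like SQL LIKE '%pattern'."""
    # str.endswith accepts a tuple of suffixes and checks them all.
    return email.endswith(patterns)


# Stand-in for the small blacklist file shipped to every worker.
blacklist_file = io.StringIO("carol@spam.org\n@domain.de\n\n")
patterns = load_blacklist(blacklist_file)

customers = [(1, "alice@example.com"), (2, "bob@domain.de"), (3, "carol@spam.org")]
kept = [(cid, email) for cid, email in customers if not is_blacklisted(email, patterns)]
print(kept)
```

The blacklist is read once and reused for every customer row, which is what makes this workable when the list is only a few megabytes.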

-D



On Thu, Feb 25, 2010 at 11:03 AM, Bill Graham<[email protected]>  wrote:

You could specify a condition using the RegexMatch or RegexExtract UDF
in piggybank:


http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java


http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java

On Thu, Feb 25, 2010 at 10:17 AM, Jan Zimmek<[email protected]>
wrote:

hi,

i recently found pig, really like it and want to use it for one of our
actual projects.

getting the basics running was easy, but now i am struggling on a
problem.


i am trying to get customers whose email is not blacklisted.

blacklist entries can be specified as:

[email protected]

or wildcarded

@domain.de

in sql i would solve this by:

----

select
  *
from
  customer c
left join blacklist b
on
  c.email like concat("%",b.email)
where
  b.email is null

----

this is the structure of my input files:

raw_customer = LOAD 'customer.csv' USING PigStorage('\t') AS (id: long, email: chararray);
raw_blacklist = LOAD 'blacklist.csv' USING PigStorage('\t') AS (email: chararray);


how would i solve this using pig ? - especially handling the "like %"
condition.

i already looked into udf, but need some advice how to implement this.


any help would be really appreciated.

regards,
jan
