Hello,

Does this make sense? I'm generating reports using Pig, and I only want to
report on rows matching a set of regular expressions, but those regular
expressions are pretty numerous. Some reports have 500 matching clauses and
others have 6000.

Pig fails with an internal error when I run FILTER with all 500 terms, so I
split them into two chunks of 250 terms and UNION the results. It works
great, but is that the sensible thing to do, or am I missing something
obvious?
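
For illustration, the chunk-and-UNION workaround could be generated by a small
script rather than written by hand. This is only a sketch: the relation name
`rows`, the field name `term`, the aliases `chunk_N` and `matched`, and the
helper `build_pig_filter` are all hypothetical, and it assumes the patterns
contain no single quotes that would need escaping in Pig.

```python
def build_pig_filter(patterns, chunk_size=250, source="rows", field="term"):
    """Return Pig Latin statements that FILTER in chunks and UNION the results.

    All names (source relation, field, aliases) are hypothetical placeholders.
    Assumes no pattern contains a single quote.
    """
    if not patterns:
        raise ValueError("need at least one pattern")

    # Split the pattern list into chunks of at most chunk_size terms.
    chunks = [patterns[i:i + chunk_size]
              for i in range(0, len(patterns), chunk_size)]

    def clause(chunk):
        # One MATCHES test per pattern, joined with OR.
        return " OR ".join("(%s MATCHES '%s')" % (field, p) for p in chunk)

    # A single chunk needs no UNION at all.
    if len(chunks) == 1:
        return ["matched = FILTER %s BY %s;" % (source, clause(chunks[0]))]

    statements = []
    aliases = []
    for n, chunk in enumerate(chunks):
        alias = "chunk_%d" % n
        statements.append("%s = FILTER %s BY %s;" % (alias, source, clause(chunk)))
        aliases.append(alias)

    # Combine the per-chunk results into one relation.
    statements.append("matched = UNION %s;" % ", ".join(aliases))
    return statements
```

One thing to keep in mind with this approach: Pig's UNION does not remove
duplicates, so a row that matches patterns in more than one chunk would appear
once per chunk unless a DISTINCT is applied afterwards.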

I haven't tried the 6000-term report yet. I don't know what percentage of
the data that represents, but I'm tempted to get rid of the FILTER statement,
generate my report for the whole data set, and then use a quick script to
select out the 6000 terms. Somehow that seems like "cheating", though.
Otherwise I'll repeat the UNION technique above.
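
The "quick script" option might look roughly like the sketch below. It assumes
the unfiltered report is tab-separated with the term in the first column, and
that the patterns behave the same under Python's `re` module as under the Java
regex engine Pig uses (not guaranteed for every pattern). The helper names
`compile_patterns` and `select_rows` are made up for this example.

```python
import re


def compile_patterns(patterns):
    """Combine all patterns into one alternation so each row is tested once."""
    return re.compile("|".join("(?:%s)" % p for p in patterns))


def select_rows(lines, patterns, column=0, sep="\t"):
    """Keep only the lines whose chosen column fully matches some pattern.

    A whole-string match is used to mirror how Pig's MATCHES behaves.
    The column index and separator are assumptions about the report layout.
    """
    matcher = compile_patterns(patterns)
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if column < len(fields) and matcher.fullmatch(fields[column]):
            kept.append(line)
    return kept
```

For example, `select_rows(open("report.tsv"), patterns)` would return the
matching lines of a hypothetical `report.tsv`, given `patterns` as a list of
regex strings.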

Using Hadoop 0.20.2 and Pig 0.6 on Amazon Elastic MR.

thanks!

-Mike

-- 
Mike Subelsky
oib.com // ignitebaltimore.com // subelsky.com
@subelsky
