Hello,

Does this make sense? I'm generating reports using Pig where I only want to report on rows matching a set of regular expressions, but those regular expressions are pretty numerous. Some reports have 500 matching clauses and others have 6000.

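For concreteness, a filter like this can be generated from a pattern list rather than written by hand. The Python below is only a rough sketch under assumptions: the relation name `rows`, the field name `url`, the aliases `part_N` and `matched`, and the helper names are all hypothetical, and the patterns are assumed to be already escaped for use inside a Pig string literal. It emits one FILTER per chunk of patterns and, when there is more than one chunk, a UNION of the pieces.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_pig_filter(patterns, relation="rows", field="url",
                     chunk_size=250, result="matched"):
    """Return Pig Latin text that keeps rows of `relation` whose `field`
    matches any of `patterns`.

    All relation, field, and alias names are hypothetical placeholders.
    Patterns are assumed to be pre-escaped for a Pig string literal.
    """
    if not patterns:
        raise ValueError("need at least one pattern")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    chunks = list(chunked(patterns, chunk_size))
    lines = []
    aliases = []
    for i, chunk in enumerate(chunks):
        # With a single chunk, write straight to the result alias.
        alias = result if len(chunks) == 1 else f"part_{i}"
        clauses = " OR ".join(f"({field} MATCHES '{p}')" for p in chunk)
        lines.append(f"{alias} = FILTER {relation} BY {clauses};")
        aliases.append(alias)

    if len(chunks) > 1:
        # Combine the per-chunk results into one relation.
        lines.append(f"{result} = UNION {', '.join(aliases)};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_pig_filter(["a.*", "b.*", "c.*"], chunk_size=2))
```

One thing to keep in mind with this shape: UNION does not remove duplicates, so a row that matches patterns in more than one chunk would appear once per matching chunk unless a DISTINCT is applied afterwards.
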
Pig fails with an internal error when I run a single FILTER with all 500 terms, so I split it into two chunks of 250 terms each and UNION the results. It works great, but is that the sensible thing to do, or am I missing something obvious?

I haven't tried the 6000-term report yet. I don't know what percentage of the data that represents, but I'm tempted to get rid of the FILTER statement, generate my report for the whole data set, and then use a quick script to select out the 6000 terms. Somehow that seems like "cheating", though. Otherwise I'll repeat the UNION technique above.

I'm using Hadoop 0.20.2 and Pig 0.6 on Amazon Elastic MR.

Thanks!
-Mike

--
Mike Subelsky
oib.com // ignitebaltimore.com // subelsky.com
@subelsky
