Hi Thejas,

Ticket created: https://issues.apache.org/jira/browse/PIG-1475
The exact error was:

2010-06-29 15:46:04,579 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. null

Your example is close to what I'm doing, but I have to do one grouping step after the union, like so:

    L = load 'f1';
    F1 = filter L by exp1 OR exp2 ... exp250 ;
    F2 = filter L by exp251 OR exp252... ;
    COMBINED = UNION F1,F2;
    GROUPED = GROUP COMBINED BY (....) PARALLEL ###;
    COUNTS = FOREACH GROUPED GENERATE FLATTEN($0), COUNT($1) AS COUNT;
    STORE COUNTS INTO '$OUTPUT';

Not sure if that makes a difference. Also, the error only happens in script mode, not when I'm testing in local mode.

-Mike

On Tue, Jun 29, 2010 at 2:54 PM, Thejas Nair <[email protected]> wrote:

> What is the internal error you are getting (details might be in the log
> file)? This does not sound like a known issue. (A new JIRA would be even
> more useful!)
>
> Your workaround of using union should do what you want. I am assuming that
> you have a filter with OR of the regular expression matches.
> I.e.:
>
>     L = load 'f1';
>     Filter1 = filter L by exp1 OR exp2 ... exp250 ;
>     Filter2 = filter L by exp251 OR exp252... ;
>     OUTPUT = UNION Filter1, Filter2;
>
> Only a single MR job is needed for the above query, so there should not be
> much of a performance degradation due to the workaround.
>
> If you generate the report for the whole dataset and then use a filter
> script, you would end up doing an additional read/write of the larger
> dataset.
>
> Thanks,
> Thejas
>
> On 6/29/10 11:13 AM, "Mike Subelsky" <[email protected]> wrote:
>
> > Hello,
> >
> > Does this make sense? I'm generating reports using Pig where I only want
> > to report on rows matching a set of regular expressions, but those
> > regular expressions are pretty numerous. Some reports have 500 matching
> > clauses and others 6000 matching clauses.
> >
> > Pig fails with an internal error when I run FILTER with the 500 terms
> > through, so I split that into two chunks of 250 terms and UNION the
> > results.
> >
> > It works great, but is that the sensible thing to do or am I missing
> > something obvious?
> >
> > I haven't tried the 6000 term report yet. I don't know what percentage
> > of the data that represents, but I'm tempted to get rid of the FILTER
> > statement and generate my report for the whole data set, then use a
> > quick script to select out the 6000 terms, but somehow that seems like
> > "cheating". Otherwise I'll repeat the above UNION technique.
> >
> > Using Hadoop 0.20.2 and Pig 0.6 on Amazon Elastic MR.
> >
> > thanks!
> >
> > -Mike

--
Mike Subelsky
oib.com // ignitebaltimore.com // subelsky.com
@subelsky // (410) 929-4022
