Hi Thejas,

Ticket created: https://issues.apache.org/jira/browse/PIG-1475

The exact error was:
2010-06-29 15:46:04,579 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2998: Unhandled internal error. null

Your example is close to what I'm doing, but I have to do one grouping step
after the union, like so:

L = load 'f1';
F1 = filter L by exp1 OR exp2 ... exp250 ;
F2 = filter L by exp251 OR exp252... ;
COMBINED = UNION F1,F2;
GROUPED = GROUP COMBINED BY (....) PARALLEL ###;
COUNTS = FOREACH GROUPED GENERATE FLATTEN($0), COUNT($1) AS COUNT;
STORE COUNTS INTO '$OUTPUT';

Not sure if that makes a difference. Also, the error only happens in script
mode, not when I'm testing in local mode.
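
For the larger 6000-term report, the chunked filters could be generated
instead of written by hand. Here is a minimal Python sketch of that idea; it
is not the script I actually run. The function names, the positional field
`$0`, and the assumption that every pattern is already safe to place inside a
single-quoted Pig string are all hypothetical. The chunk size of 250 is simply
the size that worked above.

```python
# Sketch: emit Pig Latin that splits one long OR-filter into chunks and UNIONs them.
# Assumptions (not from the thread): the field reference "$0" and the patterns
# passed in are placeholders; patterns must already be escaped for a Pig string.

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def or_clause(patterns, field):
    """Join the patterns into a single 'a OR b OR c' MATCHES expression."""
    return " OR ".join(f"({field} MATCHES '{p}')" for p in patterns)


def build_pig_script(patterns, chunk_size=250, field="$0"):
    """Return Pig Latin text with one FILTER per chunk, combined by UNION."""
    chunks = list(chunked(patterns, chunk_size))
    lines = ["L = load 'f1';"]
    if len(chunks) == 1:
        # A single chunk needs no UNION; filter straight into COMBINED.
        lines.append(f"COMBINED = filter L by {or_clause(chunks[0], field)};")
        return "\n".join(lines)
    aliases = []
    for i, chunk in enumerate(chunks, start=1):
        alias = f"F{i}"
        lines.append(f"{alias} = filter L by {or_clause(chunk, field)};")
        aliases.append(alias)
    lines.append(f"COMBINED = UNION {', '.join(aliases)};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical patterns, only to show the shape of the generated script.
    print(build_pig_script([f"term{i}.*" for i in range(5)], chunk_size=2))
```

The GROUP, FOREACH and STORE statements from the script above would follow the
generated COMBINED line unchanged. One thing to keep in mind: UNION in Pig does
not remove duplicates, so a row that matches patterns in more than one chunk
appears once per matching chunk and would be counted more than once.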

-Mike

On Tue, Jun 29, 2010 at 2:54 PM, Thejas Nair <[email protected]> wrote:

> What is the internal error you are getting? (Details might be in the log
> file.) This does not sound like a known issue. (A new JIRA would be even
> more useful!)
>
> Your workaround of using union should do what you want. I am assuming that
> you have a filter with OR of the regular expression matches.
> i.e.:
> L = load 'f1';
> Filter1 = filter L by exp1 OR exp2 ... exp250 ;
> Filter2 = filter L by exp251 OR exp252... ;
> OUTPUT = UNION Filter1, Filter2;
>
> Only a single MR job is needed for the above query, so there should not be
> much performance degradation due to the workaround.
>
> If you generate the report for whole dataset and then use a filter script,
> you would end up doing an additional read/write of the larger dataset.
>
>
> Thanks,
> Thejas
>
>
>
> On 6/29/10 11:13 AM, "Mike Subelsky" <[email protected]> wrote:
>
> > Hello,
> >
> > Does this make sense?  I'm generating reports using Pig where I only want to
> > report on rows matching a set of regular expressions, but those regular
> > expressions are pretty numerous. Some reports have 500 matching clauses and
> > others 6000 matching clauses.
> >
> > Pig fails with an internal error when I run FILTER with all 500 terms,
> > so I split that into two chunks of 250 terms and UNION the results.
> > It works great, but is that the sensible thing to do or am I missing
> > something obvious?
> >
> > I haven't tried the 6000 term report yet.  I don't know what percentage of
> > the data that represents, but I'm tempted to get rid of the FILTER statement
> > and generate my report for the whole data set, then use a quick script to
> > select out the 6000 terms, but somehow that seems like "cheating".
> > Otherwise I'll repeat the above UNION technique.
> >
> > Using Hadoop 0.20.2 and Pig 0.6 on Amazon Elastic MR.
> >
> > thanks!
> >
> > -Mike
>
>


-- 
Mike Subelsky
oib.com // ignitebaltimore.com // subelsky.com
@subelsky // (410) 929-4022
