[jira] Commented: (PIG-845) PERFORMANCE: Merge Join

Dmitriy V. Ryaboy (JIRA) Wed, 12 Aug 2009 13:35:41 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742562#action_12742562
 ]


Dmitriy V. Ryaboy commented on PIG-845:
---------------------------------------

Alan, Ashutosh -- maybe I am misunderstanding where null keys come from in the 
Indexer. I assumed this was due to the processing that happens in the plan the 
indexer deserializes and attaches to its POLocalRearrange.

With regard to errors, I was referring to this:
{code}
        catch(PlanException e){
            int errCode = 2034;
            String msg = "Error compiling operator " + joinOp.getClass().getCanonicalName();
            throw new MRCompilerException(msg, errCode, PigException.BUG, e);
        }
{code}

The only central place for error codes seems to be the wiki.  A class with a 
bunch of static final error-code constants would be a better place.
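
To make the suggestion concrete, here is a minimal sketch of what such a class could look like. The class name, constant name, and describe() helper are hypothetical and do not exist in Pig; only the value 2034 and its message come from the snippet above.

```java
/**
 * Hypothetical central registry of error codes.
 * Names are illustrative only; this is not an existing Pig class.
 */
final class ErrorCodes {
    private ErrorCodes() {
        // Constants holder; not meant to be instantiated.
    }

    /** Error compiling an operator (value taken from the catch block quoted above). */
    static final int ERROR_COMPILING_OPERATOR = 2034;

    /** Returns a short description for a known code, or a generic fallback. */
    static String describe(int code) {
        switch (code) {
            case ERROR_COMPILING_OPERATOR:
                return "Error compiling operator";
            default:
                return "Unknown error code " + code;
        }
    }
}
```

With something like this, the catch block could reference ErrorCodes.ERROR_COMPILING_OPERATOR instead of a bare 2034, and the wiki table could be generated from the class rather than maintained by hand.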


Ashutosh, I completely disagree with you on changing all tests to run in MR 
mode.  The tests are already impossible to run on a laptop (people, myself 
included, actually submit patches to JIRA just to see if tests pass).  Running 
in MR mode will incur significant overhead per test.  Only things that actually 
rely on the MR bits should be tested in MR mode (using mock objects if 
possible; there has been some progress on that front in Hadoop 0.20, though I 
haven't looked at it yet).

Would love to see a more efficient indexing MR job (which will reduce load on 
the JT, keep schedules less busy, and incur less overhead in task startups by 
requiring fewer tasks), but perhaps not before 0.4 is out the door with 
existing functionality.  Just to be clear, I don't think more than one record per 
block is necessary, but more than one block per task would probably be a good 
thing.
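
As a rough illustration of the task-count argument, the sketch below computes how many indexing tasks are needed when each task handles several blocks. It assumes the number of tasks is simply the block count divided by blocks per task, rounded up; the class and method names are hypothetical and not part of Pig or Hadoop.

```java
/** Hypothetical helper for estimating indexing-job task counts; not Pig code. */
final class IndexTaskEstimate {
    private IndexTaskEstimate() {
        // Static helper only.
    }

    /**
     * Number of tasks needed to cover numBlocks when each task
     * processes up to blocksPerTask blocks (ceiling division).
     */
    static long tasksNeeded(long numBlocks, int blocksPerTask) {
        if (numBlocks < 0 || blocksPerTask <= 0) {
            throw new IllegalArgumentException("numBlocks must be >= 0 and blocksPerTask > 0");
        }
        return (numBlocks + blocksPerTask - 1) / blocksPerTask;
    }
}
```

For example, under this assumption an input of 1000 blocks needs 1000 tasks at one block per task but only 125 tasks at eight blocks per task, which is where the savings in task startup and JT load would come from.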

Any thoughts on how to choose which of the two relations to index? We get 
locality on the non-indexed relation, but not on the indexed one, which 
probably puts a kink in the usual way of thinking about this.



> PERFORMANCE: Merge Join
> -----------------------
>
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Ashutosh Chauhan
>         Attachments: merge-join.patch
>
>
> This join would work if the data for both tables is sorted on the join key.

