[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

Pradeep Kamath (JIRA) Thu, 10 Sep 2009 14:11:22 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753816#action_12753816
 ]


Pradeep Kamath commented on PIG-953:
------------------------------------

Response to previous comment:

Dmitriy:  seekNear seems ambiguous, as "near" is a generic concept that does 
not necessarily imply "before or to, but not after" - which is what this method 
is required to do. How about "seekBefore()"?
Pradeep: I had initially thought of naming this method seekBefore(). However 
for the case where the key we are using to seek not being present in the right 
input, the loader can either position at the biggest value before the key OR 
the smallest value after the key (zebra loader for example can only do the 
latter). The name "seekBefore" suggests that implementations should always seek 
before the key in question - hence I chose seekNear.

Dmitriy: it looks like the large diff blocks in MergeSort et al are mostly 
moves of code blocks, not significant code changes, correct?
Pradeep: I am assuming you meant POMergeJoin - yes - its mostly code move to 
DefaultIndexableLoader


Dmitriy: Why does getAscColumns and getSortColumns make a copy of the list? 
Seems like we can save some memory and cpu here.
Pradeep: Findbugs complains about passing internal members as is in getters 
since the caller can then modifiy these internal members - hence the copy. This 
should not be a performance/memory issue since these copies are 1) on small 
structures 2) in front end at compile time and not at runtime.

Dmitriry: For that matter, why not use a map of (String)colName-> 
(Boolean)ascending instead of 2 lists? One structure, plus O(1) lookup.
Pradeep: This suggestion is reasonable - I picked the current implementation 
since its inline with how these things are represented today in LOSort.  Unless 
the user of SortInfo does lookup using sort column names, we won't get O(1). I 
designed this keeping zebra's use case (which is the only use case at this 
point) and the zebra store function would basically needs to know the sort keys 
in order and which of them are asc/dsc. For this they would iterate over our 
datastructure and require that the ordering of the keys match the 
primary/secondary order of the sort keys - hence a list lends itself better for 
that. I debated using a List<Pair<String,Boolean>> but thought I can avoid Pair 
since its a pig implementation class (if tomorrow zebra wants to expose the 
same SortInfo to its users).

Dmitriy: Not sure about the use of super() in the constructor of a class that 
doesn't extend anything but Object. Is there some magic that requires it?
Pradeep: This was unintended - all thanks to eclipse :) - will remove it a next 
iteration of this patch

Dmitriy: In Log2PhysTranslator, why hardcode the Limit operator? There are 
other operators that don't change sort order, such as filter. Perhaps add a 
method to Logical Operators that indicates if they alter sort order of their 
inputs?
Pradeep: Pig only guarantees order with limit following order - for any other 
relational operator following order there are no guarantees. Today it is true 
that filter or a column pruning foreach would also preserve order but this can 
change if needed in the future. There explicit code to ensure order-limit 
combination works by preserving order - there is no such explicit check for 
other operators (keeping it open for change in the future)

Dmitriy: Even with this rewrite, this seems like an odd function. It being as 
odd as it is leads to it not being used safely when you set checkEquality to 
false (just a few lines later)-- if obj1 is null and obj2 is not, the func 
returns true, you try to call a method on obj1, and get an NPE.
Pradeep: The rewrite is more terse and is another option as against the 
explicit if-else I have - more a coding style issue than correctness. The idea 
of having this function was to serve as a helper for use in Classes which need 
to implement equals().  It is cumbersone that every new class's equals has to 
do the equivalent of what you suggest:
{code}
Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2));
{code}

This one method can be used either to only check that both objects are not null 
OR to do that and additionally check equality. Not sure I understand the oddity 
- am I missing something?

Dmitriy: This comment has a typo (and instead of "an"): 
Pradeep: Will fix in next iteration





> Enable merge join in pig to work with loaders and store functions which can 
> internally index sorted data 
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953.patch
>
>
> Currently merge join implementation in pig includes construction of an index 
> on sorted data and use of that index to seek into the "right input" to 
> efficiently perform the join operation. Some loaders (notably the zebra 
> loader) internally implement an index on sorted data and can perform this 
> seek efficiently using their index. So the use of the index needs to be 
> abstracted in such a way that when the loader supports indexing, pig uses it 
> (indirectly through the loader) and does not construct an index. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

Reply via email to