[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:
----------------------------

    Attachment: PIG-1295_0.16.patch

Discuss with Alan, we agree to put getRawComparator into TupleFactory. Attach 
PIG-1295_0.16.patch. 

> Binary comparator for secondary sort
> ------------------------------------
>
>                 Key: PIG-1295
>                 URL: https://issues.apache.org/jira/browse/PIG-1295
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Gianmarco De Francisci Morales
>             Fix For: 0.8.0
>
>         Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
> PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
> PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, 
> PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, 
> PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, 
> PIG-1295_0.8.patch, PIG-1295_0.9.patch
>
>
> When hadoop framework doing the sorting, it will try to use binary version of 
> comparator if available. The benefit of binary comparator is we do not need 
> to instantiate the object before we compare. We see a ~30% speedup after we 
> switch to binary comparator. Currently, Pig use binary comparator in 
> following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need 
> to do a sort in order to filter out duplicate values; however, we do not care 
> how comparator sort keys. Groupby also share this character. In this case, we 
> rely on hadoop's default binary comparator
> 2. Semantics of order matter, but the key is of simple type. In this case, we 
> have implementation for simple types, such as integer, long, float, 
> chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have 
> a binary comparator implementation. This especially matters when we switch to 
> use secondary sort. In secondary sort, we convert the inner sort of nested 
> foreach into the secondary key and rely on hadoop to sorting on both main key 
> and secondary key. The sorting key will become a two items tuple. Since the 
> secondary key the sorting key of the nested foreach, so the sorting semantics 
> matters. It turns out we do not have binary comparator once we use secondary 
> sort, and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary 
> structure of the serialized tuple. We can focus on most common use cases 
> first, which is "group by" followed by a nested sort. In this case, we will 
> use secondary sort. Semantics of the first key does not matter but semantics 
> of secondary key matters. We need to identify the boundary of main key and 
> secondary key in the binary tuple buffer without instantiate tuple itself. 
> Then if the first key equals, we use a binary comparator to compare secondary 
> key. Secondary key can also be a complex data type, but for the first step, 
> we focus on simple secondary key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010" 
> program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to