[jira] [Updated] (PIG-4652) [Pig on Tez] Key Comparison is slower than mapreduce

Rohini Palaniswamy (JIRA) Wed, 12 Aug 2015 11:50:51 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rohini Palaniswamy updated PIG-4652:
------------------------------------
    Description: 
Tez uses PigTupleSortComparator on both the map and reduce sides and in 
POShuffleTezLoad.  MapReduce uses PigTupleWritableComparator on the map and 
reduce sides to compare tuples, which is a byte-only comparison and very fast.  
It then uses PigGrouping<DataType>WritableComparator as the grouping comparator 
to group those keys correctly. 

  It is not possible to use a similar method in Tez (PigTupleWritableComparator 
for output and input and PigTupleSortComparator in POShuffleTezLoad) without 
adding APIs in Tez to get the raw bytes of the keys. When we compare multiple 
inputs for the min key in POShuffleTezLoad, their raw bytes need to be compared 
to maintain the same order as on the map side. In MapReduce, there was only a 
single input and the MapReduce framework sorted all the keys together. But in 
Tez, the join inputs are sorted separately and the application only gets the 
deserialized key, not its raw bytes. Tez KeyValuesReader needs APIs to also get 
the bytes of the current key, which can then be used in POShuffleTezLoad for 
min key comparison.
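
The ordering problem described above can be illustrated with a minimal, 
self-contained sketch. This is not Pig or Tez code: the class 
MinKeyOrderSketch, the 4-byte big-endian int serialization, and all method 
names are assumptions made up for the illustration. Each input is sorted by 
raw bytes (as a byte-only sort comparator would do), and the two inputs are 
then merged by repeatedly picking the min key.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MinKeyOrderSketch {
    // Hypothetical key serialization: a 4-byte big-endian int.
    static byte[] ser(int v) { return ByteBuffer.allocate(4).putInt(v).array(); }
    static int deser(byte[] b) { return ByteBuffer.wrap(b).getInt(); }

    // Byte-only comparison: unsigned lexicographic order over the raw bytes.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    // Comparison on the deserialized value (signed int order).
    static final Comparator<byte[]> VALUE =
            (a, b) -> Integer.compare(deser(a), deser(b));

    // One input whose keys were sorted by raw bytes.
    static List<byte[]> sortedInput(int... values) {
        List<byte[]> keys = new ArrayList<>();
        for (int v : values) keys.add(ser(v));
        keys.sort(MinKeyOrderSketch::compareRaw);
        return keys;
    }

    // Merge two sorted inputs by always taking the smaller head key under cmp.
    static List<Integer> merge(List<byte[]> left, List<byte[]> right,
                               Comparator<byte[]> cmp) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            if (cmp.compare(left.get(i), right.get(j)) <= 0) {
                out.add(deser(left.get(i++)));
            } else {
                out.add(deser(right.get(j++)));
            }
        }
        while (i < left.size()) out.add(deser(left.get(i++)));
        while (j < right.size()) out.add(deser(right.get(j++)));
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> left = sortedInput(-1, 2);   // raw byte order: 2, -1
        List<byte[]> right = sortedInput(-1, 1);  // raw byte order: 1, -1

        // Same comparator as the sort: the two -1 keys end up adjacent.
        System.out.println(merge(left, right, MinKeyOrderSketch::compareRaw)); // [1, 2, -1, -1]

        // Different comparator for min key: the two -1 keys are split apart.
        System.out.println(merge(left, right, VALUE)); // [1, -1, 2, -1]
    }
}
```

In this sketch the merge only keeps equal keys together when the min key 
comparison uses the same order as the sort, which is why the raw bytes of the 
current key would be needed on the reading side.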



  was:
Tez uses PigTupleSortComparator on both the map and reduce sides and in 
POShuffleTezLoad.  MapReduce uses PigTupleWritableComparator on the map and 
reduce sides to compare tuples, which is a byte-only comparison and very fast.  
It then uses PigGrouping<DataType>WritableComparator as the grouping comparator 
to group those keys correctly. 

  It is not possible to use a similar method in Tez (PigTupleWritableComparator 
for output and input and PigTupleSortComparator in POShuffleTezLoad) without 
adding APIs in Tez to get the raw bytes of the keys. When we compare multiple 
inputs for the min key in POShuffleTezLoad, their raw bytes need to be compared 
to maintain the same order as on the map side. In MapReduce, there was only a 
single input and the MapReduce framework sorted all the keys together. But in 
Tez, the join inputs are sorted separately and the application only gets the 
deserialized key, not its raw bytes. Tez KeyValuesReader needs APIs to also get 
the bytes of the current key, which can then be used in POShuffleTezLoad for 
min key comparison.

  But most of the slowness of PigTupleSortComparator seems to come from 
inefficient String comparison in BinInterSedesTupleRawComparator, which 
constructs String objects instead of comparing bytes as Text.Comparator does. 

{code}
str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
{code}

Fixing that should bring performance very close to MapReduce, with a 
negligible difference. Following a MapReduce-like model should make it even 
more efficient.
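
As a rough illustration of the difference, the sketch below contrasts 
decoding two Strings per comparison with comparing the UTF-8 bytes in place. 
It is not the Pig patch: the class Utf8CompareSketch and both method names are 
hypothetical, and the byte loop is only similar in spirit to Hadoop's 
WritableComparator.compareBytes. One caveat to keep in mind: unsigned byte 
order on UTF-8 is code point order, while String.compareTo uses UTF-16 code 
unit order, so the two can disagree when supplementary characters are compared 
against characters in U+E000..U+FFFF.

```java
import java.nio.charset.StandardCharsets;

public class Utf8CompareSketch {
    // Decode-then-compare: allocates two String objects on every call.
    static int compareViaString(byte[] b1, int s1, int l1,
                                byte[] b2, int s2, int l2) {
        String str1 = new String(b1, s1, l1, StandardCharsets.UTF_8);
        String str2 = new String(b2, s2, l2, StandardCharsets.UTF_8);
        return str1.compareTo(str2);
    }

    // Allocation-free: unsigned lexicographic comparison of the encoded bytes.
    static int compareBytes(byte[] b1, int s1, int l1,
                            byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) return x - y;
        }
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] a = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] b = "apricot".getBytes(StandardCharsets.UTF_8);
        // Both approaches agree on the sign of the result for these inputs.
        System.out.println(Integer.signum(compareViaString(a, 0, a.length, b, 0, b.length)));
        System.out.println(Integer.signum(compareBytes(a, 0, a.length, b, 0, b.length)));
    }
}
```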



        Summary: [Pig on Tez] Key Comparison is slower than mapreduce  (was: 
[Pig on Tez] Group by on multiple keys is slower than mapreduce)

> [Pig on Tez] Key Comparison is slower than mapreduce
> ----------------------------------------------------
>
>                 Key: PIG-4652
>                 URL: https://issues.apache.org/jira/browse/PIG-4652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>
> Tez uses PigTupleSortComparator on both the map and reduce sides and in 
> POShuffleTezLoad.  MapReduce uses PigTupleWritableComparator on the map and 
> reduce sides to compare tuples, which is a byte-only comparison and very 
> fast.  It then uses PigGrouping<DataType>WritableComparator as the grouping 
> comparator to group those keys correctly. 
>   It is not possible to use a similar method in Tez 
> (PigTupleWritableComparator for output and input and PigTupleSortComparator 
> in POShuffleTezLoad) without adding APIs in Tez to get the raw bytes of the 
> keys. When we compare multiple inputs for the min key in POShuffleTezLoad, 
> their raw bytes need to be compared to maintain the same order as on the map 
> side. In MapReduce, there was only a single input and the MapReduce framework 
> sorted all the keys together. But in Tez, the join inputs are sorted 
> separately and the application only gets the deserialized key, not its raw 
> bytes. Tez KeyValuesReader needs APIs to also get the bytes of the current 
> key, which can then be used in POShuffleTezLoad for min key comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
