[
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878120#action_12878120
]
Gianmarco De Francisci Morales commented on PIG-1295:
-----------------------------------------------------
I did some more detailed profiling and testing. I ran a slightly modified
version of the performance tests in the patch.
On my machine, the new comparator takes between 8 and 10 seconds. The old one
between 11 and 14.
The performance advantage is not so big, but I found also that most of the time
of the test is not spent comparting.
The profiler shows that most of the time is spent in cloning (probably JUnit
internals?):
{code}
[junit] 84.0% 0 + 1769 java.lang.Object.clone
{code}
For what concerns user code instead, most of the time is spent writing data.
Randomization of the input also takes some time.
The old comparator takes (0.9% + 0.3% | compareTuple + compare) more than 1% of
the time. The new one takes only 0.2% of the time.
{code}
[junit] 2.9% 62 + 0
org.apache.pig.data.DataReaderWriter.writeDatum
[junit] 1.1% 23 + 0 java.io.DataOutputStream.writeInt
[junit] 0.9% 19 + 0 java.util.Random.next
[junit] 0.9% 18 + 0 java.io.DataOutputStream.writeLong
[junit] 0.9% 18 + 0
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparator.compareTuple
[junit] 0.6% 11 + 1
org.apache.pig.data.DataReaderWriter.readDatum
[junit] 0.5% 11 + 0 java.io.DataInputStream.readInt
[junit] 0.5% 11 + 0 org.apache.pig.data.DefaultTuple.readFields
[junit] 0.3% 7 + 0
org.apache.pig.impl.io.PigNullableWritable.write
[junit] 0.3% 7 + 0 org.apache.pig.data.DefaultTuple.<init>
[junit] 0.3% 3 + 3
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparator.compare
[junit] 0.2% 5 + 0 java.util.ArrayList.get
[junit] 0.2% 5 + 0 org.apache.pig.data.DefaultTuple.write
[junit] 0.2% 3 + 2
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparatorNew.compare
{code}
All in all, these tests are probably not much representative, but I also feel
that the speedup we may get with a raw comparator is somewhat limited.
I am open to suggestion on how to modify the performance tests to make them
more representative.
> Binary comparator for secondary sort
> ------------------------------------
>
> Key: PIG-1295
> URL: https://issues.apache.org/jira/browse/PIG-1295
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch
>
>
> When hadoop framework doing the sorting, it will try to use binary version of
> comparator if available. The benefit of binary comparator is we do not need
> to instantiate the object before we compare. We see a ~30% speedup after we
> switch to binary comparator. Currently, Pig use binary comparator in
> following case:
> 1. When semantics of order doesn't matter. For example, in distinct, we need
> to do a sort in order to filter out duplicate values; however, we do not care
> how comparator sort keys. Groupby also share this character. In this case, we
> rely on hadoop's default binary comparator
> 2. Semantics of order matter, but the key is of simple type. In this case, we
> have implementation for simple types, such as integer, long, float,
> chararray, databytearray, string
> However, if the key is a tuple and the sort semantics matters, we do not have
> a binary comparator implementation. This especially matters when we switch to
> use secondary sort. In secondary sort, we convert the inner sort of nested
> foreach into the secondary key and rely on hadoop to sorting on both main key
> and secondary key. The sorting key will become a two items tuple. Since the
> secondary key the sorting key of the nested foreach, so the sorting semantics
> matters. It turns out we do not have binary comparator once we use secondary
> sort, and we see a significant slow down.
> Binary comparator for tuple should be doable once we understand the binary
> structure of the serialized tuple. We can focus on most common use cases
> first, which is "group by" followed by a nested sort. In this case, we will
> use secondary sort. Semantics of the first key does not matter but semantics
> of secondary key matters. We need to identify the boundary of main key and
> secondary key in the binary tuple buffer without instantiate tuple itself.
> Then if the first key equals, we use a binary comparator to compare secondary
> key. Secondary key can also be a complex data type, but for the first step,
> we focus on simple secondary key, which is the most common use case.
> We mark this issue to be a candidate project for "Google summer of code 2010"
> program.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.