Re: Changing how TypeComparators Work

Aljoscha Krettek Wed, 27 Aug 2014 13:28:20 -0700

Ok, but that's micro benchmarks. How much time is really spent on
those comparisons when a real job is executed? Does anyone know? Also,
the tuple pair comparators never work on binary representation, right?


On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler
<[email protected]> wrote:
> i cant recall definitely what the numbers were so I'll just quote myself
> from the PR:
>
> measurements were done using System.nanoTime()
> time necessary for the comparison
> Strings consisted of 90 characters
>
> difference in the beginning of the string
> New 4259
> Old 23431
>
> difference in the middle of the string
> New 13560
> Old 30757
>
> difference at the end of the string
> New 29385
> Old 28293
>
> for smaller strings (<10 i think) it was roughly 50% faster.
>
>
>
> On 27.8.2014 22:10, Aljoscha Krettek wrote:
>>
>> Ok, then we have to find some other solution for interop between
>> comparators. What's the performance increase from that pull request?
>>
>> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]> wrote:
>>>
>>> A lot of comparisons (and that is what makes it fast) can happen on the
>>> binary data, without turning them into a comparable.
>>>
>>> The latest pull request in the String representation for example improved
>>> String-keyed sorting quite a bit:
>>>
>>> https://github.com/apache/incubator-flink/pull/4
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[email protected]>
>>> wrote:
>>>
>>>> It would work on binary data, for example for tuples it would become:
>>>>
>>>> public Comparable extractKeys(DataInputView in) {
>>>>      Object[] fields = ...
>>>>      // extract only relevant fields from binary input and save in
>>>> fields
>>>>      return Tuple(fields) // something like that
>>>> }
>>>>
>>>> And for normalized keys something similar can be done.
>>>>
>>>> Aljoscha
>>>>
>>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]> wrote:
>>>>>
>>>>> The design of the comparators so far was to make them work on the
>>>>> binary
>>>>> data. That we need to retain, in my opinion, otherwise there is no
>>>>> way to get good performance out of working on serialized data.
>>>>>
>>>>> I personally think that creating a tuple2 (key/value pair) when using
>>>>> selector functions is actually good:
>>>>> The key type (being treated by its dedicated comparator) benefits from
>>>>
>>>> all
>>>>>
>>>>> the optimizations implemented for that type (bin copying, normalized
>>>>
>>>> keys,
>>>>>
>>>>> ...)
>>>>> That would be very hard to incorporate into any comparator that just
>>>>> deserializes some comparable.
>>>>>
>>>>> Also, the key extractor can contain sort of heavy magic (such as to
>>>>> block
>>>>> keys), whatever a user put in there. If we put that into the
>>>>> comparator,
>>>>
>>>> it
>>>>>
>>>>> gets called for
>>>>> every comparison!
>>>>>
>>>>> I do agree, though, that we need to come up with a better interface
>>>>> that
>>>>> seamlessly allows working on binary versions and on objects, without
>>>>> duplicating too much code.
>>>>>
>>>>>  From your suggestion, I am not sure I got everything. Could you post a
>>>>> concrete example or code?
>>>>>
>>>>> Stephan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Guys,
>>>>>> while porting the Java API to Scala I'm noticing how complicated
>>>>>> things are because of how our TypeComparators work: 1) There is only
>>>>>> one type of comparator per TypeInformation which is created by the
>>>>>> TypeInformation. Therefore, our KeySelectors are not actually
>>>>>> implemented as comparators but as generated mappers that emit a
>>>>>> Tuple2, because you wouldn't for example be able to generate a
>>>>>> SelectorFunctionComparator for a TupleTypeInfo.  (There's also a lot
>>>>>> of magic going on with wrapping and unwrapping those tuples in Reduce,
>>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there
>>>>>> is special case code for the combinations that work. This will only
>>>>>> get worse when we properly introduce POJO types, which should work
>>>>>> well with tuple comparators and the other comparators.
>>>>>>
>>>>>> My proposal is this: No more TypeComparator on a per type basis. Just
>>>>>> a generic comparator and PairComparator that work on Comparable. What
>>>>>> used to be TypeComparators become SelectionExtractors that return a
>>>>>> Comparable. Make Tuple comparable or add new ComparableTuple.  The
>>>>>> TupleSelectionExtractor would return a comparable tuple of the
>>>>>> appropriate length (same for POJOs). For Tuple extractors that operate
>>>>>> on only one field they would immediately return that field, without
>>>>>> wrapping it in a tuple. This would directly support our existing
>>>>>> KeySelector functions since the already return Comparable, when
>>>>>> returning a tuple in a key selector function this would be compatible
>>>>>> with a TupleSelectionExtractor (on the other join side, for example).
>>>>>>
>>>>>> That's my idea. What do you think? I think the current state is not
>>>>>> maintainable, so we should do something quickly. :D
>>>>>>
>>>>>> Cheers,
>>>>>> Aljoscha
>>>>>>
>

Re: Changing how TypeComparators Work

Reply via email to