Re: Changing how TypeComparators Work

Stephan Ewen Thu, 28 Aug 2014 02:38:03 -0700

I think that the microbenchmarks are a very good start.

The serialization/deserialization is ultimately the Achilles Heel of our
approach. So that one, in itself, must be as fast as possible.



On Thu, Aug 28, 2014 at 10:22 AM, Chesnay Schepler <
[email protected]> wrote:

> @aljoscha : yes, the tuple comparators don't work on binary data.
>
>
> On 28.8.2014 0:10, Ufuk Celebi wrote:
>
>> I agree with Aljoscha that we should have further numbers on this
>> (independently of what we settle on with the comparators).
>>
>> @Chesnay: do you feel like giving it a try or is the Python mystery eating
>> up all your time? ;-)
>>
>>
>> On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[email protected]>
>> wrote:
>>
>>  Ok, but that's micro benchmarks. How much time is really spent on
>>> those comparisons when a real job is executed? Does anyone know? Also,
>>> the tuple pair comparators never work on binary representation, right?
>>>
>>> On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler
>>> <[email protected]> wrote:
>>>
>>>> i cant recall definitely what the numbers were so I'll just quote myself
>>>> from the PR:
>>>>
>>>> measurements were done using System.nanoTime()
>>>> time necessary for the comparison
>>>> Strings consisted of 90 characters
>>>>
>>>> difference in the beginning of the string
>>>> New 4259
>>>> Old 23431
>>>>
>>>> difference in the middle of the string
>>>> New 13560
>>>> Old 30757
>>>>
>>>> difference at the end of the string
>>>> New 29385
>>>> Old 28293
>>>>
>>>> for smaller strings (<10 i think) it was roughly 50% faster.
>>>>
>>>>
>>>>
>>>> On 27.8.2014 22:10, Aljoscha Krettek wrote:
>>>>
>>>>> Ok, then we have to find some other solution for interop between
>>>>> comparators. What's the performance increase from that pull request?
>>>>>
>>>>> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> A lot of comparisons (and that is what makes it fast) can happen on
>>>>>> the
>>>>>> binary data, without turning them into a comparable.
>>>>>>
>>>>>> The latest pull request in the String representation for example
>>>>>>
>>>>> improved
>>>
>>>> String-keyed sorting quite a bit:
>>>>>>
>>>>>> https://github.com/apache/incubator-flink/pull/4
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <
>>>>>> [email protected]
>>>>>> wrote:
>>>>>>
>>>>>>  It would work on binary data, for example for tuples it would become:
>>>>>>>
>>>>>>> public Comparable extractKeys(DataInputView in) {
>>>>>>>       Object[] fields = ...
>>>>>>>       // extract only relevant fields from binary input and save in
>>>>>>> fields
>>>>>>>       return Tuple(fields) // something like that
>>>>>>> }
>>>>>>>
>>>>>>> And for normalized keys something similar can be done.
>>>>>>>
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]>
>>>>>>>
>>>>>> wrote:
>>>
>>>> The design of the comparators so far was to make them work on the
>>>>>>>> binary
>>>>>>>> data. That we need to retain, in my opinion, otherwise there is no
>>>>>>>> way to get good performance out of working on serialized data.
>>>>>>>>
>>>>>>>> I personally think that creating a tuple2 (key/value pair) when
>>>>>>>> using
>>>>>>>> selector functions is actually good:
>>>>>>>> The key type (being treated by its dedicated comparator) benefits
>>>>>>>>
>>>>>>> from
>>>
>>>> all
>>>>>>>
>>>>>>>> the optimizations implemented for that type (bin copying, normalized
>>>>>>>>
>>>>>>> keys,
>>>>>>>
>>>>>>>> ...)
>>>>>>>> That would be very hard to incorporate into any comparator that just
>>>>>>>> deserializes some comparable.
>>>>>>>>
>>>>>>>> Also, the key extractor can contain sort of heavy magic (such as to
>>>>>>>> block
>>>>>>>> keys), whatever a user put in there. If we put that into the
>>>>>>>> comparator,
>>>>>>>>
>>>>>>> it
>>>>>>>
>>>>>>>> gets called for
>>>>>>>> every comparison!
>>>>>>>>
>>>>>>>> I do agree, though, that we need to come up with a better interface
>>>>>>>> that
>>>>>>>> seamlessly allows working on binary versions and on objects, without
>>>>>>>> duplicating too much code.
>>>>>>>>
>>>>>>>>   From your suggestion, I am not sure I got everything. Could you
>>>>>>>>
>>>>>>> post a
>>>
>>>> concrete example or code?
>>>>>>>>
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <
>>>>>>>>
>>>>>>> [email protected]>
>>>
>>>> wrote:
>>>>>>>>
>>>>>>>>  Hi Guys,
>>>>>>>>> while porting the Java API to Scala I'm noticing how complicated
>>>>>>>>> things are because of how our TypeComparators work: 1) There is
>>>>>>>>> only
>>>>>>>>> one type of comparator per TypeInformation which is created by the
>>>>>>>>> TypeInformation. Therefore, our KeySelectors are not actually
>>>>>>>>> implemented as comparators but as generated mappers that emit a
>>>>>>>>> Tuple2, because you wouldn't for example be able to generate a
>>>>>>>>> SelectorFunctionComparator for a TupleTypeInfo.  (There's also a
>>>>>>>>> lot
>>>>>>>>> of magic going on with wrapping and unwrapping those tuples in
>>>>>>>>>
>>>>>>>> Reduce,
>>>
>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there
>>>>>>>>> is special case code for the combinations that work. This will only
>>>>>>>>> get worse when we properly introduce POJO types, which should work
>>>>>>>>> well with tuple comparators and the other comparators.
>>>>>>>>>
>>>>>>>>> My proposal is this: No more TypeComparator on a per type basis.
>>>>>>>>>
>>>>>>>> Just
>>>
>>>> a generic comparator and PairComparator that work on Comparable.
>>>>>>>>>
>>>>>>>> What
>>>
>>>> used to be TypeComparators become SelectionExtractors that return a
>>>>>>>>> Comparable. Make Tuple comparable or add new ComparableTuple.  The
>>>>>>>>> TupleSelectionExtractor would return a comparable tuple of the
>>>>>>>>> appropriate length (same for POJOs). For Tuple extractors that
>>>>>>>>>
>>>>>>>> operate
>>>
>>>> on only one field they would immediately return that field, without
>>>>>>>>> wrapping it in a tuple. This would directly support our existing
>>>>>>>>> KeySelector functions since the already return Comparable, when
>>>>>>>>> returning a tuple in a key selector function this would be
>>>>>>>>>
>>>>>>>> compatible
>>>
>>>> with a TupleSelectionExtractor (on the other join side, for
>>>>>>>>>
>>>>>>>> example).
>>>
>>>> That's my idea. What do you think? I think the current state is not
>>>>>>>>> maintainable, so we should do something quickly. :D
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Aljoscha
>>>>>>>>>
>>>>>>>>>
>

Re: Changing how TypeComparators Work

Reply via email to