I agree with Aljoscha that we should have further numbers on this
(independently of what we settle on with the comparators).

@Chesnay: do you feel like giving it a try or is the Python mystery eating
up all your time? ;-)


On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[email protected]>
wrote:

> Ok, but that's micro benchmarks. How much time is really spent on
> those comparisons when a real job is executed? Does anyone know? Also,
> the tuple pair comparators never work on binary representation, right?
>
> On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler
> <[email protected]> wrote:
> > i cant recall definitely what the numbers were so I'll just quote myself
> > from the PR:
> >
> > measurements were done using System.nanoTime()
> > time necessary for the comparison
> > Strings consisted of 90 characters
> >
> > difference in the beginning of the string
> > New 4259
> > Old 23431
> >
> > difference in the middle of the string
> > New 13560
> > Old 30757
> >
> > difference at the end of the string
> > New 29385
> > Old 28293
> >
> > for smaller strings (<10 i think) it was roughly 50% faster.
> >
> >
> >
> > On 27.8.2014 22:10, Aljoscha Krettek wrote:
> >>
> >> Ok, then we have to find some other solution for interop between
> >> comparators. What's the performance increase from that pull request?
> >>
> >> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]> wrote:
> >>>
> >>> A lot of comparisons (and that is what makes it fast) can happen on the
> >>> binary data, without turning them into a comparable.
> >>>
> >>> The latest pull request in the String representation for example
> improved
> >>> String-keyed sorting quite a bit:
> >>>
> >>> https://github.com/apache/incubator-flink/pull/4
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[email protected]
> >
> >>> wrote:
> >>>
> >>>> It would work on binary data, for example for tuples it would become:
> >>>>
> >>>> public Comparable extractKeys(DataInputView in) {
> >>>>      Object[] fields = ...
> >>>>      // extract only relevant fields from binary input and save in
> >>>> fields
> >>>>      return Tuple(fields) // something like that
> >>>> }
> >>>>
> >>>> And for normalized keys something similar can be done.
> >>>>
> >>>> Aljoscha
> >>>>
> >>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]>
> wrote:
> >>>>>
> >>>>> The design of the comparators so far was to make them work on the
> >>>>> binary
> >>>>> data. That we need to retain, in my opinion, otherwise there is no
> >>>>> way to get good performance out of working on serialized data.
> >>>>>
> >>>>> I personally think that creating a tuple2 (key/value pair) when using
> >>>>> selector functions is actually good:
> >>>>> The key type (being treated by its dedicated comparator) benefits
> from
> >>>>
> >>>> all
> >>>>>
> >>>>> the optimizations implemented for that type (bin copying, normalized
> >>>>
> >>>> keys,
> >>>>>
> >>>>> ...)
> >>>>> That would be very hard to incorporate into any comparator that just
> >>>>> deserializes some comparable.
> >>>>>
> >>>>> Also, the key extractor can contain sort of heavy magic (such as to
> >>>>> block
> >>>>> keys), whatever a user put in there. If we put that into the
> >>>>> comparator,
> >>>>
> >>>> it
> >>>>>
> >>>>> gets called for
> >>>>> every comparison!
> >>>>>
> >>>>> I do agree, though, that we need to come up with a better interface
> >>>>> that
> >>>>> seamlessly allows working on binary versions and on objects, without
> >>>>> duplicating too much code.
> >>>>>
> >>>>>  From your suggestion, I am not sure I got everything. Could you
> post a
> >>>>> concrete example or code?
> >>>>>
> >>>>> Stephan
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <
> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Guys,
> >>>>>> while porting the Java API to Scala I'm noticing how complicated
> >>>>>> things are because of how our TypeComparators work: 1) There is only
> >>>>>> one type of comparator per TypeInformation which is created by the
> >>>>>> TypeInformation. Therefore, our KeySelectors are not actually
> >>>>>> implemented as comparators but as generated mappers that emit a
> >>>>>> Tuple2, because you wouldn't for example be able to generate a
> >>>>>> SelectorFunctionComparator for a TupleTypeInfo.  (There's also a lot
> >>>>>> of magic going on with wrapping and unwrapping those tuples in
> Reduce,
> >>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there
> >>>>>> is special case code for the combinations that work. This will only
> >>>>>> get worse when we properly introduce POJO types, which should work
> >>>>>> well with tuple comparators and the other comparators.
> >>>>>>
> >>>>>> My proposal is this: No more TypeComparator on a per type basis.
> Just
> >>>>>> a generic comparator and PairComparator that work on Comparable.
> What
> >>>>>> used to be TypeComparators become SelectionExtractors that return a
> >>>>>> Comparable. Make Tuple comparable or add new ComparableTuple.  The
> >>>>>> TupleSelectionExtractor would return a comparable tuple of the
> >>>>>> appropriate length (same for POJOs). For Tuple extractors that
> operate
> >>>>>> on only one field they would immediately return that field, without
> >>>>>> wrapping it in a tuple. This would directly support our existing
> >>>>>> KeySelector functions since the already return Comparable, when
> >>>>>> returning a tuple in a key selector function this would be
> compatible
> >>>>>> with a TupleSelectionExtractor (on the other join side, for
> example).
> >>>>>>
> >>>>>> That's my idea. What do you think? I think the current state is not
> >>>>>> maintainable, so we should do something quickly. :D
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Aljoscha
> >>>>>>
> >
>

Reply via email to