I agree with Aljoscha that we should have further numbers on this (independently of what we settle on with the comparators).
@Chesnay: do you feel like giving it a try or is the Python mystery eating up all your time? ;-) On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[email protected]> wrote: > Ok, but that's micro benchmarks. How much time is really spent on > those comparisons when a real job is executed? Does anyone know? Also, > the tuple pair comparators never work on binary representation, right? > > On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler > <[email protected]> wrote: > > i cant recall definitely what the numbers were so I'll just quote myself > > from the PR: > > > > measurements were done using System.nanoTime() > > time necessary for the comparison > > Strings consisted of 90 characters > > > > difference in the beginning of the string > > New 4259 > > Old 23431 > > > > difference in the middle of the string > > New 13560 > > Old 30757 > > > > difference at the end of the string > > New 29385 > > Old 28293 > > > > for smaller strings (<10 i think) it was roughly 50% faster. > > > > > > > > On 27.8.2014 22:10, Aljoscha Krettek wrote: > >> > >> Ok, then we have to find some other solution for interop between > >> comparators. What's the performance increase from that pull request? > >> > >> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]> wrote: > >>> > >>> A lot of comparisons (and that is what makes it fast) can happen on the > >>> binary data, without turning them into a comparable. > >>> > >>> The latest pull request in the String representation for example > improved > >>> String-keyed sorting quite a bit: > >>> > >>> https://github.com/apache/incubator-flink/pull/4 > >>> > >>> > >>> > >>> > >>> > >>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[email protected] > > > >>> wrote: > >>> > >>>> It would work on binary data, for example for tuples it would become: > >>>> > >>>> public Comparable extractKeys(DataInputView in) { > >>>> Object[] fields = ... > >>>> // extract only relevant fields from binary input and save in > >>>> fields > >>>> return Tuple(fields) // something like that > >>>> } > >>>> > >>>> And for normalized keys something similar can be done. > >>>> > >>>> Aljoscha > >>>> > >>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]> > wrote: > >>>>> > >>>>> The design of the comparators so far was to make them work on the > >>>>> binary > >>>>> data. That we need to retain, in my opinion, otherwise there is no > >>>>> way to get good performance out of working on serialized data. > >>>>> > >>>>> I personally think that creating a tuple2 (key/value pair) when using > >>>>> selector functions is actually good: > >>>>> The key type (being treated by its dedicated comparator) benefits > from > >>>> > >>>> all > >>>>> > >>>>> the optimizations implemented for that type (bin copying, normalized > >>>> > >>>> keys, > >>>>> > >>>>> ...) > >>>>> That would be very hard to incorporate into any comparator that just > >>>>> deserializes some comparable. > >>>>> > >>>>> Also, the key extractor can contain sort of heavy magic (such as to > >>>>> block > >>>>> keys), whatever a user put in there. If we put that into the > >>>>> comparator, > >>>> > >>>> it > >>>>> > >>>>> gets called for > >>>>> every comparison! > >>>>> > >>>>> I do agree, though, that we need to come up with a better interface > >>>>> that > >>>>> seamlessly allows working on binary versions and on objects, without > >>>>> duplicating too much code. > >>>>> > >>>>> From your suggestion, I am not sure I got everything. Could you > post a > >>>>> concrete example or code? > >>>>> > >>>>> Stephan > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek < > [email protected]> > >>>>> wrote: > >>>>> > >>>>>> Hi Guys, > >>>>>> while porting the Java API to Scala I'm noticing how complicated > >>>>>> things are because of how our TypeComparators work: 1) There is only > >>>>>> one type of comparator per TypeInformation which is created by the > >>>>>> TypeInformation. Therefore, our KeySelectors are not actually > >>>>>> implemented as comparators but as generated mappers that emit a > >>>>>> Tuple2, because you wouldn't for example be able to generate a > >>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot > >>>>>> of magic going on with wrapping and unwrapping those tuples in > Reduce, > >>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there > >>>>>> is special case code for the combinations that work. This will only > >>>>>> get worse when we properly introduce POJO types, which should work > >>>>>> well with tuple comparators and the other comparators. > >>>>>> > >>>>>> My proposal is this: No more TypeComparator on a per type basis. > Just > >>>>>> a generic comparator and PairComparator that work on Comparable. > What > >>>>>> used to be TypeComparators become SelectionExtractors that return a > >>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The > >>>>>> TupleSelectionExtractor would return a comparable tuple of the > >>>>>> appropriate length (same for POJOs). For Tuple extractors that > operate > >>>>>> on only one field they would immediately return that field, without > >>>>>> wrapping it in a tuple. This would directly support our existing > >>>>>> KeySelector functions since the already return Comparable, when > >>>>>> returning a tuple in a key selector function this would be > compatible > >>>>>> with a TupleSelectionExtractor (on the other join side, for > example). > >>>>>> > >>>>>> That's my idea. What do you think? I think the current state is not > >>>>>> maintainable, so we should do something quickly. :D > >>>>>> > >>>>>> Cheers, > >>>>>> Aljoscha > >>>>>> > > >
