Ok, but that's micro benchmarks. How much time is really spent on those comparisons when a real job is executed? Does anyone know? Also, the tuple pair comparators never work on binary representation, right?
On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler <[email protected]> wrote: > i cant recall definitely what the numbers were so I'll just quote myself > from the PR: > > measurements were done using System.nanoTime() > time necessary for the comparison > Strings consisted of 90 characters > > difference in the beginning of the string > New 4259 > Old 23431 > > difference in the middle of the string > New 13560 > Old 30757 > > difference at the end of the string > New 29385 > Old 28293 > > for smaller strings (<10 i think) it was roughly 50% faster. > > > > On 27.8.2014 22:10, Aljoscha Krettek wrote: >> >> Ok, then we have to find some other solution for interop between >> comparators. What's the performance increase from that pull request? >> >> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]> wrote: >>> >>> A lot of comparisons (and that is what makes it fast) can happen on the >>> binary data, without turning them into a comparable. >>> >>> The latest pull request in the String representation for example improved >>> String-keyed sorting quite a bit: >>> >>> https://github.com/apache/incubator-flink/pull/4 >>> >>> >>> >>> >>> >>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[email protected]> >>> wrote: >>> >>>> It would work on binary data, for example for tuples it would become: >>>> >>>> public Comparable extractKeys(DataInputView in) { >>>> Object[] fields = ... >>>> // extract only relevant fields from binary input and save in >>>> fields >>>> return Tuple(fields) // something like that >>>> } >>>> >>>> And for normalized keys something similar can be done. >>>> >>>> Aljoscha >>>> >>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]> wrote: >>>>> >>>>> The design of the comparators so far was to make them work on the >>>>> binary >>>>> data. That we need to retain, in my opinion, otherwise there is no >>>>> way to get good performance out of working on serialized data. >>>>> >>>>> I personally think that creating a tuple2 (key/value pair) when using >>>>> selector functions is actually good: >>>>> The key type (being treated by its dedicated comparator) benefits from >>>> >>>> all >>>>> >>>>> the optimizations implemented for that type (bin copying, normalized >>>> >>>> keys, >>>>> >>>>> ...) >>>>> That would be very hard to incorporate into any comparator that just >>>>> deserializes some comparable. >>>>> >>>>> Also, the key extractor can contain sort of heavy magic (such as to >>>>> block >>>>> keys), whatever a user put in there. If we put that into the >>>>> comparator, >>>> >>>> it >>>>> >>>>> gets called for >>>>> every comparison! >>>>> >>>>> I do agree, though, that we need to come up with a better interface >>>>> that >>>>> seamlessly allows working on binary versions and on objects, without >>>>> duplicating too much code. >>>>> >>>>> From your suggestion, I am not sure I got everything. Could you post a >>>>> concrete example or code? >>>>> >>>>> Stephan >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi Guys, >>>>>> while porting the Java API to Scala I'm noticing how complicated >>>>>> things are because of how our TypeComparators work: 1) There is only >>>>>> one type of comparator per TypeInformation which is created by the >>>>>> TypeInformation. Therefore, our KeySelectors are not actually >>>>>> implemented as comparators but as generated mappers that emit a >>>>>> Tuple2, because you wouldn't for example be able to generate a >>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot >>>>>> of magic going on with wrapping and unwrapping those tuples in Reduce, >>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >>>>>> is special case code for the combinations that work. This will only >>>>>> get worse when we properly introduce POJO types, which should work >>>>>> well with tuple comparators and the other comparators. >>>>>> >>>>>> My proposal is this: No more TypeComparator on a per type basis. Just >>>>>> a generic comparator and PairComparator that work on Comparable. What >>>>>> used to be TypeComparators become SelectionExtractors that return a >>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The >>>>>> TupleSelectionExtractor would return a comparable tuple of the >>>>>> appropriate length (same for POJOs). For Tuple extractors that operate >>>>>> on only one field they would immediately return that field, without >>>>>> wrapping it in a tuple. This would directly support our existing >>>>>> KeySelector functions since the already return Comparable, when >>>>>> returning a tuple in a key selector function this would be compatible >>>>>> with a TupleSelectionExtractor (on the other join side, for example). >>>>>> >>>>>> That's my idea. What do you think? I think the current state is not >>>>>> maintainable, so we should do something quickly. :D >>>>>> >>>>>> Cheers, >>>>>> Aljoscha >>>>>> >
