I think that the microbenchmarks are a very good start. The serialization/deserialization is ultimately the Achilles Heel of our approach. So that one, in itself, must be as fast as possible.
On Thu, Aug 28, 2014 at 10:22 AM, Chesnay Schepler < [email protected]> wrote: > @aljoscha : yes, the tuple comparators don't work on binary data. > > > On 28.8.2014 0:10, Ufuk Celebi wrote: > >> I agree with Aljoscha that we should have further numbers on this >> (independently of what we settle on with the comparators). >> >> @Chesnay: do you feel like giving it a try or is the Python mystery eating >> up all your time? ;-) >> >> >> On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[email protected]> >> wrote: >> >> Ok, but that's micro benchmarks. How much time is really spent on >>> those comparisons when a real job is executed? Does anyone know? Also, >>> the tuple pair comparators never work on binary representation, right? >>> >>> On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler >>> <[email protected]> wrote: >>> >>>> i cant recall definitely what the numbers were so I'll just quote myself >>>> from the PR: >>>> >>>> measurements were done using System.nanoTime() >>>> time necessary for the comparison >>>> Strings consisted of 90 characters >>>> >>>> difference in the beginning of the string >>>> New 4259 >>>> Old 23431 >>>> >>>> difference in the middle of the string >>>> New 13560 >>>> Old 30757 >>>> >>>> difference at the end of the string >>>> New 29385 >>>> Old 28293 >>>> >>>> for smaller strings (<10 i think) it was roughly 50% faster. >>>> >>>> >>>> >>>> On 27.8.2014 22:10, Aljoscha Krettek wrote: >>>> >>>>> Ok, then we have to find some other solution for interop between >>>>> comparators. What's the performance increase from that pull request? >>>>> >>>>> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]> >>>>> wrote: >>>>> >>>>>> A lot of comparisons (and that is what makes it fast) can happen on >>>>>> the >>>>>> binary data, without turning them into a comparable. >>>>>> >>>>>> The latest pull request in the String representation for example >>>>>> >>>>> improved >>> >>>> String-keyed sorting quite a bit: >>>>>> >>>>>> https://github.com/apache/incubator-flink/pull/4 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek < >>>>>> [email protected] >>>>>> wrote: >>>>>> >>>>>> It would work on binary data, for example for tuples it would become: >>>>>>> >>>>>>> public Comparable extractKeys(DataInputView in) { >>>>>>> Object[] fields = ... >>>>>>> // extract only relevant fields from binary input and save in >>>>>>> fields >>>>>>> return Tuple(fields) // something like that >>>>>>> } >>>>>>> >>>>>>> And for normalized keys something similar can be done. >>>>>>> >>>>>>> Aljoscha >>>>>>> >>>>>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]> >>>>>>> >>>>>> wrote: >>> >>>> The design of the comparators so far was to make them work on the >>>>>>>> binary >>>>>>>> data. That we need to retain, in my opinion, otherwise there is no >>>>>>>> way to get good performance out of working on serialized data. >>>>>>>> >>>>>>>> I personally think that creating a tuple2 (key/value pair) when >>>>>>>> using >>>>>>>> selector functions is actually good: >>>>>>>> The key type (being treated by its dedicated comparator) benefits >>>>>>>> >>>>>>> from >>> >>>> all >>>>>>> >>>>>>>> the optimizations implemented for that type (bin copying, normalized >>>>>>>> >>>>>>> keys, >>>>>>> >>>>>>>> ...) >>>>>>>> That would be very hard to incorporate into any comparator that just >>>>>>>> deserializes some comparable. >>>>>>>> >>>>>>>> Also, the key extractor can contain sort of heavy magic (such as to >>>>>>>> block >>>>>>>> keys), whatever a user put in there. If we put that into the >>>>>>>> comparator, >>>>>>>> >>>>>>> it >>>>>>> >>>>>>>> gets called for >>>>>>>> every comparison! >>>>>>>> >>>>>>>> I do agree, though, that we need to come up with a better interface >>>>>>>> that >>>>>>>> seamlessly allows working on binary versions and on objects, without >>>>>>>> duplicating too much code. >>>>>>>> >>>>>>>> From your suggestion, I am not sure I got everything. Could you >>>>>>>> >>>>>>> post a >>> >>>> concrete example or code? >>>>>>>> >>>>>>>> Stephan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek < >>>>>>>> >>>>>>> [email protected]> >>> >>>> wrote: >>>>>>>> >>>>>>>> Hi Guys, >>>>>>>>> while porting the Java API to Scala I'm noticing how complicated >>>>>>>>> things are because of how our TypeComparators work: 1) There is >>>>>>>>> only >>>>>>>>> one type of comparator per TypeInformation which is created by the >>>>>>>>> TypeInformation. Therefore, our KeySelectors are not actually >>>>>>>>> implemented as comparators but as generated mappers that emit a >>>>>>>>> Tuple2, because you wouldn't for example be able to generate a >>>>>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a >>>>>>>>> lot >>>>>>>>> of magic going on with wrapping and unwrapping those tuples in >>>>>>>>> >>>>>>>> Reduce, >>> >>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >>>>>>>>> is special case code for the combinations that work. This will only >>>>>>>>> get worse when we properly introduce POJO types, which should work >>>>>>>>> well with tuple comparators and the other comparators. >>>>>>>>> >>>>>>>>> My proposal is this: No more TypeComparator on a per type basis. >>>>>>>>> >>>>>>>> Just >>> >>>> a generic comparator and PairComparator that work on Comparable. >>>>>>>>> >>>>>>>> What >>> >>>> used to be TypeComparators become SelectionExtractors that return a >>>>>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The >>>>>>>>> TupleSelectionExtractor would return a comparable tuple of the >>>>>>>>> appropriate length (same for POJOs). For Tuple extractors that >>>>>>>>> >>>>>>>> operate >>> >>>> on only one field they would immediately return that field, without >>>>>>>>> wrapping it in a tuple. This would directly support our existing >>>>>>>>> KeySelector functions since the already return Comparable, when >>>>>>>>> returning a tuple in a key selector function this would be >>>>>>>>> >>>>>>>> compatible >>> >>>> with a TupleSelectionExtractor (on the other join side, for >>>>>>>>> >>>>>>>> example). >>> >>>> That's my idea. What do you think? I think the current state is not >>>>>>>>> maintainable, so we should do something quickly. :D >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Aljoscha >>>>>>>>> >>>>>>>>> >
