A lot of comparisons (and that is what makes it fast) can happen on the binary data, without turning them into a comparable.
The latest pull request in the String representation for example improved String-keyed sorting quite a bit: https://github.com/apache/incubator-flink/pull/4 On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[email protected]> wrote: > It would work on binary data, for example for tuples it would become: > > public Comparable extractKeys(DataInputView in) { > Object[] fields = ... > // extract only relevant fields from binary input and save in fields > return Tuple(fields) // something like that > } > > And for normalized keys something similar can be done. > > Aljoscha > > On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]> wrote: > > The design of the comparators so far was to make them work on the binary > > data. That we need to retain, in my opinion, otherwise there is no > > way to get good performance out of working on serialized data. > > > > I personally think that creating a tuple2 (key/value pair) when using > > selector functions is actually good: > > The key type (being treated by its dedicated comparator) benefits from > all > > the optimizations implemented for that type (bin copying, normalized > keys, > > ...) > > That would be very hard to incorporate into any comparator that just > > deserializes some comparable. > > > > Also, the key extractor can contain sort of heavy magic (such as to block > > keys), whatever a user put in there. If we put that into the comparator, > it > > gets called for > > every comparison! > > > > I do agree, though, that we need to come up with a better interface that > > seamlessly allows working on binary versions and on objects, without > > duplicating too much code. > > > > From your suggestion, I am not sure I got everything. Could you post a > > concrete example or code? > > > > Stephan > > > > > > > > > > On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[email protected]> > > wrote: > > > >> Hi Guys, > >> while porting the Java API to Scala I'm noticing how complicated > >> things are because of how our TypeComparators work: 1) There is only > >> one type of comparator per TypeInformation which is created by the > >> TypeInformation. Therefore, our KeySelectors are not actually > >> implemented as comparators but as generated mappers that emit a > >> Tuple2, because you wouldn't for example be able to generate a > >> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot > >> of magic going on with wrapping and unwrapping those tuples in Reduce, > >> Join, and CoGroup.) 2) Comparators cannot really interoperate, there > >> is special case code for the combinations that work. This will only > >> get worse when we properly introduce POJO types, which should work > >> well with tuple comparators and the other comparators. > >> > >> My proposal is this: No more TypeComparator on a per type basis. Just > >> a generic comparator and PairComparator that work on Comparable. What > >> used to be TypeComparators become SelectionExtractors that return a > >> Comparable. Make Tuple comparable or add new ComparableTuple. The > >> TupleSelectionExtractor would return a comparable tuple of the > >> appropriate length (same for POJOs). For Tuple extractors that operate > >> on only one field they would immediately return that field, without > >> wrapping it in a tuple. This would directly support our existing > >> KeySelector functions since the already return Comparable, when > >> returning a tuple in a key selector function this would be compatible > >> with a TupleSelectionExtractor (on the other join side, for example). > >> > >> That's my idea. What do you think? I think the current state is not > >> maintainable, so we should do something quickly. :D > >> > >> Cheers, > >> Aljoscha > >> >
