i cant recall definitely what the numbers were so I'll just quote myself from the PR:

measurements were done using System.nanoTime()
time necessary for the comparison
Strings consisted of 90 characters

difference in the beginning of the string
New 4259
Old 23431

difference in the middle of the string
New 13560
Old 30757

difference at the end of the string
New 29385
Old 28293

for smaller strings (<10 i think) it was roughly 50% faster.


On 27.8.2014 22:10, Aljoscha Krettek wrote:
Ok, then we have to find some other solution for interop between
comparators. What's the performance increase from that pull request?

On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[email protected]> wrote:
A lot of comparisons (and that is what makes it fast) can happen on the
binary data, without turning them into a comparable.

The latest pull request in the String representation for example improved
String-keyed sorting quite a bit:

https://github.com/apache/incubator-flink/pull/4





On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[email protected]>
wrote:

It would work on binary data, for example for tuples it would become:

public Comparable extractKeys(DataInputView in) {
     Object[] fields = ...
     // extract only relevant fields from binary input and save in fields
     return Tuple(fields) // something like that
}

And for normalized keys something similar can be done.

Aljoscha

On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[email protected]> wrote:
The design of the comparators so far was to make them work on the binary
data. That we need to retain, in my opinion, otherwise there is no
way to get good performance out of working on serialized data.

I personally think that creating a tuple2 (key/value pair) when using
selector functions is actually good:
The key type (being treated by its dedicated comparator) benefits from
all
the optimizations implemented for that type (bin copying, normalized
keys,
...)
That would be very hard to incorporate into any comparator that just
deserializes some comparable.

Also, the key extractor can contain sort of heavy magic (such as to block
keys), whatever a user put in there. If we put that into the comparator,
it
gets called for
every comparison!

I do agree, though, that we need to come up with a better interface that
seamlessly allows working on binary versions and on objects, without
duplicating too much code.

 From your suggestion, I am not sure I got everything. Could you post a
concrete example or code?

Stephan




On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[email protected]>
wrote:

Hi Guys,
while porting the Java API to Scala I'm noticing how complicated
things are because of how our TypeComparators work: 1) There is only
one type of comparator per TypeInformation which is created by the
TypeInformation. Therefore, our KeySelectors are not actually
implemented as comparators but as generated mappers that emit a
Tuple2, because you wouldn't for example be able to generate a
SelectorFunctionComparator for a TupleTypeInfo.  (There's also a lot
of magic going on with wrapping and unwrapping those tuples in Reduce,
Join, and CoGroup.) 2) Comparators cannot really interoperate, there
is special case code for the combinations that work. This will only
get worse when we properly introduce POJO types, which should work
well with tuple comparators and the other comparators.

My proposal is this: No more TypeComparator on a per type basis. Just
a generic comparator and PairComparator that work on Comparable. What
used to be TypeComparators become SelectionExtractors that return a
Comparable. Make Tuple comparable or add new ComparableTuple.  The
TupleSelectionExtractor would return a comparable tuple of the
appropriate length (same for POJOs). For Tuple extractors that operate
on only one field they would immediately return that field, without
wrapping it in a tuple. This would directly support our existing
KeySelector functions since the already return Comparable, when
returning a tuple in a key selector function this would be compatible
with a TupleSelectionExtractor (on the other join side, for example).

That's my idea. What do you think? I think the current state is not
maintainable, so we should do something quickly. :D

Cheers,
Aljoscha


Reply via email to