[
https://issues.apache.org/jira/browse/FLINK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabor Gevay resolved FLINK-4867.
--------------------------------
Resolution: Done
I'm closing this, as [~heytitle] did the performance investigation, and
concluded that code generation is worthwhile to implement.
The Jira for actually implementing this is here:
https://issues.apache.org/jira/browse/FLINK-5734
> Investigate code generation for improving sort performance
> ----------------------------------------------------------
>
> Key: FLINK-4867
> URL: https://issues.apache.org/jira/browse/FLINK-4867
> Project: Flink
> Issue Type: Sub-task
> Components: Local Runtime
> Reporter: Gabor Gevay
> Assignee: Gabor Gevay
> Priority: Minor
> Labels: performance
> Attachments: Evaluation of Code Generation Approach for improving
> Flinkās NormalizedKeySorter.pdf
>
>
> This issue is for investigating whether code generation could speed up
> sorting. We should make some performance measurements on hand-written code
> that is similar to what we could generate, to see whether investing more time
> into this is worth it. If we find that it is worth it, we can open a second
> Jira for the actual implementation of the code generation.
> I think we could generate one class at places where we currently instantiate
> {{QuickSort}}. This generated class would include the functionality of
> {{QuickSort}}, {{NormalizedKeySorter}} or {{FixedLengthRecordSorter}},
> {{MemorySegment.compare}}, and {{MemorySegment.swap}}.
> Btw. I'm planning to give this as a student project at a TU Berlin course in
> the next few months.
> Some concrete ideas about how could a generated sorter be faster than the
> current sorting code:
> * {{MemorySegment.compare}} could be specialized for
> ** Length: for small records, the loop could be unrolled
> ** Endiannes (currently it is optimized for big endian; and in the little
> endian case (e.g. x86) it does a Long.reverseBytes for each long read)
> * {{MemorySegment.swapBytes}}
> ** In case of small records, using three {{UNSAFE.copyMemory}} is probably
> not as efficient as a specialized swap, because
> *** We could use total loop unrolling in generated code (because we know the
> exact record size)
> *** {{UNSAFE.copyMemory}} checks for alignment first \[6,9\]
> *** We will only need 2/3 the memory bandwidth, because the temporary storage
> could be a register if we swap one byte (or one {{long}}) at a time
> ** several checks might be eliminated
> * Better inlining behaviour could be achieved
> ** Virtual function calls to the methods of {{InMemorySorter}} could be
> eliminated. (Note, that these are problematic to devirtualize by the JVM if
> there are different derived classes used in a single Flink job (see \[8,7\]).)
> ** {{MemorySegment.swapBytes}} is probably not inlined currently, because the
> excessive checks make it too large
> ** {{MemorySegment.compare}} is probably also not inlined currently, because
> those two while loops are too large
> \[6\] http://www.docjar.com/docs/api/sun/misc/Unsafe.html#copyMemory(Object,
> long, Object, long, long)
> \[7\] https://shipilev.net/blog/2015/black-magic-method-dispatch/
> \[8\]
> http://insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/
> \[9\]
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/cpu/x86/vm/stubGenerator_x86_64.cpp#l2409
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)