Excellent thoughts! We need to find someone in our community with a
sorting wish.....
On 10/28/17 2:26 PM, Chen Luo wrote:
Another sort performance issue to notice is the "normalized key". When we
sort frames of tuples, we build a directory of pointers to avoid memory
movements. Each record pointer has four integers, i.e., *frame Id, tuple
start, tuple end, normalized key*. The normalized key is critical for speeding
up comparisons by avoiding random memory accesses. On 1GB of random records,
using normalized keys can improve sort performance by *3-4x*. However, the
normalized key idea is underutilized at the AsterixDB layer:
- The current length of the normalized key is only 4 bytes, which limits its
applicability to longer (e.g., time/long/double) but still fixed-length types.
- The normalized key only skips memory accesses when two keys are different.
When records have a lot of duplicate keys, e.g., when ordering on some
stateID or countyID, the current mechanism won't help much.
- Only a limited number of types have implemented this normalized key
computer (see [1]). For example, *UUID does not support it*, which makes
sorting on UUID keys much slower. Ideally, every sortable type should have
a corresponding mechanism to compute a normalized key.
These issues need to be resolved (maybe later) to improve sort performance.
Otherwise, even the best cache-friendly sorting algorithm won't help much,
since each comparison would incur multiple cache misses.
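To make the mechanism concrete, here is a rough, self-contained sketch of how
the pointer directory can use the 4-byte prefix. The names (PointerEntry,
fullCompare) are made up for illustration and are not the actual Hyracks
classes; the real interface lives behind the normalized key computer in [1].

import java.util.Arrays;
import java.util.Comparator;

public class NormalizedKeySketch {

    // One directory entry per tuple: where the tuple lives, plus a 4-byte prefix
    // of its sort key so most comparisons never touch the data frames at all.
    // Hypothetical type, not the actual Hyracks pointer layout.
    static final class PointerEntry {
        final int frameId;
        final int tupleStart;
        final int tupleEnd;
        final int normalizedKey;   // order-preserving 4-byte prefix of the key

        PointerEntry(int frameId, int tupleStart, int tupleEnd, int normalizedKey) {
            this.frameId = frameId;
            this.tupleStart = tupleStart;
            this.tupleEnd = tupleEnd;
            this.normalizedKey = normalizedKey;
        }
    }

    // Compare two tuples by dereferencing their frames (random memory accesses).
    // Stubbed out here; in the engine this would read the full keys from the frames.
    static int fullCompare(PointerEntry a, PointerEntry b) {
        return 0; // placeholder: pretend the full keys are equal
    }

    static final Comparator<PointerEntry> CMP = (a, b) -> {
        // Cheap path: compare the 4-byte prefixes stored inside the directory.
        // Compared as unsigned here; the real encoding depends on the key computer.
        int c = Integer.compareUnsigned(a.normalizedKey, b.normalizedKey);
        if (c != 0) {
            return c;
        }
        // Prefixes tie (e.g., many duplicate keys): fall back to the expensive
        // comparison that actually visits the data frames. This is exactly the
        // case where the current 4-byte normalized key stops helping.
        return fullCompare(a, b);
    };

    public static void main(String[] args) {
        PointerEntry[] dir = {
            new PointerEntry(0, 0, 40, 0x00C0FFEE),
            new PointerEntry(0, 40, 88, 0x00000042),
            new PointerEntry(1, 0, 64, 0x00C0FFEE),
        };
        Arrays.sort(dir, CMP); // sorts the pointer directory, not the tuples
        for (PointerEntry e : dir) {
            System.out.printf("frame=%d start=%d key=0x%08X%n",
                    e.frameId, e.tupleStart, e.normalizedKey);
        }
    }
}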
[1]
https://github.com/apache/asterixdb/blob/b0595873432ccfb298d780da92151bc9dd2f6840/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/NormalizedKeyComputerFactoryProvider.java
On Sat, Oct 28, 2017 at 1:50 PM, Chen Luo <[email protected]> wrote:
That can be done in the Hyracks layer, but enabling Quicksort in AsterixDB has
other issues. Right now AsterixDB only uses merge sort because of its
stability. If we switch to Quicksort, we have to make sure all the other
operators are happy with this change (some operators are OK with it, but
some are not).
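As a small illustration of what stability buys downstream operators (the Row
type below is hypothetical, not an AsterixDB tuple): if the input already
arrives ordered on a primary key, a stable sort on a secondary key keeps ties
in primary-key order, and some consumers rely on exactly that.

import java.util.Arrays;
import java.util.Comparator;

public class StabilitySketch {

    static final class Row {
        final int primaryKey;
        final String county;
        Row(int primaryKey, String county) { this.primaryKey = primaryKey; this.county = county; }
        @Override public String toString() { return primaryKey + ":" + county; }
    }

    public static void main(String[] args) {
        // Input already ordered on primaryKey (say, the output of a primary index scan).
        Row[] rows = {
            new Row(1, "Orange"), new Row(2, "Los Angeles"),
            new Row(3, "Orange"), new Row(4, "Orange"),
        };

        // Arrays.sort on an object array is stable (it is TimSort under the hood),
        // so the three "Orange" rows keep their primaryKey order after sorting by
        // county, and a consumer that relies on that secondary ordering still works.
        Arrays.sort(rows, Comparator.comparing((Row r) -> r.county));
        System.out.println(Arrays.toString(rows));
        // A plain (unstable) Quicksort makes no such promise for equal keys, which
        // is why each downstream operator has to be checked before switching.
    }
}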
On Sat, Oct 28, 2017 at 12:05 PM, Mike Carey <[email protected]> wrote:
So the word on the web seems to be that Quicksort is generally
superior and cache-friendly (cache-oblivious). Wondering if we should just
get our Quicksort code under control?
On 10/28/17 11:36 AM, Chen Luo wrote:
Not exactly sure about this right now... But since TimSort is essentially a
combination of insertion sort and merge sort, its cache-friendliness won't
be worse than our merge sort's.

This TimSort could serve as a short-term plug-and-play improvement to our
sorting algorithm. It is still stable, the same as our current merge sort,
but faster, especially on partially ordered datasets.
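For a quick intuition check on the "partially ordered" point, here is a rough
micro-benchmark sketch (no warmup or JIT/GC control, and not a substitute for
a Hyracks-level experiment) using the JDK's own TimSort via Arrays.sort on an
object array:

import java.util.Arrays;
import java.util.Random;

public class TimSortRunsSketch {

    static Integer[] random(int n, Random rnd) {
        Integer[] a = new Integer[n];
        for (int i = 0; i < n; i++) a[i] = rnd.nextInt();
        return a;
    }

    static Integer[] nearlySorted(int n, Random rnd) {
        Integer[] a = new Integer[n];
        for (int i = 0; i < n; i++) a[i] = i;
        for (int i = 0; i < n / 100; i++) {          // perturb roughly 1% of positions
            int j = rnd.nextInt(n), k = rnd.nextInt(n);
            Integer t = a[j]; a[j] = a[k]; a[k] = t;
        }
        return a;
    }

    static long timeSortMillis(Integer[] a) {
        long start = System.nanoTime();
        Arrays.sort(a);                              // TimSort for object arrays
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 2_000_000;
        System.out.println("random input:        " + timeSortMillis(random(n, rnd)) + " ms");
        System.out.println("nearly sorted input: " + timeSortMillis(nearlySorted(n, rnd)) + " ms");
    }
}

The nearly sorted case finishes much faster because TimSort detects and reuses
the existing runs instead of breaking them up.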
Best regards,
Chen Luo
On Sat, Oct 28, 2017 at 10:58 AM, Mike Carey <[email protected]> wrote:
How is it on cache-friendliness?
On 10/27/17 11:38 PM, abdullah alamoudi wrote:
While I have no answer to the question of legality, this sounds great.
~Abdullah.
On Oct 27, 2017, at 9:20 PM, Chen Luo <[email protected]> wrote:
Hi devs,
I have adapted the TimSort algorithm used in the JDK (java.util.TimSort)
into Hyracks, which gives 10-20% performance improvements on random data. It
will be even more useful when the input data is partially sorted, e.g.,
primary keys fetched from a secondary index scan, which I haven't had time to
experiment with yet.
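To give an idea of what "adapting" means here: java.util.TimSort sorts an
Object[] with a Comparator, while a frame-based sorter has to order tuple
pointers and swap directory entries instead. A rough sketch of the kind of
abstraction such a port sorts against (interface and names invented for
illustration, not the actual Hyracks sorter classes):

// Hypothetical abstraction: the port only needs "compare two positions" and
// "swap two positions" over the tuple-pointer directory, never the tuples.
public interface SortableDirectory {
    int size();
    int compare(int i, int j); // compares the tuples that positions i and j point to
    void swap(int i, int j);   // swaps the directory entries, not the tuple data
}

// Plain insertion sort over positions, written against the abstraction instead
// of Object[]. (TimSort proper extends short runs with a binary insertion sort,
// but the shape of the adaptation is the same.)
final class RunExtender {
    static void insertionSort(SortableDirectory d, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            for (int j = i; j > lo && d.compare(j - 1, j) > 0; j--) {
                d.swap(j - 1, j);
            }
        }
    }
}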
*Before going any further, is it legal to adapt an algorithm
implementation from the JDK into our codebase?* I saw that the JDK
implementation itself is adapted from
http://svn.python.org/projects/python/trunk/Objects/listsort.txt as well.
Best regards,
Chen Luo