[
https://issues.apache.org/jira/browse/CASSANDRA-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161161#comment-13161161
]
Evgeny Ryabitskiy commented on CASSANDRA-3545:
----------------------------------------------
Yep, a little bit complicated. That is why it is fully covered by JUnits and
comments.
Cobertura reports:
Classes in this File Line Coverage Branch Coverage Complexity
CollectionSearchUtil 91% 51/56 81% 36/44 6,5
There is NO bunch of copying to a temporary collection. There only one array of
size sqrt(2*N), allocated once and coping is linear (same as iterating).
Let's analyze resources required for this search on N size collection:
Memory usage: == sqrt(2*N)
Array copping: <= N
Iteration ( next() execution): <= N
Compare (with MD5/Column comparator): <= sqrt(2*N)
I have solution (see patch 1) that perform iterating instead of allocation
array, but it will require O(N^2) iterating in worst case.
In second patch memory usage is trade of for only one passage with iterator.
Iteration can be slow, so array is much better.
You can check that for million columns search (pretty big row) it will be array
with length: ~1440. Not too much for such huge search I think .
In case of 10k columns row it will be only 144 length array, with is pretty few.
About getSortedColumns(byte[] startWith):
Yes, it is another good solution. binarySearch is little bit faster in case you
have indexed access to underling Columns (like List or Array).
But there is still one disadvantage: My patch is solving this problem and
changes only few code lines
and this solution requires much more code changes. Lot's of code changes - low
release stability. Sorry for sharing pain in JIRA tickets but 1.0.3 seems to be
last stable release :(
As a compromise I can suggest to apply this patch and add ticket for feature to
cleanup code, move to binary search and new API in ISortedColumns.
> Fix very low Secondary Index performance
> ----------------------------------------
>
> Key: CASSANDRA-3545
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3545
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.7.0
> Reporter: Evgeny Ryabitskiy
> Fix For: 1.0.6
>
> Attachments: CASSANDRA-3545.patch, CASSANDRA-3545_v2.patch,
> IndexSearchPerformance.png
>
>
> While performing index search + value filtering over large Index Row ( ~100k
> keys per index value) with chunks (size of 512-1024 keys) search time is
> about 8-12 seconds, which is very very low.
> After profiling I got this picture:
> 60% of search time is calculating MD5 hash with MessageDigester (Of cause it
> is because of RundomPartitioner).
> 33% of search time (half of all MD5 hash calculating time) is double
> calculating of MD5 for comparing two row keys while rotating Index row to
> startKey (when performing search query for next chunk).
> I see several performance improvements:
> 1) Use good algorithm to search startKey in sorted collection, that is faster
> then iteration over all keys. This solution is on first place because it
> simple, need only local code changes and should solve problem (increase
> search in multiple times).
> 2) Don't calculate MD5 hash for startKey every time. It's optimal to compute
> it once (so search will be twice faster).
> Also need local code changes.
> 3) Think about something faster that MD5 for hashing (like
> TigerRandomPartitioner with Tiger/128 hash).
> Need research and maybe this research was done.
> 4) Don't use Tokens (with MD5 hash for RandomPartitioner) for comparing and
> sorting keys in index rows. In index rows, keys can be stored and compared
> with simple Byte Comparator.
> This solution requires huge code changes.
> I'm going to start from first solution. Next improvements can be done with
> next tickets.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira