nfsantos commented on PR #1616: URL: https://github.com/apache/jackrabbit-oak/pull/1616#issuecomment-2264692270
> (Same as before), this could cause out-of-memory. > > What about, to avoid possible OOME, and I think it would be easier to understand: > > * Instead of a hash map, use an array (fixed array of eg. 16 K entries) > * Calculate the index from the hash (using bitwise and) > * Store the entry in the array if it doesn't match > * Use the entry in the array if it does > > That would be even faster I'm pretty sure. > > The memory usage would be at most ~ 16 * 1024 * 4 bytes or so (less than 1 MB). > > And we could store all the (small) path elements, and no longer have to worry about "startsWith" etc. I believe this strategy is already implemented in the `oak-store-document` module, here: https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/StringCache.java I think in this particular case that strategy would not work as well. The frequency of Strings of the path segments is extremely skewed, with a few words appearing in the majority of paths. These are the words of the top level paths (for instance, `/content/dam/<companyname>`) or the paths created by JCR which contain segments like `jcr:content`. To get the most of interning, these Strings must all be cached. The algorithm you suggest does not guarantee that these very common Strings will all be cached. For instance, in an extreme case where all common Strings hash into the same slot in the array, they would all be competing with each other. And another issue is that if we remove the guard that limits the Strings that can be cached, then the large number of infrequent Strings that appear on paths would displace from the table the very common Strings. I have bounded the HashMap to a max of 1024 entries, to avoid OOME. It could still happen if those 1024 Strings are huge, but if that was the case, Oak would run out of memory much before, because there are many other parts of the code that assume it's possible to store in memory more than 1024 full paths. In the tests that I ran against two medium-sized Flat File Stores taken from real OAK repositories, when scanning 10 million paths and storing in memory an array with all SortKeys, at the end there were around 50 Strings in the HashMap. And interning these Strings reduced the size of the heap at the end of the scanning by around 800MB. Interestingly, sorting the array of 10 million SortKeys was faster with the interned Strings, it took around 3s instead of 5s. I'm assuming that it is because `String.equals()` first checks if the objects compared are the same (`==`) before comparing the value. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
