On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas < heikki.linnakan...@enterprisedb.com> wrote:
> I think we should rewrite the way we track the parents completely. Rather > than keep a path stack attached to every node buffer, let's just maintain a > second hash table that contains the parent of every internal node. Whenever > a downlink is moved to another page, update the hash table with the new > information. That way we always have up-to-date information about the > parent of every internal node. > > That's much easier to understand than the path stack structures we have > now. I think the overall memory consumption will be about the same too. > Although we need the extra hash table with one entry for every internal > node, we get rid of the path stack structs, which are also one per every > internal node at the moment. > > I believe it is faster too. I added some instrumentation to the existing > gist code (with the additional getNodeBuffer() call added to fix this bug), > to measure the time spent moving right, when refinding the parent of a > page. I added gettimeofday() calls before and after moving right, and > summed the total. In my test case, the final index size was about 19GB, and > the index build took 3545 seconds (59 minutes). Of that time, 580 seconds > (~ 10 minutes) was spent moving right to refind parents. That's a lot. I > also printed a line whenever a refind operation had to move right 20 pages > or more. That happened 2482 times during the build, in the worst case we > moved right over 40000 pages. > > Attached is a patch to replace the path stacks with a hash table. With > this patch, the index build time in my test case dropped from 59 minutes to > about 55 minutes. I don'αΊ— know how representative or repeatable this test > case is, but this definitely seems very worthwhile, not only because it > fixes the bug and makes the code simpler, but also on performance grounds. > Cool, seems that we've both simplier and faster implementation of finding parent. Thanks! > Alexander, do you still have the test environments and data lying around > that you used for GiST buffering testing last summer? Could you rerun some > of those tests with this patch? I think I can restore test environment and data. Will rerun tests soon. ------ With best regards, Alexander Korotkov.