On Wed, 02 Dec 2015 05:58:24 +0000, Navin wrote:
> Nice to see this interesting post and learn.
>
> I have a few questions.
>
> 1) This is offline datastructure since you don't know how the elements
> of the future are going to be ie dynamic. ie later elements from n to 2n
> can break or change your heaps as such in worst case or is it a dynamic
> data structure ?
You can usually change an offline data structure into an amortized online one. (I think.) You can usually change an amortized online algorithm into a deamortized one with a series of ugly hacks and a large increase in constants. I think you can make insertions work, without being too terrible, with an amortized algorithm. Something like this:

To insert a value X into the square heap, find the largest heap with a head equal to X. If there is none, find the smallest heap with a head greater than X. If there is still none, choose the final heap. If that heap is not full, insert X into it. If it is full, trigger a rebalance.

A rebalance is the operation we use to ensure that each intermediate heap is exactly half full (rounding down) and the final heap is no more than half full. We start at a given heap and calculate its surplus -- the number of elements it holds in excess of half its capacity. We scan to the right (toward larger heaps) until we find a heap with enough spare capacity to hold the surpluses of all the heaps we've scanned so far; if we reach the end of the square, we create a new heap. This is the target heap. Then we repeatedly bubble values up from lower heaps to higher ones until no heap between the one we tried to insert into and the target heap has any surplus.

In the worst case this costs something like O(n * log(n) * sqrt(n)) -- I haven't done the full analysis. But after a large rebalance you shouldn't need another large rebalance for quite a while, so the average case, even on adversarial data, should not be too horrible.

Deletions are simple: find the element and remove it from its heap. This can, however, leave a heap less than half full. (It also means an adversary can interleave insertions and deletions to force all elements into one heap.) A similar rebalance algorithm is needed, but it's harder on this side: we're using max heaps everywhere, and here we need to refill lower heaps.

> 2) Searching in min or max heaps is bad isn't it ?
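To make the insertion idea concrete, here's a minimal sketch. Everything about the representation is my own assumption, not from the post: heap i has capacity (i+1)^2, each heap is a max-heap stored heapq-style with negated values (so -h[0] is its head), and the rebalance is simplified to "bubble the maximum of any overfull heap into the next heap" rather than the half-full invariant described above.

```python
import heapq

class SquareHeap:
    """Sketch of a square heap: max-heaps of increasing capacity.

    Assumptions (mine, not the post's): heap i holds at most (i+1)**2
    elements, and each heap is a heapq list of negated values, so
    -h[0] is that heap's head (its maximum).
    """

    def __init__(self):
        self.heaps = [[]]

    def _capacity(self, i):
        return (i + 1) ** 2

    def insert(self, x):
        # Route x to the smallest heap whose head is >= x;
        # if no head qualifies, fall back to the final heap.
        target = len(self.heaps) - 1
        for i, h in enumerate(self.heaps):
            if h and -h[0] >= x:
                target = i
                break
        heapq.heappush(self.heaps[target], -x)
        self._rebalance(target)

    def _rebalance(self, i):
        # Simplified rebalance: while a heap exceeds its capacity,
        # bubble its maximum up into the next (larger) heap, growing
        # the square with a fresh heap at the end if necessary.
        # (The post's version instead keeps every intermediate heap
        # exactly half full; this only enforces the capacity bound.)
        while i < len(self.heaps):
            if len(self.heaps[i]) <= self._capacity(i):
                break
            biggest = -heapq.heappop(self.heaps[i])
            if i + 1 == len(self.heaps):
                self.heaps.append([])
            heapq.heappush(self.heaps[i + 1], -biggest)
            i += 1
```

Because insertions route by head and rebalances only push maxima rightward, heads stay non-decreasing across the square, which is what makes the per-heap search below cheap.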
Linear time in the worst case. If you're searching a max heap for something that happens to be its smallest element, it can be in any leaf node, which leaves you half the tree to look through -- and the average case isn't much better. However, if we can guarantee that an element lives in one particular heap inside the square heap, searching that heap costs O(sqrt n), since it has no more than sqrt(n) elements.

> Lets say we choose
> max heaps. Now we have the root as 10^9 in the second last heap ie
> around n^2 elements. The children of it are 4*10^8 and 5*10^8 . If i'm
> searching for say 4.5 *10^8 my job is easy but if i'm searching for
> 1000, i have to search in both the subtrees and it goes linear and
> becomes around n^2 in the worst case. Did i overlook anything ?
>
> Instead of heaps, a single sorted array or breaking them into a series
> of sorted arrays ie skip lists kind of stuff would be fine if it just
> want a offline Data structure ?

A skiplist is great. The problem is that it's random access to memory. That's bad enough at the best of times, and it's utter garbage if you're storing data on spinning disks. Even if you store it as a sorted array rather than a linked list -- which means you never insert anything, ever -- it's O(log n / B) seeks and reads to find an element. If you store it as a linked list, it's O(log n) reads. A square heap with an insertion algorithm like the one I describe gives you O(1) seeks and O(sqrt(n)/B) reads.

(Here, B stands for block size: the number of data elements you get for free, or almost for free, after reading one byte. When I was doing data structure research, the rule of thumb was that a block is roughly 1MB for spinning disks; divide that by element size to get B. We track seeks separately because they're expensive -- 9ms is a reasonable seek time. Reading data sequentially, by comparison, is extremely cheap, especially if you can inform the disk scheduler that you're going to perform sequential reads.
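The "search exactly one heap" claim can be sketched as a standalone function. The representation is an assumption on my part: each heap is a heapq-style list of negated values, heads (maxima) are non-decreasing from left to right, and insertion routed every element to the first heap whose head was >= it.

```python
def search_square_heap(heaps, x):
    """Return True if x is in the square heap.

    Assumes (my convention, not the post's): `heaps` is a list of
    max-heaps stored as negated heapq lists, their heads (-h[0]) are
    non-decreasing, and every element was routed to the first heap
    whose head is >= it.  So only ONE heap ever needs scanning.
    """
    for h in heaps:
        if h and -h[0] >= x:
            # Linear scan of a single heap -- at most sqrt(n)
            # elements by the post's sizing argument.
            return x in (-v for v in h)
    return False  # x exceeds every head, so it can't be present
```

With heads kept in memory (as described below for the on-disk case), the loop over heads is free and the cost is one scan of one heap.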
Seeks are expensive no matter what, even if you can inform the scheduler in advance -- and a skiplist doesn't let you do that. Even if you switch to SSDs, sequential reads tend to be ~4x faster than random reads.)

So suppose you have 64-byte records, and you've got a billion of them on disk. A skiplist costs you sixteen seeks on average, which is 144ms. (You can shave that a bit by keeping the top levels of the skiplist in memory.) The square heap, on the other hand? You've got a total of ~1500 heaps, so you can keep their offsets and heads in memory. That means you can identify the heap containing the element you're searching for without touching disk, and scanning that heap costs one seek plus a sequential scan. You're scanning 700K elements on average, which means reading forty or so blocks.

So the square heap could well be more efficient here. It costs ~3x the number of reads, but they're all sequential, so while it's close, the square heap probably wins on average. But this isn't yet a viable real-world algorithm for most purposes, because the amortized inserts promise to be expensive even after they're deamortized -- and I haven't even sketched an algorithm for deletes. At least range queries should typically be reasonably fast.

> or is this some domain specific data Structure where you only/cre want
> max/min in some sequence ?
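For the record, the back-of-envelope arithmetic a few paragraphs up checks out. The constants below are the post's own figures (9ms seeks, 1MB blocks, 64-byte elements, a billion of them, sixteen skiplist seeks, 700K elements scanned); the heap-count formula assumes heap i holds roughly i^2 elements, which is my guess at a layout consistent with the "~1500 heaps" figure.

```python
import math

SEEK_MS = 9.0            # post's seek time
BLOCK_BYTES = 1 << 20    # ~1 MB block, post's rule of thumb
ELEM_BYTES = 64          # post's record size
N = 10 ** 9              # post's element count

elems_per_block = BLOCK_BYTES // ELEM_BYTES            # B = 16384

# Skiplist: the post's figure of sixteen random seeks on average.
skiplist_ms = 16 * SEEK_MS                             # 144 ms in seeks alone

# Square heap: if heap i holds ~i**2 elements, sum i**2 ~ k**3 / 3 = N,
# so k = (3N)**(1/3) heaps -- an assumption matching "~1500 heaps".
num_heaps = round((3 * N) ** (1 / 3))

# Scanning the post's 700K average elements sequentially:
blocks_read = math.ceil(700_000 * ELEM_BYTES / BLOCK_BYTES)

print(elems_per_block, skiplist_ms, num_heaps, blocks_read)
```

Running it reproduces the post's numbers: 144ms of skiplist seeks, roughly 1400-odd heaps, and "forty or so" (43) sequential blocks per square-heap search.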