On Tue, May 22, 2012 at 12:35 PM, Stefan Fuhrmann <eq...@web.de> wrote:
>> Having a stable hash function sure doesn't seem like this would
>> account for that reduction.  Can you please elaborate?
>
> Subversion up to and including 1.7 will serialize directories
> as string->string hashes in FSFS.  wordpress.org uses projects
> as the top level of its repository (just like Apache), so every
> commit writes a new version of that directory.  At >26k projects,
> that's >1.4MB per revision.
>
> In 1.8, one may activate directory deltification.  After
> serialization, the resulting text will be deltified just like any
> other node and the result zip-compressed.  Many revisions are now
> about 2KB.  However, that hinges on successive versions of the
> directory producing similar serialized text.  Even a random "shift"
> by a larger number of entries will leave no 64-byte matches (our
> xdelta granularity) within the 100k text windows used by xdelta.
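To restate the mechanism in code - a hypothetical sketch assuming only
APR's public hash API, not the actual FSFS serializer:

/* Hypothetical sketch (not Subversion's real FSFS code) of the
 * mechanism described above: a directory is written out as a
 * string->string hash, in whatever order APR iteration visits it. */
#include <apr_pools.h>
#include <apr_hash.h>
#include <apr_strings.h>

static const char *
serialize_dir(apr_hash_t *entries, apr_pool_t *pool)
{
    apr_hash_index_t *hi;
    const char *text = "";

    for (hi = apr_hash_first(pool, entries); hi; hi = apr_hash_next(hi))
    {
        const void *name;
        void *value;
        apr_hash_this(hi, &name, NULL, &value);

        /* One "name value\n" line per entry.  If adding a single
         * entry (or changing the hash function) shuffles the
         * iteration order of many existing entries, successive
         * revisions of the same directory share no 64-byte runs,
         * and xdelta finds nothing to match within its text window. */
        text = apr_pstrcat(pool, text,
                           (const char *)name, " ",
                           (const char *)value, "\n",
                           (char *)NULL);
    }
    return text;
}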
Thanks for the explanation.  However, this is a very specific use
case - why can't we just use an explicitly sorted data structure here?
I know Greg coded up a red-black tree for pocore... I'm sure there are
plenty of guaranteed-sorted data structures under an ALv2 license.  =)

> Well, I am a developer and reproducibility between test runs
> *does* matter to me.

If ordering matters, then the presentation layer should enforce
ordering.  I would never expect a hash table to produce a sorted
ordering across runs.  (In fact, if it does, it probably means the
implementation has a flaw... which is exactly what happened with APR.)

> On a more general note: we don't use hashes as a means to
> randomize our data.  For us, they are simply containers with
> average O(1) insertion and lookup behavior.  The APR interface
> also allows for iterating over that container - so it *has* an
> ordering, and that ordering has been quite stable under different
> operations and over many versions of APR.

Iterating isn't the same as ordering.

> The change in 1.4.6 did *not* solve the fundamental performance
> problem, but it makes our life harder - at least for a while.
> If we want reproducible UI behavior, we must now eliminate the
> use of hashes in all relevant output functions and replace them
> with e.g. sorted arrays.  That may take some time.

Sure.  (A rough sketch of what I mean by sorting at the presentation
layer is in the P.S. below.)

>> And, the third point doesn't make any sense to me without a
>> further explanation.  -- justin
>
> When we e.g. do an "svn ls -v" (TSVN repo browser), we will
> create and fill the revprop hash for the respective revision
> multiple times for each entry in the directory - just to produce
> a few bytes of output.  The hash function showed up in profiles
> with 10% of the total runtime.

Sorry - I'm not following how a stable hash ordering achieves this; it
sounds like what helps here is reusing a data structure (avoiding
repeated memory allocations or something) - not that iteration is
guaranteed to be stable on every call.

> So, I tuned that.  Because apr_hash_t is so popular in our code,
> that very localized improvement will give (small) returns in
> improved performance all over the place.

Yes, but there may very well be penalties in other places that you
haven't looked at yet.  I know that I've seen shift+incr appear in
traces before - switching to a much more expensive hash algorithm is
going to make those cases worse.  -- justin
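P.S.  To make "sort at the presentation layer" concrete, here is a
minimal sketch against APR's public API.  The helper name is made up
for illustration (Subversion's private svn_sort__hash() does something
along these lines):

/* Hypothetical helper: return a hash's keys as a sorted array, and
 * iterate *that* for any user-visible output, instead of relying on
 * apr_hash_t iteration order. */
#include <stdlib.h>
#include <string.h>
#include <apr_pools.h>
#include <apr_hash.h>
#include <apr_tables.h>

static int
cmp_keys(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static apr_array_header_t *
sorted_keys(apr_hash_t *ht, apr_pool_t *pool)
{
    apr_array_header_t *keys
        = apr_array_make(pool, apr_hash_count(ht), sizeof(const char *));
    apr_hash_index_t *hi;

    for (hi = apr_hash_first(pool, ht); hi; hi = apr_hash_next(hi))
    {
        const void *key;
        apr_hash_this(hi, &key, NULL, NULL);
        APR_ARRAY_PUSH(keys, const char *) = key;
    }

    /* Stable, hash-function-independent order for the UI. */
    qsort(keys->elts, keys->nelts, keys->elt_size, cmp_keys);
    return keys;
}

Output functions that walk the sorted array stay reproducible no matter
what the underlying hash function does.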
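P.P.S.  On the cost point: APR's default hash function is based on the
classic "times 33" algorithm, sketched below.  Since h * 33 is
(h << 5) + h, the inner loop compiles down to a shift plus adds - the
shift+incr pattern that shows up in traces.  A replacement with heavier
mixing pays its extra cost on every insertion and lookup.

/* The classic "times 33" string hash that APR's default hash
 * function is based on (sketch for reference, not APR's exact code). */
static unsigned int
hash_times33(const char *key)
{
    unsigned int h = 0;

    while (*key)
        h = h * 33 + (unsigned char)*key++;

    return h;
}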