On Fri, 2010-10-08 at 12:51 -0700, David Borowitz wrote:
> Another place with a completely different order: Tree.add(mode, name,
> sha). I don't suppose that's one we can safely change, is it?

I guess we could check the types of mode and name and reverse them (and
print a DeprecationWarning) if they're wrong.
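For concreteness, a rough, untested sketch of what that compatibility
shim could look like. The Tree class and its _entries dict below are
illustrative stand-ins, not dulwich's actual implementation:

import warnings


class Tree(object):
    """Minimal stand-in for the real Tree; only the argument shuffle is shown."""

    def __init__(self):
        self._entries = {}  # hypothetical internal storage

    def add(self, name, mode, hexsha):
        # Legacy callers pass (mode, name, sha); detect that by type,
        # swap the arguments, and warn so they can migrate to the new
        # (name, mode, sha) order.
        if isinstance(name, int) and isinstance(mode, str):
            name, mode = mode, name
            warnings.warn(
                "Please use Tree.add(name, mode, hexsha)",
                category=DeprecationWarning, stacklevel=2)
        self._entries[name] = (mode, hexsha)


# A call in the old (mode, name, sha) order still works, but warns:
t = Tree()
t.add(0o100644, "foo", "a" * 40)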
It's not entirely nice, but I think it's the cleanest way of fixing it.

Cheers,

Jelmer

> On Thu, Oct 7, 2010 at 15:07, David Borowitz <[email protected]> wrote:
>     Oh, also, another reason I'm leaning towards namedtuples at this
>     point is that they're fully compatible with the existing API. With
>     a custom class, assuming it's faster (which I doubt), we'd either
>     have to:
>     - Deprecate Tree.iteritems(), making there be *three* methods that
>       do the same thing. (Plus all the code changes to use the new
>       API.)
>     - Implement __iter__, basically turning the custom class into a
>       namedtuple anyway.
>
>     On Thu, Oct 7, 2010 at 14:59, David Borowitz <[email protected]> wrote:
>         I have some benchmark results for namedtuples vs. tuples.
>         First, the microbenchmark results:
>
>         $ python -m timeit -s 'from dulwich.objects import TreeEntry; name = "foo/bar"; mode = 0100644; sha = "a" * 20' 'x = TreeEntry(name, mode, sha)'
>         1000000 loops, best of 3: 0.583 usec per loop
>         $ python -m timeit -s 'name = "foo/bar"; mode = 0100644; sha = "a" * 20' 'x = (name, mode, sha)'
>         10000000 loops, best of 3: 0.0753 usec per loop
>
>         Obviously the tuple constructor should win over the TreeEntry
>         constructor, since the latter is a wrapper around the former,
>         and there's significant Python function call overhead. But hey,
>         0.5us is still pretty fast.
>
>         Then I ran a much bigger macrobenchmark (attached). Basically,
>         I cloned git.git, ran git unpack-objects to explode the repo
>         into loose files, found all the tree SHAs (without parsing the
>         objects), then measured the time to parse all those trees and
>         iterate all their entries. In the inner loop I also assigned
>         all of the tuple/namedtuple values to locals to check for
>         overhead there.
>
>         heapy was used to measure the heap. My workstation is a Core 2
>         Quad with 8GB RAM running Ubuntu 10.04. (Note that the whole
>         heap fit easily in main memory.)
>
>         === namedtuple ===
>         $ python tree_entry_macro.py
>         Read 128900 objects
>         Collected 43000 trees
>         Iterated 42900 trees
>         Collected 9237351 entries in 180.18s. Heap results follow.
>         Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
>          Index     Count  %        Size   %  Cumulative   %  Kind (class / dict of class)
>              0  18452772 50  1273945424  55  1273945424  55  str
>              1   9237351 25   738988080  32  2012933504  87  tuple
>              2   9237352 25   221696448  10  2234629952  96  int
>              3         1  0    83886152   4  2318516104 100  list
>              4         1  0         448   0  2318516552 100  types.FrameType
>              5         2  0          48   0  2318516600 100  float
>
>         === tuple ===
>         $ python tree_entry_macro.py
>         Read 128900 objects
>         Collected 43000 trees
>         Iterated 42900 trees
>         Collected 9237351 entries in 179.49s. Heap results follow.
>         Partition of a set of 36927479 objects. Total size = 2318516600 bytes.
>          Index     Count  %        Size   %  Cumulative   %  Kind (class / dict of class)
>              0  18452772 50  1273945424  55  1273945424  55  str
>              1   9237351 25   738988080  32  2012933504  87  tuple
>              2   9237352 25   221696448  10  2234629952  96  int
>              3         1  0    83886152   4  2318516104 100  list
>              4         1  0         448   0  2318516552 100  types.FrameType
>              5         2  0          48   0  2318516600 100  float
>
>         As expected, memory usage is identical, since namedtuples are
>         just tuples with no additional per-instance memory overhead.
>         Based on the microbenchmark results, the expected overhead for
>         creating 9.2M TreeEntry objects is 9237351 * (0.583 - 0.0753)
>         usec, or about 4.7s, so I'm guessing the time difference is not
>         significant.
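For reference, the namedtuple being benchmarked above is roughly
something like the following; the field names are illustrative and may
not match the eventual TreeEntry in dulwich.objects:

from collections import namedtuple

# Illustrative definition; the real TreeEntry in dulwich.objects may differ.
TreeEntry = namedtuple('TreeEntry', ['path', 'mode', 'sha'])

entry = TreeEntry('foo/bar', 0o100644, 'a' * 20)

# Attribute access, which a plain tuple does not offer:
assert entry.path == 'foo/bar'

# ...and ordinary tuple unpacking, so existing (name, mode, sha) callers
# keep working unchanged:
name, mode, sha = entry
assert (name, mode, sha) == tuple(entry)

That tuple unpacking is the compatibility point being made above: code
that already expects three-tuples keeps working without any deprecation
churn.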
>         On Thu, Oct 7, 2010 at 08:46, David Borowitz <[email protected]> wrote:
>             On Thu, Oct 7, 2010 at 06:57, Jelmer Vernooij <[email protected]> wrote:
>                 On Wed, 2010-10-06 at 14:19 -0700, David Borowitz wrote:
>                 > In the stuff I'm doing on rename detection, I've been
>                 > bitten a few times by the weird ordering of (name,
>                 > mode, sha) tuples coming out of various tree
>                 > iteration methods.
>
>                 > Currently, we have:
>                 > Tree.entries() yields (mode, name, sha)
>                 > Tree.iteritems() and
>                 > BaseObjectStore.iter_tree_contents() yield (name,
>                 > mode, sha)
>                 > index.commit_tree() takes (name, sha, mode)
>
>                 This is all for hysterical raisins, unfortunately. I'm
>                 all for fixing it :-)
>
>                 > We should really standardize this, preferably in one
>                 > of three ways:
>                 > 1. Use (name, mode, sha) everywhere.
>                 > 2. Use a namedtuple of (name, mode, sha).
>                 > 2. Use a new data object.
>                 > I would prefer using a namedtuple; it would make some
>                 > of the code I'm about to write cleaner. It would also
>                 > be backwards-compatible with the appropriate type.
>                 > The downside is that it's another feature that's new
>                 > in 2.6.
>
>                 It doesn't really help that you have two options 2! :-)
>
>                 This is a really performance-sensitive area (at least
>                 for me), so if we go for anything other than a tuple we
>                 should make sure it doesn't have a (significant) impact
>                 on performance.
>
>                 I don't think we should drop support for pre-2.6 Python
>                 yet; we could of course provide a custom implementation
>                 for those who run 2.4 or 2.5, but that seems like more
>                 trouble than (3).
>
>             Augie has told me from experience with hg that data objects,
>             even with slots, have worse memory performance than
>             namedtuples. This is presumably because there's a per-object
>             overhead for the slots, whereas namedtuples have properties
>             that live in the class dict. I'm happy to run some
>             benchmarks.
>
>             Another thing about the namedtuple function is that you can
>             make it print out the class definition it creates, which
>             provides an easy way to backport specific namedtuple types
>             to previous versions.
>
>                 3 seems like the best solution if it doesn't make things
>                 too much slower. Of course, we could give it semantics
>                 similar to what namedtuple would give us.
>
>             I don't mind much either way. It sounds like benchmark
>             results will be the deciding factor.
>
>                 > In my ideal world we would get rid of Tree.entries()
>                 > and change index.commit_tree() to use the standard
>                 > format, but I don't have a sense of how widely either
>                 > of those are used. Thoughts on how to proceed?
>
>                 Changing commit_tree() is probably possible as it's not
>                 very heavily used. Removing Tree.entries() is not
>                 possible I think, at least not without deprecating it
>                 first.
>
>             +1 to officially deprecating it. I thought it was kind of
>             unofficially deprecated already. (There's two functions that
>             are almost identical, one is implemented in terms of the
>             other, and it's only for "historical reasons"--that's pretty
>             much what a deprecated function looks like :)
>
>                 Cheers,
>
>                 Jelmer
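As a footnote on the namedtuple-versus-slots question above, a small
illustrative comparison; the class names here are made up, and the
per-instance sizes should be measured on whichever Python you actually
target rather than trusted from this sketch:

import sys
from collections import namedtuple

# namedtuple variant: instances are ordinary tuples, and field access
# goes through descriptors that live on the class, not on each instance.
TreeEntryTuple = namedtuple('TreeEntryTuple', ['path', 'mode', 'sha'])


class TreeEntrySlots(object):
    """__slots__-based data object, shown only for comparison."""

    __slots__ = ('path', 'mode', 'sha')

    def __init__(self, path, mode, sha):
        self.path = path
        self.mode = mode
        self.sha = sha


nt = TreeEntryTuple('foo/bar', 0o100644, 'a' * 20)
slotted = TreeEntrySlots('foo/bar', 0o100644, 'a' * 20)

# Only the namedtuple unpacks like the existing (name, mode, sha) tuples:
path, mode, sha = nt

# Shallow per-instance sizes; the numbers vary across Python versions and
# builds, so benchmark on the interpreter that matters to you.
print(sys.getsizeof(nt), sys.getsizeof(slotted))

# The backport trick mentioned above: on Python 2.6/2.7,
# namedtuple(..., verbose=True) prints the generated class source, which
# can be pasted into a compatibility module for 2.4/2.5. (The verbose
# flag was removed in Python 3.7.)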

