Re: RFC: bitmap storage for precursors and phases

Stanislau Hlebik Sat, 18 Feb 2017 09:34:40 -0800

Excerpts from Jun Wu's message of 2017-02-17 19:14:12 -0800:
> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu <qu...@fb.com> wrote:
> > 
> > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +0000:
> > > > As I said before we will load all non-public revs in one set and all
> > >
> > > The problem is, loading a Python set from disk is O(size-of-the-set).
> > >
> > > Bitmap's loading cost should be basically 0 (with mmap). I think that's 
> > > why
> > > we want bitmap at the first place. There are other choices like packfile
> > > index, hash tables, but bitmap is the simplest and most efficient.
> > >
> > 
> > Hey folks,
> > 
> > I haven't yet seen mention of some considerations that seem very important
> > in driving the decision-making, so I'd appreciate it if someone could fill
> > me in.
> > 
> > Firstly, what's our current understanding of the sizes and compositions of
> > these sets of numbers? In theory, we have a lot of data from practical
> > application at Facebook, but nobody's brought it up.
> > 
> > To clarify why this matters, if an obsstore contains say 100 entries, then
> > a very naive implementation should be fine. If it's related to something
> > else, such as the size of, amount of activity in, or local age of a repo,
> > then maybe we have a different set of decisions to make.
> 
> First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter
> much in practice. I tend to always reason about future-proof solutions,
> which may or may not be a good habit - I'll adjust in the future.
> 
> I think there are multiple topics being discussed:
> 
>   1. How to solve the overhead loading hiddenrevs with minimal changes?
>   2. Why is the bitmap format interesting (future use-cases)?
> 
> For 1,
> 
>   I once tried to make one bit of revlog flag as the source of truth of
>   "hidden", which looks a nice solution. But it got rejected because revlog
>   has to be append-only. Moving the flag bits out formed the bitmap idea
>   naturally.
> 
>   Bitmap fits here but it does not mean we have to use it to solve this
>   particular problem. For example, stateful chg will get us some easy wins,
>   and see the problem analysis and proposed solution below.
> 
>   Note that we actually have a simple caching format for the "hiddenrevs".
>   See repoview.py, computehidden.
> 
>   But what's wrong with "computehidden" is the choice of the cache key - it
>   calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo,
>   'obsolete'), which reads the giant obsstore (in hg-committed) and do some
>   complex computation. The result is then hashed and used as a cache key of
>   "computehidden".
> 
>   So, if we do not want to know the "obsolete()" revisions, but just want to
>   know what's hidden. Changing the cache key would be a very straightforward
>   and effective solution which does not require a new file format.


Hm, changing the cache key to smth like phaseroots + obsstore +
changelog mtime + size  may be an easy and good win. That's a good idea.
Although it will still get invalidated after each write.

> 
>   That'll help commands like "hg st", "hg log -T foo", etc. But it won't
>   help complex "log"s which do require the "obsolete()" revisions. To speed
>   up the latter, we can probably just copy the caching infrastructure from
>   repoview.py.
> 
> For 2,
> 
>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
>   about solutions that scale longer and are not too complex to build. That
>   may or may not be a good habit.
>   
>   That said, I agree if the set is small and perf, then whatever easier to
>   implement makes sense.
>   
>   I have some crazy ideas about the future where a giant bitmap is
>   practically useful. For example, source controlling "visible heads" and
>   the obsstore, then adding some C code to quickly fill a bitmap (requires
>   "set" and "setancestors/descendants"), then we can show the user whenever
>   their repo used to look like back in any time by using the bitmap filter.
>   That'll help investigate hard issues, and implement "hg undo".
> 
> > Secondly, what operations are being performed on these sets today, and do
> > we know if those operations are going to change?
> >
> > For instance, if a may be extended via new elements added to its end, then
> > an mmap-based implementation immediately becomes awkward. If a set is
> 
> Not that bad if we preallocate space.
> 
> > traversed or we perform operations like union or intersection, then a
> > bitmap might make sense if the set is very small and dense, but not if it's
> > big and sparse.
> 
> Not necessarily true if we use an advanced variant, like the "roaringbitmap"
> mentioned by you last year [1].
> 
> [1]: 
> https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087864.html
> 
> > (Jun's idea that mmap is effectively free is also not borne out in
> > practice, but that's a second-order issue here.)
> > 
> > So tell us what the needed API is, what the characteristics of the data
> > are, and that'll help understand how to make a good decision.
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Re: RFC: bitmap storage for precursors and phases

Reply via email to