Eric Wong <e...@80x24.org> wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences.  Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
> 
> 2-38         ~5 minutes
> 2-2-2-34     ~30 seconds
> 2-2-36       ~30 seconds
> 1-file       ~5 seconds
> 
> Smaller trees are way faster :)

The LKML 2000-2017 archives (16GB uncompressed mboxes) I have
are 6.3G of objects with 1-file storage in git and took around
33 minutes to do a full import utilizing a single core and
single git repo (no deduplication checks).

"git repack -adb" takes about 2 minutes

"git rev-list --objects --all |wc -l"
takes around 1 minute with over 8 million objects

As a baseline, pure Perl parsing of the mboxes (no writing to
git) was around 23 minutes on a single core; so git-fast-import
does add some overhead but probably not as much as Xapian will
add for the initial import.

The v1 2-38 code slowed to a crawl as more data got into the
repo and I gave up after it hit 18G and hit snags with
badly-formatted dates (worked around by relying on Date::Parse
instead of git's RFC2822 parser).

Side note:  Using 4 parallel processes for the parse-only tests
took around 10.5 minutes, while 2 processes took around 11-12
minutes.  Then I realized 2 of the 4 processors were hyperthreads
(HT), so it appears HT doesn't help much with Perl parsing...
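
For reference, the arithmetic behind the timings above can be
written out as a small sketch.  The 11.5-minute figure is an
assumed midpoint of the quoted 11-12 minute range, and the
"overhead" is simply the difference between the full import and
the parse-only run, so it covers everything the import does
beyond parsing, not only git-fast-import itself.

```python
# Timings quoted above, in minutes.
PARSE_ONLY_1_PROC = 23.0   # pure Perl parsing, single core
FULL_IMPORT_1_PROC = 33.0  # full import into git, single core
PARSE_ONLY_2_PROC = 11.5   # assumption: midpoint of the quoted 11-12 minutes
PARSE_ONLY_4_PROC = 10.5   # 4 processes, 2 of them on hyperthreads


def speedup(baseline_minutes: float, parallel_minutes: float) -> float:
    """Return how many times faster the parallel run was."""
    return baseline_minutes / parallel_minutes


# Extra wall-clock time of the full import over parse-only.
import_overhead = FULL_IMPORT_1_PROC - PARSE_ONLY_1_PROC
speedup_2 = speedup(PARSE_ONLY_1_PROC, PARSE_ONLY_2_PROC)
speedup_4 = speedup(PARSE_ONLY_1_PROC, PARSE_ONLY_4_PROC)

print(f"import overhead: {import_overhead:.0f} minutes")
print(f"2 processes: {speedup_2:.2f}x, 4 processes: {speedup_4:.2f}x")
```

Going from 2 to 4 processes barely moves the speedup, which is
consistent with the two extra processes landing on hyperthreads
rather than real cores.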

> In other words, git scales infinitely well to deep history
> depths, but not to breadth of large trees[1].

Serving ~6G for clones is still a lot of bandwidth, so
partitioning the git repos to limit the size of each clone
seems worth it.

Yearly partitions are probably too frequent and we'd end up with
too many packs (and consequently more open FDs, cache misses, and
metadata stored in Xapian).  I think partitioning based on
message count or size might be a better metric for splitting, as
LKML seems to get more traffic year after year.
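
A minimal sketch of size-based splitting, assuming messages
arrive in import order and only their byte sizes matter.  The
function name `partition_by_size` and the `max_bytes` threshold
are hypothetical and not part of public-inbox; a message-count
limit would work the same way with a counter in place of the
byte total.

```python
from typing import Iterable, Iterator, Tuple


def partition_by_size(sizes: Iterable[int],
                      max_bytes: int) -> Iterator[Tuple[int, int]]:
    """Yield (partition_number, size) for each message, in order.

    A new partition starts when adding the next message would push
    the current one past max_bytes.  A single oversized message
    still gets a partition of its own.
    """
    part = 0
    used = 0
    for size in sizes:
        if used > 0 and used + size > max_bytes:
            part += 1
            used = 0
        used += size
        yield part, size


# Hypothetical message sizes in bytes, with a 1000-byte limit.
assignments = list(partition_by_size([400, 400, 400, 1500, 100], 1000))
partitions = [part for part, _size in assignments]
print(partitions)
```

Unlike calendar-based splitting, the partition boundaries here
follow traffic volume, so busy years produce more partitions
and quiet years fewer.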

> Marking spam and handling message removals might be a little
> trickier as chronology will have to be taken into account...
> (will post more on this, later)

Keeping track of everything in Xapian while walking backwards
through git history shouldn't be a big problem, actually.
(Xapian has read-after-write consistency)

However, trying to reason about partitioning of Xapian DBs
across time/message-count boundaries was making my head hurt and
now I'm not sure if it's necessary to partition Xapian DBs.

While normal text indexing is trivial to partition and
parallelize, associating messages with thread IDs requires
"global" knowledge spanning all partitions (since mail threads
span multiple periods).  Unfortunately, this requires much
synchronization and synchronization hurts parallelism.

Partitioning Xapian DBs is useful to speed up full-indexing and
not much else.  Full-(re)indexing is a rare event, and can be
done on a cold DB while the hot one is taking traffic.  In fact,
I would expect lookups on partitioned DBs to be slower since each
lookup has more files to go through and has to map things like
internal document_ids to non-conflicting ones.
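
As an illustration of that mapping step, one common way to
combine per-partition document IDs into non-conflicting global
ones is to interleave them.  This is a sketch of the general
technique with hypothetical function names; the mapping Xapian
actually uses internally may differ.

```python
from typing import Tuple


def to_global(local_docid: int, shard: int, num_shards: int) -> int:
    """Interleave 1-based local IDs from num_shards partitions."""
    return (local_docid - 1) * num_shards + shard + 1


def to_local(global_docid: int, num_shards: int) -> Tuple[int, int]:
    """Invert to_global(); return (local_docid, shard)."""
    quotient, shard = divmod(global_docid - 1, num_shards)
    return quotient + 1, shard


# With 3 partitions, local ID 1 in shards 0, 1, 2 maps to 1, 2, 3,
# and local ID 2 in shard 0 maps to 4.
print([to_global(1, shard, 3) for shard in range(3)])
print(to_global(2, 0, 3), to_local(4, 3))
```

The arithmetic is cheap, but it is extra work on every result
that a single unpartitioned DB does not need.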

Also, we don't serve Xapian data to be cloned; which is the main
reason to do partitioning of git storage...
--
unsubscribe: meta+unsubscr...@public-inbox.org
archive: https://public-inbox.org/meta/
