The basic idea is to outsource deduplication to Xapian and use git as dumb storage. Based on preliminary tests, this yields huge dividends in object traversal:

  https://public-inbox.org/meta/20180209205140.GA11047@dcvr/
Additionally, insertion time no longer degrades from the giant tree objects which plagued the initial v1 design.

There are also a couple of small fixes along the way to make it tolerate some crap in older archives.

The search indexer and content-based deduplication will still need to be worked on.

Eric Wong (Contractor, The Linux Foundation) (17):
  AUTHORS: add The Linux Foundation
  watch_maildir: allow '-' in mail filename
  scripts/import_vger_from_mbox: relax From_ line match slightly
  import: stop writing legacy ssoma.index by default
  import: begin supporting this without ssoma.lock
  import: initial handling for v2
  t/import: test for last_object_id insertion
  content_id: add test case
  searchmsg: add mid_mime import for _extract_mid
  scripts/import_vger_from_mbox: support --dry-run option
  import: APIs to support v2 use
  search: free up 'Q' prefix for a real unique identifier
  searchidx: fix comment around next_thread_id
  address: extract more characters from email addresses
  import: pass "raw" dates to git-fast-import(1)
  scripts/import_vger_from_mbox: use v2 layout for import
  import: quiet down warnings from bogus From: lines

--
unsubscribe: meta+unsubscr...@public-inbox.org
archive: https://public-inbox.org/meta/