I have been digging through the code so that I can understand the v2
format.  I have some ideas on how things might be improved, and some
questions to check my understanding.

V1 supported the concept of messages being added to and deleted from
the git repository, all while keeping a full history of everything that
went on.  The V2 code appears to use the name 'm' for added and 'd' for
deleted, but the public-inbox-index code appears to expect deletes to
happen by way of an altered history that totally purges the commits,
and does not process the 'd' entries.

What is the thinking about deleted entries, and for v2 what is the
preferred way to delete mail from a public inbox git repository and why?

Size.  Reading the history of the public-inbox meta mailing list and
playing around, I discovered that I can shave about 100M off the V2
size of the public-inbox git repository for the git list by pushing all
of the messages into a single commit.  Not great for day to day
operation, but if rebases are part of the plan, and old archives part
of the challenge, I see quite a lot of potential for old archives to be
reduced to a git repository with a single commit.

Names.  Is there a good reason not to use message numbers as the names
in the git repositories?  (Other than the cost to change the code?) That
would remove the need to treat the sqlite msgmap database as precious,
and it would make it easier to recover if an nntp server goes away.  In
V2 format the git mailing list git repository is only about 2M larger if
each message has its msg number as its name.  Plus the git log
is easier to read as messages are all + or -.
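
To make the msgmap point concrete, here is a rough sketch of how the
number to Message-ID mapping could be rebuilt from the tree alone if
every blob were named by its message number.  This is only an
illustration in Python, not how public-inbox does anything today: the
function name and the dict standing in for a git tree are made up.

```python
from email import message_from_bytes

def rebuild_msgmap(tree):
    """Rebuild {msgnum: Message-ID} from a tree of blobs.

    `tree` is a hypothetical stand-in for a git tree: a dict mapping
    each blob name (the message number as a decimal string) to the raw
    message bytes.
    """
    msgmap = {}
    for name, raw in tree.items():
        msg = message_from_bytes(raw)
        mid = msg["Message-ID"]
        if mid is None:
            continue  # no Message-ID header; a real tool would need a fallback
        msgmap[int(name)] = mid.strip().strip("<>")
    return msgmap

# Two tiny made-up messages named by their message numbers.
example_tree = {
    "1": b"Message-ID: <first@example.com>\nSubject: a\n\nbody\n",
    "2": b"Message-ID: <second@example.com>\nSubject: b\n\nbody\n",
}
example_map = rebuild_msgmap(example_tree)
```

With names like these the sqlite database becomes a cache that can be
regenerated rather than something that has to be preserved.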

Xapian.  Can the Xapian database be made optional in V2?  I absolutely
think a quick search for terms and other things is very valuable, so I
would never suggest giving up Xapian.  On the other hand, on my personal
laptop the Xapian database for lkml takes ages and ages to build, and it
pushes the system into swap, which is all around unpleasant.  That
seems to eat into the distributed nature of the goal of public-inbox.

I have tried to see what could be done that might shrink the size of
the Xapian database.  The only thing I could think of is perhaps
sharding the Xapian database by time/msgnum ranges.  That would allow
the old Xapian databases to be compacted and forgotten about, and I
think it would allow less wastage in the current Xapian database as it
would be smaller, so wasting 50% space (or whatever the btrees waste)
would be less of an issue.  And as smaller databases are faster, I think
that would in general be a help.
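
As a rough sketch of what I mean by sharding on msgnum ranges (written
in Python just for illustration; the shard size and the function names
are made up and nothing here exists in the current code):

```python
SHARD_SIZE = 100_000  # hypothetical number of messages per shard

def shard_for(msgnum, shard_size=SHARD_SIZE):
    """Return the index of the shard that holds msgnum (msgnums start at 1)."""
    if msgnum < 1:
        raise ValueError("msgnum must be >= 1")
    return (msgnum - 1) // shard_size

def shard_range(shard, shard_size=SHARD_SIZE):
    """Return the (first, last) msgnum covered by a shard, inclusive."""
    first = shard * shard_size + 1
    return first, first + shard_size - 1

def closed_shards(max_msgnum, shard_size=SHARD_SIZE):
    """Shards strictly below the one holding the newest message.

    Assuming msgnums only ever grow, these shards receive no new
    messages and are the candidates for compaction.
    """
    return list(range(shard_for(max_msgnum, shard_size)))
```

Only the shard holding the newest message would need to stay writable;
everything older could be compacted once and then left alone.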

Time permitting I am willing to do some of this work so that
public-inbox works well for me.  I want to see what your vision is for
the code before I start anything.


unsubscribe: meta+unsubscr...@public-inbox.org
archive: https://public-inbox.org/meta/
