Duplicate detection based on `content_id' now works and rejects
obviously re-sent messages with the same Message-Id.

Since many historical messages already have multiple Message-Ids (some
from buggy versions of git-send-email), we will inject Message-Ids as
needed to differentiate messages with the SAME Message-Id.  This
prevents NNTP readers from missing out on messages.
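
A rough sketch of the add-time decision described above, in Python
rather than the project's Perl.  The `content_id` stand-in, the
`add_message` helper, the header subset being hashed, and the
`example.invalid` domain are all assumptions made for illustration,
not the actual implementation:

```python
import hashlib
import uuid

def content_id(headers, body):
    # Hypothetical stand-in: hash a few headers plus the body.
    h = hashlib.sha256()
    for name in ("Subject", "From", "Date", "Message-Id"):
        for value in headers.get(name, []):
            h.update(f"{name}:{value}\n".encode())
    h.update(body.encode())
    return h.hexdigest()

def add_message(store, headers, body):
    # store maps Message-Id -> set of content ids already seen under it.
    # Returns the Message-Id to use, or None for a rejected re-send.
    cid = content_id(headers, body)
    mids = list(headers.get("Message-Id", []))
    if any(cid in store.get(mid, set()) for mid in mids):
        return None  # same Message-Id and same content: a re-send
    unused = [mid for mid in mids if mid not in store]
    for mid in mids:
        store.setdefault(mid, set()).add(cid)
    if unused:
        return unused[0]  # at least one Message-Id is still unique
    # Every Message-Id is taken by different content: inject a new one.
    new_mid = f"<{uuid.uuid4().hex}@example.invalid>"
    headers["Message-Id"] = [new_mid] + mids  # generated ID goes first
    store[new_mid] = {cid}
    return new_mid
```

A message whose Message-Id and content both match an earlier one is
dropped; one that only shares the Message-Id is kept under a freshly
generated Message-Id.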

Internally, the Message-Id we _favor_ for NNTP is also the one which
gets used for rendering threads.

Excessively long Message-Ids are just truncated to 244 bytes for now
(Xapian limit for terms).  I hope it's not an abuse vector going
forward (only one spam message used it), but this is another problem
our inject-new-Message-Id-on-duplicate scheme "solves".
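
The truncation could look roughly like this.  It is a Python sketch,
not the project's Perl; `truncate_mid` and `MAX_MID_SIZE` are
hypothetical names, and counting the limit in UTF-8 bytes is an
assumption:

```python
MAX_MID_SIZE = 244  # figure taken from the text above

def truncate_mid(mid: str, limit: int = MAX_MID_SIZE) -> str:
    raw = mid.encode("utf-8")
    if len(raw) <= limit:
        return mid
    # Cut at the byte limit; drop a partial multi-byte character, if any.
    return raw[:limit].decode("utf-8", errors="ignore")
```

Two different over-long Message-Ids may collide after truncation, at
which point they are handled like any other duplicate Message-Id.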

Internal timestamps used for sorting now favor the first (last-added)
Received: header, since it is more likely to be correct than the Date:
header.
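
As an illustration of that preference, here is a sketch in Python (not
the project's Perl).  `msg_timestamp` is a hypothetical helper, and
taking the text after the last ";" of the Received: header as its date
is an assumption about well-formed headers:

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

def msg_timestamp(raw: str):
    msg = message_from_string(raw)
    candidates = []
    received = msg.get_all("Received") or []
    if received:
        # The topmost Received: header is the one added last.
        candidates.append(received[0].rpartition(";")[2].strip())
    if msg["Date"]:
        candidates.append(msg["Date"])  # fallback only
    for value in candidates:
        try:
            dt = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # unparsable, try the next candidate
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    return None
```

The Date: header is only consulted when there is no Received: header
or its date cannot be parsed.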

A wrong Date: header will still show up in the per-message ("permalink")
view, so it can still be used to embarrass people with bad clocks :P
(Of course, downloadable mboxes will continue to show them).

For thread skeleton (index) views in HTML, we use the internal
timestamp for now; but maybe we'll use the Date: like the permalink
view.  Maybe internally there can be two timestamps like git's
author-vs-committer dates.

Xapian index size is reduced, as the "nq:" search field is no longer
redundantly storing information that would be in searchable diff
fields (df* in https://public-inbox.org/git/_/text/help/).
This (along with remembering to run fstrim(8)) seems to have
reduced best-case indexing time to around 3.5 hours for the
2000-2017 dataset I'm using \o/

Eric Wong (Contractor, The Linux Foundation) (34):
  v2writable: delete ::Import obj when ->done
  search: remove informational "warning" message
  searchidx: add PID to error message when die-ing
  content_id: special treatment for Message-Id headers
  evcleanup: disable outside of daemon
  v2writable: deduplicate detection on add
  evcleanup: do not create event loop if nothing was registered
  mid: add `mids' and `references' methods for extraction
  content_id: use `mids' and `references' for MID extraction
  searchidx: use new `references' method for parsing References
  content_id: no need to be human-friendly
  v2writable: inject new Message-IDs on true duplicates
  search: revert to using 'Q' as a uniQue id per-Xapian conventions
  searchidx: support indexing multiple MIDs
  mid: be strict with References, but loose on Message-Id
  searchidx: avoid excessive XNQ indexing with diffs
  searchidxskeleton: add a note about locking
  v2writable: generated Message-ID goes first
  searchidx: use add_boolean_term for internal terms
  searchidx: add NNTP article number as a searchable term
  mid: truncate excessively long MIDs early
  nntp: use NNTP article numbers for lookups
  nntp: fix NEWNEWS command
  searchidx: store the primary MID in doc data for NNTP
  import: consolidate object info for v2 imports
  v2: avoid redundant/repeated configs for git partition repos
  INSTALL: document more optional dependencies
  search: favor skeleton DB for lookup_mail
  search: each_smsg_by_mid uses skeleton if available
  v2writable: remove unnecessary skeleton commit
  favor Received: date over Date: header globally
  import: fall back to Sender for extracting name and email
  scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
  v2writable: detect and use previous partition count

 INSTALL                              |  13 ++
 MANIFEST                             |   2 +
 lib/PublicInbox/ContentId.pm         |  32 +++--
 lib/PublicInbox/Daemon.pm            |   1 +
 lib/PublicInbox/EvCleanup.pm         |   6 +-
 lib/PublicInbox/ExtMsg.pm            |   2 +-
 lib/PublicInbox/Import.pm            |  99 ++++++-------
 lib/PublicInbox/Inbox.pm             |   1 +
 lib/PublicInbox/MID.pm               |  55 +++++++-
 lib/PublicInbox/MsgTime.pm           |  51 +++++++
 lib/PublicInbox/NNTP.pm              |  31 ++---
 lib/PublicInbox/Search.pm            |  70 ++++++++--
 lib/PublicInbox/SearchIdx.pm         | 260 +++++++++++++++++++++--------------
 lib/PublicInbox/SearchIdxPart.pm     |   8 +-
 lib/PublicInbox/SearchIdxSkeleton.pm |  27 +---
 lib/PublicInbox/SearchMsg.pm         |  26 ++--
 lib/PublicInbox/V2Writable.pm        | 166 +++++++++++++++++++---
 lib/PublicInbox/View.pm              |   8 +-
 lib/PublicInbox/WwwAtomStream.pm     |   5 +-
 scripts/import_vger_from_mbox        |  11 +-
 t/content_id.t                       |   5 +-
 t/import.t                           |   9 +-
 t/init.t                             |   2 +
 t/mid.t                              |  22 ++-
 t/nntpd.t                            |   2 +
 t/search-thr-index.t                 |   2 +-
 t/v2writable.t                       | 195 ++++++++++++++++++++++++++
 27 files changed, 842 insertions(+), 269 deletions(-)
 create mode 100644 lib/PublicInbox/MsgTime.pm
 create mode 100644 t/v2writable.t
