Eric Wong <e...@80x24.org> writes:

> While MTAs seem to stop '\0' from appearing in headers, users
> fetching archives via git remain susceptible to having '\0' land
> in archives.  So we'll filter them out of Xapian and SQLite DBs
> to avoid interoperability problems with CLI tools since there's no
> known messages in lore or any of my archives which feature them.
>
> Avoiding '\0' will ensure all indexed Message-IDs and List-Ids
> can be specified from the command-line (although some characters
> will still require $(printf) contortions).
>
> As with Message-ID, List-Id fields with /\n\t\r/ characters will
> also be stripped for indexing.  I will assume whatever went wrong
> with the References: header in
> <https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw>
> could also happen to the List-Id header.
>
> This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4
> (Import: Don't copy nulls from emails into git, 2018-07-07)

That seems reasonable to me.

Eric

> ---
>  lib/PublicInbox/MID.pm       | 2 +-
>  lib/PublicInbox/SearchIdx.pm | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
> index 97cf3a54..36c05855 100644
> --- a/lib/PublicInbox/MID.pm
> +++ b/lib/PublicInbox/MID.pm
> @@ -115,7 +115,7 @@ sub uniq_mids ($;$) {
>  	my @ret;
>  	$seen ||= {};
>  	foreach my $mid (@$mids) {
> -		$mid =~ tr/\n\t\r//d;
> +		$mid =~ tr/\n\t\r\0//d;
>  		if (length($mid) > MAX_MID_SIZE) {
>  			warn "Message-ID: <$mid> too long, truncating\n";
>  			$mid = substr($mid, 0, MAX_MID_SIZE);
> diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
> index 32598b7c..f569428c 100644
> --- a/lib/PublicInbox/SearchIdx.pm
> +++ b/lib/PublicInbox/SearchIdx.pm
> @@ -414,6 +414,7 @@ sub index_list_id ($$$) {
>  	for my $l ($hdr->header_raw('List-Id')) {
>  		$l =~ /<([^>]+)>/ or next;
>  		my $lid = lc $1;
> +		$lid =~ tr/\n\t\r\0//d; # same rules as Message-ID
>  		$doc->add_boolean_term('G' . $lid);
>  		index_phrase($self, $lid, 1, 'XL'); # probabilistic
>  	}
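
For anyone skimming the thread, here is a minimal standalone sketch of what the tr///d in both hunks does to a value. It is only an illustration: strip_unsafe() is a hypothetical helper that does not exist in public-inbox, and the Message-ID and List-Id values are made up. The underlying reason for dropping '\0' is that argv entries are NUL-terminated C strings, so a term containing a NUL cannot be passed intact on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): applies the same
# tr///d deletion the patch uses, on a copy of its argument.
sub strip_unsafe {
	my ($s) = @_;
	$s =~ tr/\n\t\r\0//d; # delete LF, TAB, CR and NUL
	$s;
}

# Made-up Message-ID with an embedded NUL and header-folding residue.
my $mid = "abc\0def\r\n\t\@example.com";
my $clean = strip_unsafe($mid);
print "$clean\n";

# Mirrors the shape of index_list_id: take the <...> part,
# lowercase it, then strip.  The header value is made up.
my $hdr = "Example List <List\0.Example.ORG>";
my $lid;
if ($hdr =~ /<([^>]+)>/) {
	$lid = strip_unsafe(lc $1);
	print "$lid\n";
}
```

Since tr///d only deletes the listed characters, everything else in the value is left untouched, so characters that are merely awkward for the shell still need quoting or $(printf).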