Eric Wong <e...@80x24.org> writes:

> While MTAs seem to stop '\0' from appearing in headers, users
> fetching archives via git remain susceptible to having '\0' land
> in archives.  So we'll filter them out of Xapian and SQLite DBs
> to avoid interoperability problems with CLI tools, since there are no
> known messages in lore or any of my archives which feature them.
>
> Avoiding '\0' will ensure all indexed Message-IDs and List-Ids
> can be specified from the command-line (although some characters
> will still require $(printf) contortions).
>
> As with Message-ID, List-Id fields with /\n\t\r/ characters will
> also be stripped for indexing.  I will assume whatever went wrong
> with the References: header in
> <https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw>
> could also happen to the List-Id header.
>
> This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4
> (Import: Don't copy nulls from emails into git, 2018-07-07)

That seems reasonable to me.

Eric


> ---
>  lib/PublicInbox/MID.pm       | 2 +-
>  lib/PublicInbox/SearchIdx.pm | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
> index 97cf3a54..36c05855 100644
> --- a/lib/PublicInbox/MID.pm
> +++ b/lib/PublicInbox/MID.pm
> @@ -115,7 +115,7 @@ sub uniq_mids ($;$) {
>       my @ret;
>       $seen ||= {};
>       foreach my $mid (@$mids) {
> -             $mid =~ tr/\n\t\r//d;
> +             $mid =~ tr/\n\t\r\0//d;
>               if (length($mid) > MAX_MID_SIZE) {
>                       warn "Message-ID: <$mid> too long, truncating\n";
>                       $mid = substr($mid, 0, MAX_MID_SIZE);
> diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
> index 32598b7c..f569428c 100644
> --- a/lib/PublicInbox/SearchIdx.pm
> +++ b/lib/PublicInbox/SearchIdx.pm
> @@ -414,6 +414,7 @@ sub index_list_id ($$$) {
>       for my $l ($hdr->header_raw('List-Id')) {
>               $l =~ /<([^>]+)>/ or next;
>               my $lid = lc $1;
> +             $lid =~ tr/\n\t\r\0//d; # same rules as Message-ID
>               $doc->add_boolean_term('G' . $lid);
>               index_phrase($self, $lid, 1, 'XL'); # probabilistic
>       }
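
For anyone wanting to check the effect of the two hunks outside of public-inbox, here is a minimal standalone sketch. The helper names strip_unsafe and list_id_term are hypothetical and do not exist in PublicInbox::MID or PublicInbox::SearchIdx; only the tr/// expression and the <...> match are taken from the patch above. The sample header values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): delete the same
# characters the patch strips before indexing.
sub strip_unsafe {
	my ($v) = @_;
	$v =~ tr/\n\t\r\0//d; # same character set as the patched uniq_mids
	return $v;
}

# Hypothetical helper mirroring the List-Id hunk: take the text inside
# <...>, lowercase it, then strip.  Returns undef when there is no <...>.
sub list_id_term {
	my ($raw) = @_;
	$raw =~ /<([^>]+)>/ or return;
	return strip_unsafe(lc $1);
}

# Made-up inputs containing NUL, tab, CR and LF
my $mid = strip_unsafe("abc\0def\t\@example.com\n");
my $lid = list_id_term("Test List <Test.\0Example\r\n\t.ORG>");

print "$mid\n"; # abcdef@example.com
print "$lid\n"; # test.example.org
```

The List-Id value is stripped after lowercasing, in the same order as the patched index_list_id, so the boolean term and the phrase index both see the cleaned string.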
