Re: [PATCH 0/7] indexing attachment contents

Anton Khirnov Sun, 25 Jan 2026 02:29:43 -0800

Quoting David Bremner (2026-01-25 01:48:29)
> Anton Khirnov <[email protected]> writes:
> 
> > Hi,
> > this patchset adds support for indexing the contents of attachments, by
> > piping the payload to an external user-specified program. I have chosen
> > this approach for two main reasons:
> > * converting non-text data into text is inherently subjective, and also
> >   a moving target, so I wanted to avoid hardcoding any conversion policy
> >   inside libnotmuch
> > * parsers are notoriously bug-prone and have been subject to vast
> >   numbers of security issues in the past; since we will index untrusted
> >   data from the internet, is is highly important that we can sandbox the
> >   parsing code
> >
> 
> I sent a few nitpicks while scanning the series.
> 
> Some general comments / questions:
> 
>      - How will we test this code? I guess we can run without sandboxing
>        in the test suite if we curate a collection of safe attachements
>        to index.


Yes, my intent was to use trivial filters like passing through the input
unchanged, or returning a fixed string. I'll see about adding some tests
for the next patchset.

>      - Have you measured the performance impact? It should be possible
>        to run some kind of dummy filter that always returns lorem ipsum
>        or something just for the performance test suite.

Not yet, I'll try to get some numbers for the next version. I'd expect
the raw overhead (i.e. with a trivial filter) to be substantial, since
we're spawning a process and indexing is otherwise very fast.

>      - Patch 5/7 needs review by someone familiar with pipe(3) and
>        associated machinery. I'm not sure if I'm that person, but maybe
>        things will become more clear when I read the details.

It's really quite straightforward. pipe() creates an anonymous pipe and
gives you two FDs corresponding to the read and write ends. The code
makes two such pipes for input and output - the child and the parent
each get the read FD of one pipe and write of the other.

The only parts that could be a little tricky are:
- the nonblocking behaviour, since we do not want to impose any
  behavioural constraints on the filter - it should be allowed to
  interleave reads from its stdin and writes to its stdout in any way it
  wants, including not looking at stdin at all
- in one of many questionable legacy POSIX behaviours, writing to a pipe
  whose other end is closed triggers SIGPIPE rather than returning an
  error; one needs to block this signal in order to get sane behaviour

> > There was some discussion on IRC about reusing Xapian Omega's parsers,
> > but after looking at it I concluded that a substantial amount of work
> > would be needed to separate them out to make them usable for notmuch in
> > a safe manner. If someone volunteers to do that in the future, it should
> > still be possible to extract them as a standalone filter compatible with
> > this patchset.
> 
> I think my idea was more that there would be a library interface to the
> omega external parser wrapper. If that happens we could configure
> "built-in" versus "external" attachment filtering.

Sure, no objections here.

> 
> > Sandboxes are unfortunately highly non-portable - I wrote a sample
> > Firejail profile because that's what I use, but it is Linux-only.
> 
> With my Debian maintainer hat on, is your idea that linux distros would
> install this? In that case I prefer not to add more things to contrib/
> that are really "installed sometimes". Or do we expect users to "do the
> right thing"?

My intent was to install them as part of the documentation and have
users plug them into their setup manually, at least initially.
Eventually, if they got mature enough, we might consider making them
properly installed binaries, and possibly even enable this by default,
but that would require adding support for the various sandboxing
solutions available on different platforms. And there's still the
question of people having their own opinions on how to convert things to
text, so there'd probably need to be a way to override built-in
conversions.

> I feel like putting things in contrib/ into the "official documentation"
> commits us to supporting them long term, which is kindof the opposite of
> what contrib should mean.

But if I don't mention them, people won't find them.

> > The example filtering program is currently quite trivial - it delegates
> > to pdftotext for PDF and elinks or w3m for HTML. We may want to add more
> > sophistication, handling *office files, archives, autodetecting faulty
> > mimetypes, etc. I can do some of that and add tests if the basic
> > approach is approved.
> 
> Although I haven't reviewed the main new code in detail, the overall
> approach seems OK.

Thank you for the review. I'll try to send a new patchset version with
tests within a couple weeks.

Cheers,
-- 
Anton Khirnov
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]

Re: [PATCH 0/7] indexing attachment contents

Reply via email to