Anton Khirnov <[email protected]> writes:

> Hi,
> this patchset adds support for indexing the contents of attachments, by
> piping the payload to an external user-specified program. I have chosen
> this approach for two main reasons:
> * converting non-text data into text is inherently subjective, and also
>   a moving target, so I wanted to avoid hardcoding any conversion policy
>   inside libnotmuch
> * parsers are notoriously bug-prone and have been subject to vast
>   numbers of security issues in the past; since we will index untrusted
>   data from the internet, is is highly important that we can sandbox the
>   parsing code
>

I sent a few nitpicks while scanning the series.

Some general comments / questions:

     - How will we test this code? I guess we can run without sandboxing
       in the test suite if we curate a collection of safe attachements
       to index.

     - Have you measured the performance impact? It should be possible
       to run some kind of dummy filter that always returns lorem ipsum
       or something just for the performance test suite.

     - Patch 5/7 needs review by someone familiar with pipe(3) and
       associated machinery. I'm not sure if I'm that person, but maybe
       things will become more clear when I read the details.

> There was some discussion on IRC about reusing Xapian Omega's parsers,
> but after looking at it I concluded that a substantial amount of work
> would be needed to separate them out to make them usable for notmuch in
> a safe manner. If someone volunteers to do that in the future, it should
> still be possible to extract them as a standalone filter compatible with
> this patchset.

I think my idea was more that there would be a library interface to the
omega external parser wrapper. If that happens we could configure
"built-in" versus "external" attachment filtering.

> Sandboxes are unfortunately highly non-portable - I wrote a sample
> Firejail profile because that's what I use, but it is Linux-only.

With my Debian maintainer hat on, is your idea that linux distros would
install this? In that case I prefer not to add more things to contrib/
that are really "installed sometimes". Or do we expect users to "do the
right thing"?

I feel like putting things in contrib/ into the "official documentation"
commits us to supporting them long term, which is kindof the opposite of
what contrib should mean.

> The example filtering program is currently quite trivial - it delegates
> to pdftotext for PDF and elinks or w3m for HTML. We may want to add more
> sophistication, handling *office files, archives, autodetecting faulty
> mimetypes, etc. I can do some of that and add tests if the basic
> approach is approved.

Although I haven't reviewed the main new code in detail, the overall
approach seems OK.
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]

Reply via email to