Anton Khirnov <[email protected]> writes:
> Hi,
> this patchset adds support for indexing the contents of attachments, by
> piping the payload to an external user-specified program. I have chosen
> this approach for two main reasons:
> * converting non-text data into text is inherently subjective, and also
> a moving target, so I wanted to avoid hardcoding any conversion policy
> inside libnotmuch
> * parsers are notoriously bug-prone and have been subject to vast
> numbers of security issues in the past; since we will index untrusted
> data from the internet, is is highly important that we can sandbox the
> parsing code
>
I sent a few nitpicks while scanning the series.
Some general comments / questions:
- How will we test this code? I guess we can run without sandboxing
in the test suite if we curate a collection of safe attachements
to index.
- Have you measured the performance impact? It should be possible
to run some kind of dummy filter that always returns lorem ipsum
or something just for the performance test suite.
- Patch 5/7 needs review by someone familiar with pipe(3) and
associated machinery. I'm not sure if I'm that person, but maybe
things will become more clear when I read the details.
> There was some discussion on IRC about reusing Xapian Omega's parsers,
> but after looking at it I concluded that a substantial amount of work
> would be needed to separate them out to make them usable for notmuch in
> a safe manner. If someone volunteers to do that in the future, it should
> still be possible to extract them as a standalone filter compatible with
> this patchset.
I think my idea was more that there would be a library interface to the
omega external parser wrapper. If that happens we could configure
"built-in" versus "external" attachment filtering.
> Sandboxes are unfortunately highly non-portable - I wrote a sample
> Firejail profile because that's what I use, but it is Linux-only.
With my Debian maintainer hat on, is your idea that linux distros would
install this? In that case I prefer not to add more things to contrib/
that are really "installed sometimes". Or do we expect users to "do the
right thing"?
I feel like putting things in contrib/ into the "official documentation"
commits us to supporting them long term, which is kindof the opposite of
what contrib should mean.
> The example filtering program is currently quite trivial - it delegates
> to pdftotext for PDF and elinks or w3m for HTML. We may want to add more
> sophistication, handling *office files, archives, autodetecting faulty
> mimetypes, etc. I can do some of that and add tests if the basic
> approach is approved.
Although I haven't reviewed the main new code in detail, the overall
approach seems OK.
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]