Yup. It might be nice to have a separate internal database of file -> mimetype, and then just specify mimetypes?
Also, it does not download and save 'in-reply-to' headers. Without these you cannot reconstruct mail threads.

Tika also has a mailbox file parser. It would be great if both created the exact same document. I don't know if they do now.

On Thu, Nov 18, 2010 at 8:46 AM, Peter Sturge <[email protected]> wrote:
> Hi Solr folks,
>
> I admit I'm new to DIH, so I thought I'd put this out there before
> generating a Jira issue:
>
> I've been doing some work with importing emails using the truly
> fabulous MailEntityProcessor. Fantastic!
>
> I have noticed, however, that in order to retrieve email content into
> the Solr 'content' field, the <entity> processAttachement="true"
> property attribute must be set.
> While, strictly speaking in the mime world, the content is a body
> part, I'm sure I'm not the only one with a use case of wanting to have
> the content, but not [necessarily] the attachments.
>
> The MailEntityProcessor.java code has the content processing *after*
> the check for the processAttachement="true".
>
> What I propose is this:
>
> 1. Add a new [optional] boolean property called: includeContent. If
> 'true' the content field would be populated with the (non-attachment)
> content of the message. If 'false', the content is not included.
> 'processAttachement' would behave the same as it does now, but only
> for attachments, not text content. I would propose that
> includeContent="true" be the default behaviour.
> 2. Add an additional property attribute called 'processAttachments'
> that is a synonym for the mis-spelled and singular
> 'processAttachement'. processAttachement would remain for bwd compat.
> 3. It could be nice to have a built-in 'attachmentsPassthrough' and/or
> 'attachmentsFilter' attribute so that only matching attachment
> filenames would be processed (e.g.
> attachmentsPassthrough="*.gz,*.xls,*.pdf,*.txt"
> attachmentsFilter="*.gif,*.jpg,*.png").
> Tika can spend a fair amount of time churning through attachments,
> and if for example, there's a lot of graphics files attached, it would
> be more efficient to simply skip them if configured to do so.
> Be good to hear others' thoughts on this one
>
> Comments, thoughts, please?
>
> Thanks,
> Peter
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

--
Lance Norskog
[email protected]
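
For what it's worth, the filename matching proposed in item 3 could be sketched roughly as below. This is only an illustration, not code from MailEntityProcessor: the class name AttachmentNameFilter and the method shouldProcess are made up, the 'attachmentsPassthrough' and 'attachmentsFilter' attributes are still only a proposal, and the semantics (case-insensitive matching, filter wins over passthrough, an empty passthrough list allows everything) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether an attachment should be handed to Tika,
// based on comma-separated glob lists such as "*.gz,*.xls,*.pdf,*.txt".
final class AttachmentNameFilter {
    private final List<Pattern> passthrough;
    private final List<Pattern> filter;

    AttachmentNameFilter(String passthroughGlobs, String filterGlobs) {
        this.passthrough = compile(passthroughGlobs);
        this.filter = compile(filterGlobs);
    }

    // Split a comma-separated glob list; null or blank yields an empty list.
    private static List<Pattern> compile(String globs) {
        List<Pattern> patterns = new ArrayList<>();
        if (globs == null) {
            return patterns;
        }
        for (String glob : globs.split(",")) {
            String trimmed = glob.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(globToPattern(trimmed));
            }
        }
        return patterns;
    }

    // Minimal glob support: '*' matches any run of characters, '?' matches one.
    // Everything else is treated literally.
    private static Pattern globToPattern(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Assumption: attachment names are matched case-insensitively.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    private static boolean matchesAny(List<Pattern> patterns, String name) {
        for (Pattern p : patterns) {
            if (p.matcher(name).matches()) {
                return true;
            }
        }
        return false;
    }

    // Assumed semantics: a filter match always skips the attachment; otherwise
    // the name must match the passthrough list, unless that list is empty.
    boolean shouldProcess(String filename) {
        String name = (filename == null) ? "" : filename;
        if (matchesAny(filter, name)) {
            return false;
        }
        return passthrough.isEmpty() || matchesAny(passthrough, name);
    }
}
```

With the example values from the proposal, new AttachmentNameFilter("*.gz,*.xls,*.pdf,*.txt", "*.gif,*.jpg,*.png") would accept "report.pdf" and skip "logo.png". It would also skip "notes.doc", since that name is not on the passthrough list.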
