[
https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Potter updated SOLR-2245:
---------------------------------
Attachment: SOLR-2245.patch
Here's an updated patch that's close to being ready for commit. I've changed
a few things in the implementation, but I believe it still meets the
spirit of Peter's original work. Mainly, this patch removes support for the
delta-import command and instead only does full-import with support for using
the last_index_time from the previous run as the value for the fetchMailsSince
filter.
The delta-import stuff is really for importing updates to existing rows and the
MailEntityProcessor was sort of hijacking that behavior. More to the point, I
couldn't get the DocBuilder#collectDelta code to work with the rows generated
by the MailEntityProcessor#nextModifiedRowKey. Put simply, nextModifiedRowKey
was returning new mails that occurred after the fetchMailsSince date filter and
the DocBuilder was processing them like they were updates to pre-existing rows.
Thus, I felt it was better to just support full-import and then have the code set
the fetchMailsSince filter based on the last_index_time set by the DIH
framework, which gets persisted in dataimport.properties. Of course if that
property is not set, then the code falls back to fetchMailsSince from the
config.
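As a rough illustration of that fallback order (not the code in the patch), the
sketch below prefers a persisted last_index_time and otherwise uses the
configured fetchMailsSince value. The class name, method name, and the
"yyyy-MM-dd HH:mm:ss" date pattern are assumptions made for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

// Hypothetical helper, not part of the patch: illustrates the fallback order only.
final class FetchSinceSketch {
    // Assumed date pattern for both the persisted and the configured value.
    static final String DATE_PATTERN = "yyyy-MM-dd HH:mm:ss";

    /**
     * Returns the date to use as the fetchMailsSince filter:
     * the persisted last_index_time if present, else the configured value,
     * else null (meaning no date filter).
     */
    static Date resolveFetchMailsSince(Properties persisted, String configuredSince) {
        String last = persisted.getProperty("last_index_time");
        if (last != null && !last.trim().isEmpty()) {
            return parse(last.trim());
        }
        if (configuredSince != null && !configuredSince.trim().isEmpty()) {
            return parse(configuredSince.trim());
        }
        return null;
    }

    private static Date parse(String value) {
        try {
            return new SimpleDateFormat(DATE_PATTERN).parse(value);
        } catch (ParseException e) {
            // Surface a bad date as an unchecked error to keep the sketch small.
            throw new IllegalArgumentException("Unparseable date: " + value, e);
        }
    }
}
```

With an empty Properties object this returns the configured date; once
last_index_time is present, that value takes precedence.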
> MailEntityProcessor Update
> --------------------------
>
> Key: SOLR-2245
> URL: https://issues.apache.org/jira/browse/SOLR-2245
> Project: Solr
> Issue Type: Improvement
> Components: contrib - DataImportHandler
> Affects Versions: 1.4, 1.4.1
> Reporter: Peter Sturge
> Assignee: Timothy Potter
> Priority: Minor
> Fix For: 4.9, 5.0
>
> Attachments: SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.patch,
> SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.zip
>
>
> This patch addresses a number of issues in the MailEntityProcessor
> contrib-extras module.
> The changes are outlined here:
> * Added an 'includeContent' entity attribute to allow specifying content to
> be included independently of processing attachments
> e.g. <entity includeContent="true" processAttachments="false" . . . />
> would include message content, but not attachment content
> * Added a 'processAttachments' property as a synonym for the
> mis-spelled (and singular) 'processAttachement' property. It
> functions the same as processAttachement. Default= 'true' - if either is
> false, then attachments are not processed. Note that only one of these should
> really be specified in a given <entity> tag.
> * Added a FLAGS.NONE value, so that if an email has no flags (i.e. it is
> unread, not deleted etc.), there is still a property value stored in the
> 'flags' field (the value is the string "none")
> Note: there is a potential backward compat issue with FLAGS.NONE for clients
> that expect the absence of the 'flags' field to mean 'Not read'. I'm
> calculating this would be extremely rare, and is inadvisable in any case as
> user flags can be arbitrarily set, so fixing it up now will ensure future
> client access will be consistent.
> * The folder name of an email is now included as a field called 'folder'
> (e.g. folder=INBOX.Sent). This is quite handy in search/post-indexing
> processing
> * The addPartToDocument() method that processes attachments is significantly
> re-written, as there looked to be no real way the existing code would ever
> actually process attachment content and add it to the row data
> Tested on the 3.x trunk with a number of popular imap servers.
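
For readers skimming the description above, the processAttachments /
processAttachement rule (default true, attachments skipped if either attribute
is false) could be sketched roughly as follows. This is only an illustration
under the assumption that entity attributes arrive as a string map; the class
and method names are hypothetical and not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch: shows how the two attribute
// spellings could be combined into a single decision.
final class AttachmentFlagSketch {

    /** Attachments are processed unless either spelling is explicitly "false". */
    static boolean shouldProcessAttachments(Map<String, String> entityAttrs) {
        return isTrueOrAbsent(entityAttrs.get("processAttachments"))
            && isTrueOrAbsent(entityAttrs.get("processAttachement"));
    }

    // A missing attribute counts as the default 'true'.
    private static boolean isTrueOrAbsent(String value) {
        return value == null || Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        System.out.println(shouldProcessAttachments(attrs)); // neither attribute set
        attrs.put("processAttachement", "false");
        System.out.println(shouldProcessAttachments(attrs)); // legacy spelling set to false
    }
}
```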