[ 
https://issues.apache.org/jira/browse/SOLR-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916125#comment-13916125
 ] 

Hoss Man commented on SOLR-5795:
--------------------------------

Here's the basic design i've been fleshing out in my head...

* A new "{{ExpireDocsUpdateProcessorFactory}}"
** can compute an {{expiration}} field to add to indexed docs based on a 
"{{ttl}}" field in the input doc
*** perhaps could also fall back to a {{ttl}} update request param when bulk 
adding, similar to {{\_version\_}} ?
*** {{IgnoreFieldUpdateProcessorFactory}} could be used to remove the {{ttl}} 
if users don't want a record in the index of when/why {{expiration}} was 
computed
** Can trigger periodic {{deleteByQuery}} on {{expiration}} time field
* rough idea for configuration...{code}
<processor class="solr.ExpireDocsUpdateProcessorFactory">
  <!-- mandatory, must be a date based field in schema.xml -->
  <str name="expiration.fieldName">expire_at</str>
  <!-- optional, default is not to auto-expire docs -->
  <int name="deleteIntervalInSeconds">300</int>
  <!-- optional, default is not to compute expiration automatically.
       If this field doesn't exist in the schema,
       IgnoreFieldUpdateProcessorFactory can be configured to remove it.
    -->
  <str name="ttl.fieldName">ttl</str>
</processor>
{code}
* {{ExpireDocsUpdateProcessorFactory.init()}} logic:
** if {{ttl.fieldName}} is specified make a note of it
** validate {{expiration.fieldName}} is set & exists in schema
*** perhaps in managed schema mode create automatically if it doesn't?
** if {{deleteIntervalInSeconds}} is set:
*** spin up a {{ScheduledThreadPoolExecutor}} that runs a recurring 
{{AutoExpireDocsCallable}}
*** add a core shutdown hook to shut down the executor when the core shuts down
* {{ExpireDocsUpdateProcessor.processAdd()}} logic:
** if {{ttl.fieldName}} is configured & doc contains that field name:
*** treat value as datemath from NOW and put computed value in 
{{expiration.fieldName}}
** else: No-Op
* {{AutoExpireDocsCallable}} logic:
** if cloud mode, return No-Op unless we are running on the overseer
** Create a {{DeleteUpdateCommand}} using {{deleteByQuery}} of {{\[* TO NOW\]}} 
using the {{expiration.fieldName}}
*** this can be fired directly against the {{UpdateRequestProcessor}} returned 
by the {{ExpireDocsUpdateProcessorFactory}} itself using a 
{{LocalSolrQueryRequest}}
**** Or perhaps we make an optional configuration so you can specify any chain 
name and we fetch it from the SolrCore?
*** the existing distributed delete logic should ensure it gets distributed 
cleanly in cloud mode
*** NOTE: the executor should run on every node, and only do the overseer check 
when the executor fires, so even when the overseer changes periodically, 
whoever the current overseer is every X minutes will fire the delete.
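
A rough stand-alone sketch of the executor and overseer-check part of the list above, in plain Java with no Solr classes. Everything named here is hypothetical: {{AutoExpireSketch}}, {{fireOnce}}, the {{isOverseer}} supplier and the {{deleteByQuery}} consumer are placeholders for whatever the real processor would use to ask the cluster state and to fire the delete down the update chain.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Sketch only: stands in for the proposed AutoExpireDocsCallable. */
class AutoExpireSketch {

    /** Query string matching every doc whose expiration is at or before NOW. */
    static String expiredQuery(String expirationField) {
        return expirationField + ":[* TO NOW]";
    }

    /**
     * One firing of the periodic task. The overseer check happens here, at
     * fire time, so whichever node is overseer right now sends the delete.
     * isOverseer and deleteByQuery are assumed hooks, not Solr APIs.
     * Returns true if a delete was dispatched.
     */
    static boolean fireOnce(boolean cloudMode, BooleanSupplier isOverseer,
                            String expirationField, Consumer<String> deleteByQuery) {
        if (cloudMode && !isOverseer.getAsBoolean()) {
            return false; // no-op on every node except the overseer
        }
        deleteByQuery.accept(expiredQuery(expirationField));
        return true;
    }

    /** Started on every node; the caller is responsible for shutting it down. */
    static ScheduledThreadPoolExecutor start(long intervalSeconds, Runnable task) {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        executor.scheduleAtFixedRate(task, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        return executor;
    }

    public static void main(String[] args) {
        Runnable task = () -> fireOnce(false, () -> false, "expire_at",
                q -> System.out.println("deleteByQuery " + q));
        ScheduledThreadPoolExecutor executor = start(300, task);
        task.run();             // fire once by hand for demonstration
        executor.shutdownNow(); // what the core shutdown hook would do
    }
}
```

Doing the check inside {{fireOnce}} rather than when the executor is created is what lets the overseer move between nodes without any rescheduling.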

This, combined with things like {{DefaultValueUpdateProcessorFactory}}, 
{{IgnoreFieldUpdateProcessorFactory}} and 
{{FirstFieldValueUpdateProcessorFactory}} on the {{ttl.fieldName}} and/or 
{{expiration.fieldName}}, should allow a variety of use cases:

* every doc expires after X amount of time no matter what the client says
* every doc defaults to a ttl of X unless it has an explicit per-doc ttl
* every doc defaults to a ttl of X unless it has an explicit per-doc expire date
* docs can optionally expire after a ttl specified when they were indexed
* docs can optionally expire at an explicit time specified when they were indexed
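
To make the "default ttl unless the doc has its own" case concrete, here is a small sketch in plain Java of how the two steps might compose. It is not Solr code: {{ExpirationSketch}}, {{applyDefault}} and {{parseTtl}} are made-up names, a {{Map}} stands in for the input document, and the ttl parser only understands a single {{+<n><UNIT>}} term, which is a deliberate simplification of real date math.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: how ttl and expiration fields might be resolved per doc. */
class ExpirationSketch {

    // Simplified stand-in for date math: one "+<n><UNIT>" term, nothing else.
    private static final Pattern TTL =
            Pattern.compile("\\+(\\d+)(SECOND|MINUTE|HOUR|DAY)S?");

    static Duration parseTtl(String ttl) {
        Matcher m = TTL.matcher(ttl.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported ttl: " + ttl);
        }
        long n = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "SECOND": return Duration.ofSeconds(n);
            case "MINUTE": return Duration.ofMinutes(n);
            case "HOUR":   return Duration.ofHours(n);
            default:       return Duration.ofDays(n); // only DAY is left
        }
    }

    /** Mimics a default-value processor placed earlier in the chain. */
    static void applyDefault(Map<String, Object> doc, String field, Object value) {
        doc.putIfAbsent(field, value);
    }

    /** Mirrors the proposed processAdd(): ttl present means compute, else no-op. */
    static void processAdd(Map<String, Object> doc, String ttlField,
                           String expirationField, Instant now) {
        Object ttl = doc.get(ttlField);
        if (ttl != null) {
            doc.put(expirationField, now.plus(parseTtl(ttl.toString())));
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2014-02-28T00:00:00Z");

        Map<String, Object> plain = new HashMap<>();
        plain.put("id", "a");                 // no ttl from the client
        Map<String, Object> explicit = new HashMap<>();
        explicit.put("id", "b");
        explicit.put("ttl", "+5MINUTES");     // client-supplied ttl wins

        for (Map<String, Object> doc : new Map[] {plain, explicit}) {
            applyDefault(doc, "ttl", "+1DAY");
            processAdd(doc, "ttl", "expire_at", now);
            System.out.println(doc.get("id") + " expires at " + doc.get("expire_at"));
        }
    }
}
```

The other use cases would come from swapping what runs before and after the ttl step (forcing the ttl, dropping it afterwards, or keeping only the first expiration value), which this sketch does not attempt to model.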


> Option to periodically delete docs based on an expiration field -- or ttl 
> specified when indexed.
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5795
>                 URL: https://issues.apache.org/jira/browse/SOLR-5795
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>
> A question I get periodically from people is how to automatically remove 
> documents from a collection at a certain time (or after a certain amount of 
> time).  
> Excluding from search results using a filter query on a date field is 
> trivial, but you still have to periodically send a deleteByQuery to clean up 
> those older "expired" documents.  And in the case where you want all 
> documents to auto-expire some fixed amount of time after they were indexed, 
> you still have to set up a simple UpdateProcessor to set that expiration date.  
> So i've been thinking it would be nice if there was a simple way to configure 
> solr to do it all for you.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
