[jira] [Updated] (TIKA-3835) tika pipes parse cache - avoid re-parsing content that has not changed

Nicholas DiPiazza (Jira) Thu, 11 Aug 2022 10:37:04 -0700


     [ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nicholas DiPiazza updated TIKA-3835:
------------------------------------
    Description: 
Tika pipes should have an optional configuration to archive parsed results.

The process would be:
 * The pipe iterator has the next document: {lastUpdated, docID}
 * If the parse cache is enabled and contains {lastUpdated, docID}
 ** Emit the cached document to the emit queue and return.
 * Parse the document.
 * If the parse cache is enabled, put the result into the cache with 
key={lastUpdated, docID}, value={document, metadata}
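
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: ParseCacheFlowSketch, CacheKey, ParsedResult, process, emitQueue and parseCount are hypothetical names, an in-memory map stands in for the cache, and emitting the freshly parsed result after a cache miss is assumed rather than stated in the description. It uses Java records, so it needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: every name here is hypothetical, not an existing Tika class.
final class ParseCacheFlowSketch {

    // Cache key from the description: {lastUpdated, docID}.
    record CacheKey(long lastUpdated, String docId) {}

    // Cached value from the description: {document, metadata}.
    record ParsedResult(String document, Map<String, String> metadata) {}

    private final boolean cacheEnabled;
    private final Function<String, ParsedResult> parser; // stand-in for the real parse step
    private final Map<CacheKey, ParsedResult> cache = new HashMap<>();

    final List<ParsedResult> emitQueue = new ArrayList<>(); // stand-in for the emit queue
    int parseCount = 0; // counts real parses so cache hits are observable

    ParseCacheFlowSketch(boolean cacheEnabled, Function<String, ParsedResult> parser) {
        this.cacheEnabled = cacheEnabled;
        this.parser = parser;
    }

    // Handles one {lastUpdated, docID} pair handed over by the pipe iterator.
    void process(long lastUpdated, String docId) {
        CacheKey key = new CacheKey(lastUpdated, docId);
        if (cacheEnabled) {
            ParsedResult cached = cache.get(key);
            if (cached != null) {
                emitQueue.add(cached); // cache hit: emit and return without parsing
                return;
            }
        }
        ParsedResult parsed = parser.apply(docId); // cache miss: parse the document
        parseCount++;
        if (cacheEnabled) {
            cache.put(key, parsed); // store under {lastUpdated, docID}
        }
        emitQueue.add(parsed); // assumption: a fresh parse is emitted as usual
    }
}
```

With the cache enabled, processing the same {lastUpdated, docID} twice parses once and emits twice. A changed lastUpdated produces a different key, so the document is parsed again.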

This will drastically improve full crawl times for customers, especially those 
using cloud file services with strict rate limits.

The parse cache should be defined as an interface so that users can choose 
among several implementations, such as:
 * File cache
 * S3 cache
 * Others
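
One possible shape for such an interface, with a file-backed implementation, is sketched below. Everything in it is an assumption made for illustration: ParseCache, FileParseCache, the method signatures and the on-disk file naming are hypothetical, and the cached {document, metadata} value is simplified to a single pre-serialized string (for example JSON).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical interface; not an existing Tika API.
interface ParseCache {
    // Returns the cached result for {lastUpdated, docID}, if present.
    Optional<String> get(long lastUpdated, String docId);

    // Stores a serialized result under {lastUpdated, docID}.
    void put(long lastUpdated, String docId, String serializedResult);
}

// File cache: one file per {lastUpdated, docID} key inside a directory.
final class FileParseCache implements ParseCache {
    private final Path dir;

    FileParseCache(Path dir) {
        try {
            this.dir = Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hex-encodes the docID so that arbitrary ids are safe as file names.
    private Path fileFor(long lastUpdated, String docId) {
        StringBuilder hex = new StringBuilder();
        for (byte b : docId.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return dir.resolve(hex + "." + lastUpdated + ".cache");
    }

    @Override
    public Optional<String> get(long lastUpdated, String docId) {
        Path file = fileFor(lastUpdated, docId);
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        try {
            return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void put(long lastUpdated, String docId, String serializedResult) {
        try {
            Files.write(fileFor(lastUpdated, docId),
                    serializedResult.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An S3-backed variant could implement the same two methods using object keys instead of file paths, so the pipeline code would only depend on the interface.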

  was:
Tika pipes should have an optional configuration to archive parsed results.

The process would be:
 * The pipe iterator has the next document: {lastUpdated, docID}
 * If the parse cache is enabled and contains {lastUpdated, docID}
 ** Emit the cached document to the emit queue and return.
 * Parse the document.
 * If the parse cache is enabled, put the result into the cache with 
key={lastUpdated, docID}, value={document, metadata}

This will drastically improve full crawl times for customers, especially those 
using cloud file services with strict rate limits.


> tika pipes parse cache - avoid re-parsing content that has not changed
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3835
>                 URL: https://issues.apache.org/jira/browse/TIKA-3835
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 2.2.0
>            Reporter: Nicholas DiPiazza
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
