[
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578581#comment-17578581
]
Tim Allison commented on TIKA-3835:
-----------------------------------
Are you proposing storing extracted content in memory in the pipesiterator?
I'd worry about storing even a set of {lastUpdated,docId} unless we bound the
cache.
Or would the cache write to disk?
In general, I'd want the pipesiterator to have the logic to know which files
it actually has to process. With the JDBC one, for example, you can modify the
query to select the documents with a modified date below a threshold or
something similar. This is what you did with the Solr pipes iterator.
Apologies...I'm likely misunderstanding this proposal!
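The query-side filtering described above can be sketched as follows. This is a hedged illustration only: the class, table, and column names are hypothetical, and a real JDBC pipes iterator would bind the threshold with a PreparedStatement rather than concatenating it.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: the pipes iterator only hands out documents
// modified after the last successful crawl, so unchanged files are
// never fetched or re-parsed. With a JDBC-backed iterator this is
// just a WHERE clause on the select.
public class IncrementalSelect {

    // The kind of query being described: push the freshness check
    // into SQL. (Illustrative only; use a PreparedStatement in real code.)
    static String buildQuery(Instant lastCrawl) {
        return "SELECT doc_id, last_updated, path FROM documents "
             + "WHERE last_updated > '" + lastCrawl + "'";
    }

    record Doc(String docId, Instant lastUpdated) {}

    // Equivalent in-memory filter, for illustration only.
    static List<Doc> newerThan(List<Doc> docs, Instant lastCrawl) {
        return docs.stream()
                   .filter(d -> d.lastUpdated().isAfter(lastCrawl))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant lastCrawl = Instant.parse("2022-08-01T00:00:00Z");
        List<Doc> docs = List.of(
            new Doc("a", Instant.parse("2022-07-15T00:00:00Z")),
            new Doc("b", Instant.parse("2022-08-02T12:00:00Z")));
        List<Doc> toProcess = newerThan(docs, lastCrawl);
        System.out.println(toProcess.size()); // prints 1: only "b" changed
    }
}
```

The point is that the freshness decision happens in the iterator's query, so the system never holds a growing cache of already-seen documents in memory.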
> tika pipes parse cache - avoid re-parsing content that has not changed
> ----------------------------------------------------------------------
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 2.2.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> So the process would be:
> * pipe iterator has the next document: {lastUpdated, docId}
> * if the parse cache is enabled and contains {lastUpdated, docId}
> ** Emit the document to the emit queue and return.
> * Parse the document
> * If the parse cache is enabled, put into the cache: key={lastUpdated, docId},
> value={document, metadata}
> This will drastically improve full crawl times for customers, especially
> those using cloud file services with strict rate limits.
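The cache flow proposed in the quoted steps could be sketched roughly as below. All class and method names here are hypothetical; nothing like this ships in tika-pipes, and the in-memory map stands in for whatever store the proposal settles on.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the proposed parse cache, keyed by the
// {docId, lastUpdated} pair from the issue. Names are hypothetical.
public class ParseCacheSketch {

    record CacheKey(String docId, long lastUpdated) {}
    record ParsedResult(String content, Map<String, String> metadata) {}

    private final Map<CacheKey, ParsedResult> cache = new ConcurrentHashMap<>();

    // Stand-in for a real parse; in Tika this would invoke the parser.
    private ParsedResult parse(String docId) {
        return new ParsedResult("parsed:" + docId, Map.of("source", docId));
    }

    // The flow from the issue: if the cache already holds a result for
    // this (docId, lastUpdated) pair, skip parsing and reuse it;
    // otherwise parse and cache the result before emitting.
    public ParsedResult fetchOrParse(String docId, long lastUpdated) {
        CacheKey key = new CacheKey(docId, lastUpdated);
        return cache.computeIfAbsent(key, k -> parse(k.docId()));
    }
}
```

Per the bounding concern raised in the comment above, a production version of this would need either an eviction policy (e.g. LRU with a maximum size) or a disk-backed store rather than an unbounded in-memory map.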
--
This message was sent by Atlassian Jira
(v8.20.10#820010)