[
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578581#comment-17578581
]
Tim Allison commented on TIKA-3835:
-----------------------------------
Are you proposing storing extracted content in memory in the pipesiterator?
I'd worry about storing even a set of {lastUpdated,docId} unless we bound the
cache.
Or would the cache write to disk?
In general, I'd want the pipesiterator to have the logic to know which files
it actually has to process. With the JDBC one, for example, you can modify the
query to select the documents with a modified date below a threshold or
something similar. This is what you did with the Solr pipes iterator.
Apologies...I'm likely misunderstanding this proposal!
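The query-side filtering described above can be sketched as follows. This is a hedged illustration only: the class, table, and column names are hypothetical, and a real JDBC pipes iterator would bind the threshold with a PreparedStatement rather than concatenating it.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: the pipes iterator only hands out documents
// modified after the last successful crawl, so unchanged files are
// never fetched or re-parsed. With a JDBC-backed iterator this is
// just a WHERE clause on the select.
public class IncrementalSelect {

    // The kind of query being described: push the freshness check
    // into SQL. (Illustrative only; use a PreparedStatement in real code.)
    static String buildQuery(Instant lastCrawl) {
        return "SELECT doc_id, last_updated, path FROM documents "
             + "WHERE last_updated > '" + lastCrawl + "'";
    }

    record Doc(String docId, Instant lastUpdated) {}

    // Equivalent in-memory filter, for illustration only.
    static List<Doc> newerThan(List<Doc> docs, Instant lastCrawl) {
        return docs.stream()
                   .filter(d -> d.lastUpdated().isAfter(lastCrawl))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant lastCrawl = Instant.parse("2022-08-01T00:00:00Z");
        List<Doc> docs = List.of(
            new Doc("a", Instant.parse("2022-07-15T00:00:00Z")),
            new Doc("b", Instant.parse("2022-08-02T12:00:00Z")));
        List<Doc> toProcess = newerThan(docs, lastCrawl);
        System.out.println(toProcess.size()); // prints 1: only "b" changed
    }
}
```

The point is that the freshness decision happens in the iterator's query, so the system never holds a growing cache of already-seen documents in memory.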
> tika pipes parse cache - avoid re-parsing content that has not changed
> ----------------------------------------------------------------------
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 2.2.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> So the process would be:
> * pipe iterator has the next document: {lastUpdated, docId}
> * if the parse cache is enabled and contains {lastUpdated, docId}
> ** Emit the document to the emit queue and return.
> * Parse the document
> * If the parse cache is enabled, put into the cache: key={lastUpdated, docId},
> value={document, metadata}
> This will drastically improve full crawl times for customers, especially
> those using cloud file services with strict rate limits.
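The cache flow proposed in the quoted steps could be sketched roughly as below. All class and method names here are hypothetical; nothing like this ships in tika-pipes, and the in-memory map stands in for whatever store the proposal settles on.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the proposed parse cache, keyed by the
// {docId, lastUpdated} pair from the issue. Names are hypothetical.
public class ParseCacheSketch {

    record CacheKey(String docId, long lastUpdated) {}
    record ParsedResult(String content, Map<String, String> metadata) {}

    private final Map<CacheKey, ParsedResult> cache = new ConcurrentHashMap<>();

    // Stand-in for a real parse; in Tika this would invoke the parser.
    private ParsedResult parse(String docId) {
        return new ParsedResult("parsed:" + docId, Map.of("source", docId));
    }

    // The flow from the issue: if the cache already holds a result for
    // this (docId, lastUpdated) pair, skip parsing and reuse it;
    // otherwise parse and cache the result before emitting.
    public ParsedResult fetchOrParse(String docId, long lastUpdated) {
        CacheKey key = new CacheKey(docId, lastUpdated);
        return cache.computeIfAbsent(key, k -> parse(k.docId()));
    }
}
```

Per the bounding concern raised in the comment above, a production version of this would need either an eviction policy (e.g. LRU with a maximum size) or a disk-backed store rather than an unbounded in-memory map.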
--
This message was sent by Atlassian Jira
(v8.20.10#820010)