[
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601463#comment-17601463
]
Nicholas DiPiazza commented on TIKA-3835:
-----------------------------------------
Yeah, I'm quickly realizing that in my case, because I already have Solr, it's
better to store the parsed output in Solr than in S3, though the S3 option is
still useful. So an interface that supports caching wherever you want is the
right approach.
> tika pipes parse cache - avoid re-parsing content that has not changed
> ----------------------------------------------------------------------
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 2.2.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Tika pipes should have an optional configuration to archive parsed results.
> If the exact same version of a document has already been parsed, the parsed
> output can be pulled from a "parse cache" instead of repeating the
> fetch+parse.
> In other words, skip the fetch+parse when it was already done previously.
> Benefits of this:
> * When the tika pipes fetcher is using a cloud service, documents are often
> heavily rate limited. So if you manage to fetch and parse a document, storing
> the result for future use is very important.
> * Multi-tier environments can be populated faster. Example: you are pulling
> data from an app in dev, staging, and production. When you run the tika pipes
> job, it parses each document once. All the other environments can then
> re-use the parsed output, saving days of run time (in my case).
> ** In other words, "full crawls" for your initial Tika index on duplicate
> environments are reduced to cache lookups.
> So the process would be:
> * The pipe iterator has the next document: \{lastUpdated,docID}
> ** Pipe iterator documents have an optional boolean field *cache*,
> default=true. If cache=false, this doc will not be cached.
> * If the parse cache is enabled, the *cache* field != false, and the parse
> cache contains \{lastUpdated,docID}:
> ** Get the \{lastUpdated,docID} document from the cache, push it to the emit
> queue, and return.
> * Otherwise, parse the document.
> * If the parse cache is enabled and the *cache* field != false, put into the
> cache: key=\{lastUpdated,docID}, value=\{document,metadata}
> ** Additional conditions, such as numBytesInBody, can dictate which documents
> we store in the cache and which we skip.
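The steps above can be sketched in Java. This is a minimal, hypothetical illustration of the cache-aware pipeline loop; the names (CacheKey, InMemoryParseCache, Pipeline, fetchAndParse) are assumptions for the sketch, not actual Tika APIs, and a real cache would be disk- or network-backed as noted below.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Cache key is the {lastUpdated, docID} pair from the pipe iterator.
final class CacheKey {
    final String docId;
    final long lastUpdated;
    CacheKey(String docId, long lastUpdated) {
        this.docId = docId;
        this.lastUpdated = lastUpdated;
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return lastUpdated == k.lastUpdated && docId.equals(k.docId);
    }
    @Override public int hashCode() {
        return docId.hashCode() * 31 + Long.hashCode(lastUpdated);
    }
}

// In-memory stand-in for the parse cache (a real one would be file/S3 backed).
class InMemoryParseCache {
    private final Map<CacheKey, String> store = new ConcurrentHashMap<>();
    Optional<String> get(CacheKey key) { return Optional.ofNullable(store.get(key)); }
    void put(CacheKey key, String parsed) { store.put(key, parsed); }
}

class Pipeline {
    private final InMemoryParseCache cache;
    Pipeline(InMemoryParseCache cache) { this.cache = cache; }

    // Consults the cache first when the doc's cache flag allows it;
    // only falls through to the expensive fetch+parse on a miss.
    String process(String docId, long lastUpdated, boolean cacheable) {
        CacheKey key = new CacheKey(docId, lastUpdated);
        if (cacheable) {
            Optional<String> hit = cache.get(key);
            if (hit.isPresent()) {
                return hit.get(); // cache hit: skip fetch+parse entirely
            }
        }
        String parsed = fetchAndParse(docId); // the expensive step being avoided
        if (cacheable) {
            cache.put(key, parsed);
        }
        return parsed;
    }

    String fetchAndParse(String docId) {
        return "parsed:" + docId; // stand-in for the real fetch+parse
    }
}
```

Note that an updated document changes lastUpdated, which changes the key, so stale cached output is never returned.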
> Because of the storage size involved, the cache would need to be disk- or
> network-based storage; an in-memory cache would not be feasible.
> The parse cache should be based on an interface so that users can plug in
> different implementations, such as:
> * File cache
> * S3 cache
> * Others...
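A pluggable interface along these lines might look as follows. This is a hypothetical sketch, not a real Tika interface; the names (ParseCache, FileParseCache) and the one-file-per-key layout are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical SPI: any backend (local files, S3, Solr, ...) implements this.
interface ParseCache {
    Optional<byte[]> get(String docId, long lastUpdated) throws IOException;
    void put(String docId, long lastUpdated, byte[] parsedOutput) throws IOException;
}

// Minimal file-backed implementation: one file per {docID, lastUpdated} pair.
class FileParseCache implements ParseCache {
    private final Path root;

    FileParseCache(Path root) throws IOException {
        this.root = Files.createDirectories(root);
    }

    private Path pathFor(String docId, long lastUpdated) {
        // Sanitize the doc id so arbitrary ids are safe as file names;
        // a production cache would need a collision-proof encoding.
        String safeId = docId.replaceAll("[^A-Za-z0-9._-]", "_");
        return root.resolve(safeId + "-" + lastUpdated + ".bin");
    }

    @Override
    public Optional<byte[]> get(String docId, long lastUpdated) throws IOException {
        Path p = pathFor(docId, lastUpdated);
        return Files.exists(p) ? Optional.of(Files.readAllBytes(p)) : Optional.empty();
    }

    @Override
    public void put(String docId, long lastUpdated, byte[] parsedOutput) throws IOException {
        Files.write(pathFor(docId, lastUpdated), parsedOutput);
    }
}
```

An S3- or Solr-backed variant would implement the same two methods, so the pipeline code stays identical regardless of where the cache lives.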
--
This message was sent by Atlassian Jira
(v8.20.10#820010)