[ 
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas DiPiazza updated TIKA-3835:
------------------------------------
    Description: 
Tika pipes should have an optional configuration to archive parsed results. 
If the exact same version of a document has already been parsed, the parsed 
output can be pulled from a "parse cache" instead of repeating the 
fetch+parse.

In other words, skip the fetch+parse if you did it previously.

Benefits of this:
 * When the tika pipe fetcher is using a cloud service, documents are heavily 
rate limited. So if you manage to fetch and parse a document, storing the 
result for future use is very important.
 * Multi-tier environments can be populated faster. Example: you are pulling 
data from an app in dev, staging, and production. When you run the tika pipe 
job, it will parse each document one time. All the other environments can then 
re-use the parsed output, saving days of run time (in my case).
 ** In other words, "full crawls" for your initial tika index on duplicate 
environments are reduced to cache lookups.

So the process would be:
 * The pipe iterator has the next document: \{lastUpdated,docID}
 ** Pipe iterator documents have an optional field: *cache* _boolean_, 
default=true. If cache=false, this doc will not be cached.
 * If the parse cache is enabled, the *cache* field != false, and the parse 
cache contains \{lastUpdated,docID}:
 ** Get the \{lastUpdated,docID} document from the cache, push it to the emit 
queue, and return.
 * Parse the document.
 * If the parse cache is enabled and the *cache* field != false, put into the 
cache: key=\{lastUpdated,docID}, value=\{document,metadata}
 ** Additional conditions (such as numBytesInBody, etc.) can dictate which 
documents we store in the cache and which ones we skip.
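
The loop above could be sketched roughly as follows. This is a minimal, 
self-contained sketch only; ParseCacheFlow, CacheKey, ParsedDoc, and process() 
are hypothetical names for illustration, not actual tika-pipes APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fetch+parse loop with a parse cache in front.
public class ParseCacheFlow {

    // Cache key pairs {lastUpdated, docID}, so an updated document misses.
    record CacheKey(String docId, long lastUpdated) {}

    // Cached value pairs {document, metadata}.
    record ParsedDoc(String body, Map<String, String> metadata) {}

    private final Map<CacheKey, ParsedDoc> cache = new HashMap<>();
    int parseCount = 0; // exposed only to show how often we actually parsed

    ParsedDoc process(String docId, long lastUpdated, boolean cacheable) {
        CacheKey key = new CacheKey(docId, lastUpdated);
        // Parse cache enabled, cache != false, and key present:
        // emit straight from the cache and skip the fetch+parse.
        if (cacheable && cache.containsKey(key)) {
            return cache.get(key);
        }
        // Otherwise fetch and parse (stubbed out here).
        parseCount++;
        ParsedDoc parsed = new ParsedDoc("parsed:" + docId, Map.of("docId", docId));
        // Store for future runs, unless cache=false was set on this doc.
        if (cacheable) {
            cache.put(key, parsed);
        }
        return parsed;
    }
}
```

Note that a changed lastUpdated produces a different key, so an updated 
document naturally misses the cache and gets re-parsed.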

The cache would need to be disk- or network-based storage because of the 
storage size; an in-memory cache would not be feasible.

The parse cache should be based on an interface so that the user can plug in 
several varieties of implementations, such as:
 * File cache
 * S3 cache
 * Others..
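
As a rough illustration of what that interface and a file-backed 
implementation might look like (all names here are hypothetical, not existing 
Tika classes):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical parse-cache interface keyed on {lastUpdated, docID}.
interface ParseCache {
    Optional<String> get(String docId, long lastUpdated) throws IOException;
    void put(String docId, long lastUpdated, String parsedOutput) throws IOException;
}

// File cache: one file per {lastUpdated, docID} key under a root directory.
// An S3 implementation would map the same key to an object key instead.
class FileParseCache implements ParseCache {
    private final Path root;

    FileParseCache(Path root) {
        this.root = root;
    }

    private Path keyPath(String docId, long lastUpdated) {
        // Assumes docId is filesystem-safe; a real impl would encode it.
        return root.resolve(docId + "_" + lastUpdated + ".json");
    }

    @Override
    public Optional<String> get(String docId, long lastUpdated) throws IOException {
        Path p = keyPath(docId, lastUpdated);
        return Files.exists(p)
                ? Optional.of(Files.readString(p, StandardCharsets.UTF_8))
                : Optional.empty();
    }

    @Override
    public void put(String docId, long lastUpdated, String parsedOutput) throws IOException {
        Files.createDirectories(root);
        Files.writeString(keyPath(docId, lastUpdated), parsedOutput, StandardCharsets.UTF_8);
    }
}
```

Because implementations only differ in where the bytes live, swapping the file 
cache for an S3-backed one would not change the pipeline code at all.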



> tika pipes parse cache - avoid re-parsing content that has not changed
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3835
>                 URL: https://issues.apache.org/jira/browse/TIKA-3835
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-pipes
>    Affects Versions: 2.2.0
>            Reporter: Nicholas DiPiazza
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
