[
https://issues.apache.org/jira/browse/TIKA-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicholas DiPiazza updated TIKA-3835:
------------------------------------
Description:
Tika pipes should have an optional configuration to archive parsed results.
If the exact same version of a document has already been parsed, the archived
output can be returned from a "parse cache" instead of repeating the
fetch+parse. In other words, skip the fetch+parse if it was done previously.
Benefits of this:
* When the tika pipe fetcher is using a cloud service, requests are often
heavily rate limited. So if you manage to fetch and parse a document, storing
the result for future use is very important.
* Multi-tier environments can be populated faster. Example: you are pulling
data from an app in dev, staging, and production. When you run the tika pipe
job, it will parse each document once. All the other environments can then
re-use the parsed output, saving days of run time (in my case).
** In other words, "full crawls" for your initial tika index on duplicate
environments are reduced to cache lookups.
So the process would be:
* Pipe iterator has the next document: {lastUpdated, docID}
* If the parse cache is enabled and contains {lastUpdated, docID}:
** Emit the cached document to the emit queue and return.
* Parse the document.
* If the parse cache is enabled, put into the cache with
key={lastUpdated, docID}, value={document, metadata}.
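The steps above can be sketched as a cache-aware processing loop. This is a minimal, hypothetical sketch: names like ParseCacheLoop, ParsedDoc, and fetchAndParse are illustrative stand-ins, not existing Tika APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the cache-aware pipe loop described above.
public class ParseCacheLoop {
    // The cache key combines the document id with its last-updated stamp,
    // so a changed document misses the cache and is re-parsed.
    public record CacheKey(String docId, long lastUpdated) {}
    public record ParsedDoc(String content, Map<String, String> metadata) {}

    private final Map<CacheKey, ParsedDoc> cache = new HashMap<>();

    public ParsedDoc process(String docId, long lastUpdated) {
        CacheKey key = new CacheKey(docId, lastUpdated);
        ParsedDoc cached = cache.get(key);
        if (cached != null) {
            return cached; // cache hit: skip fetch + parse, emit directly
        }
        ParsedDoc parsed = fetchAndParse(docId); // the expensive path
        cache.put(key, parsed);
        return parsed;
    }

    private ParsedDoc fetchAndParse(String docId) {
        // Stand-in for the real fetcher + parser pipeline.
        return new ParsedDoc("body of " + docId, Map.of("id", docId));
    }
}
```

Because lastUpdated is part of the key, a document whose timestamp changes naturally falls through to a fresh fetch+parse.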
This will drastically improve full crawl times for customers using services
with strict rate limits, especially cloud file services.
The parse cache should be based on an interface so that users can plug in a
variety of implementations, such as:
* File cache
* S3 cache
* Others
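One way the pluggable cache could look, as a rough sketch: the ParseCache interface and FileParseCache class below are hypothetical, not part of Tika; an S3-backed variant would implement the same two methods against the S3 SDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Illustrative cache interface: get returns the previously stored parse
// output for a key, or empty on a cache miss.
public interface ParseCache {
    Optional<byte[]> get(String key) throws IOException;
    void put(String key, byte[] serializedEmitData) throws IOException;
}

// Minimal file-backed implementation: one file per cache key.
class FileParseCache implements ParseCache {
    private final Path dir;

    FileParseCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    @Override
    public Optional<byte[]> get(String key) throws IOException {
        Path f = dir.resolve(key);
        return Files.exists(f)
                ? Optional.of(Files.readAllBytes(f))
                : Optional.empty();
    }

    @Override
    public void put(String key, byte[] serializedEmitData) throws IOException {
        Files.write(dir.resolve(key), serializedEmitData);
    }
}
```

A real implementation would also need to serialize the {document, metadata} pair and sanitize keys into safe file names; those details are omitted here.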
> tika pipes parse cache - avoid re-parsing content that has not changed
> ----------------------------------------------------------------------
>
> Key: TIKA-3835
> URL: https://issues.apache.org/jira/browse/TIKA-3835
> Project: Tika
> Issue Type: New Feature
> Components: tika-pipes
> Affects Versions: 2.2.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)