[
https://issues.apache.org/jira/browse/OAK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Marth updated OAK-2892:
-------------------------------
Fix Version/s: (was: 1.0.15)
> Speed up lucene indexing post migration by pre extracting the text content
> from binaries
> ----------------------------------------------------------------------------------------
>
> Key: OAK-2892
> URL: https://issues.apache.org/jira/browse/OAK-2892
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: lucene, run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Labels: performance
> Fix For: 1.3.1
>
>
> While migrating large repositories, say one with 3 M docs (250k PDFs), Lucene
> indexing takes a long time to complete (at times 4 days!). Currently the text
> extraction logic is coupled with Lucene indexing and hence is performed in
> single-threaded mode, which slows down the indexing process. Further, if
> reindexing has to be triggered, the extraction has to be done all over again.
> To speed up the Lucene indexing we can decouple the text extraction
> from the actual indexing. This is partly based on the discussion on OAK-2787.
> # Introduce a new ExtractedTextProvider which can provide extracted text for
> a given Blob instance
> # In oak-run introduce a new indexer mode. This would take a path in the
> repository, traverse the repository from that path, look for existing
> binaries and extract text from them
> So before or after the migration is done, one can run this oak-run tool to
> create a store which has the text already extracted. Then post startup we need
> to wire up the ExtractedTextProvider instance (which is backed by the BlobStore
> populated before) and the indexing logic can just get content from that. This
> would avoid performing expensive text extraction in the indexing thread.
> See discussion thread http://markmail.org/thread/ndlfpkwfgpey6o66
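
The proposal above can be sketched roughly as follows. This is only an illustration of the decoupling idea, not the Oak implementation: the interface shape, the `InMemoryTextStore` class, the use of plain blob id strings instead of Blob instances, and the `extractor` function (standing in for the real text extraction) are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PreExtractionSketch {

    /** Hypothetical provider: returns pre-extracted text for a blob id, if present. */
    public interface ExtractedTextProvider {
        Optional<String> getText(String blobId);
    }

    /** In-memory stand-in for the store that the offline oak-run pass would populate. */
    public static final class InMemoryTextStore implements ExtractedTextProvider {
        private final Map<String, String> textByBlobId = new ConcurrentHashMap<>();

        public void put(String blobId, String text) {
            textByBlobId.put(blobId, text);
        }

        @Override
        public Optional<String> getText(String blobId) {
            return Optional.ofNullable(textByBlobId.get(blobId));
        }
    }

    /**
     * Offline pass: extract text for every binary ahead of indexing.
     * A parallel stream is used here so extraction is no longer single threaded.
     */
    public static InMemoryTextStore preExtract(Map<String, byte[]> binaries,
                                               Function<byte[], String> extractor) {
        InMemoryTextStore store = new InMemoryTextStore();
        binaries.entrySet().parallelStream()
                .forEach(e -> store.put(e.getKey(), extractor.apply(e.getValue())));
        return store;
    }

    /** Indexing side: prefer pre-extracted text, fall back to extracting inline. */
    public static String textForIndexing(String blobId, byte[] content,
                                         ExtractedTextProvider provider,
                                         Function<byte[], String> extractor) {
        return provider.getText(blobId).orElseGet(() -> extractor.apply(content));
    }

    public static void main(String[] args) {
        // Placeholder extractor: treats the binary as UTF-8 text.
        Function<byte[], String> extractor = b -> new String(b, StandardCharsets.UTF_8);

        Map<String, byte[]> binaries = new HashMap<>();
        binaries.put("blob-1", "first document".getBytes(StandardCharsets.UTF_8));
        binaries.put("blob-2", "second document".getBytes(StandardCharsets.UTF_8));

        InMemoryTextStore store = preExtract(binaries, extractor);

        // blob-1 is served from the store; blob-3 was never pre-extracted and falls back.
        System.out.println(textForIndexing("blob-1", binaries.get("blob-1"), store, extractor));
        System.out.println(textForIndexing("blob-3",
                "late arrival".getBytes(StandardCharsets.UTF_8), store, extractor));
    }
}
```

Keeping a fallback to inline extraction means binaries added after the offline pass would still be indexed, only without the speed-up. A reindex could reuse the same store instead of extracting everything again.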
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)