Hi Davide,

Of course we need to measure the speed and the improvements. I am not sure how much time I will have to implement benchmarks for this, but I will try.

I have not tried the cached Tika full-text extraction yet. I'm curious how much gain it can provide, though. This may make re-indexing so cheap that we need not worry about it any more.

I hadn't really paid attention to OAK-2749, but it sounds interesting. On a related but different note, I was pondering the idea of allowing multithreaded tar2mongo copies. Since DocumentMK supports clustering, it should be possible to copy different sub-trees in different threads?!

It would indeed be interesting to have a chat some time! But first I'll be on holidays for two weeks :)

Regards
Julian

On Wed, Aug 5, 2015 at 8:57 PM, Davide Giannella <[email protected]> wrote:
> On 05/08/2015 17:45, Julian Sedding wrote:
>> ...
>>
>> My aim is to reduce the critical path for migrating one NodeStore
>> (incl JR2) to another. Indexing (especially async indexing) is a
>> big part of the time, so if I can move that out of the critical path,
>> it can save a lot of downtime.
>
> Interesting. I know async index can be lengthy but it would be very
> interesting if we could measure what we have now and the improvements
> we're making.
>
> The slowest part of the async index is normally the full-text extraction
> as they run in a single thread. With
> https://issues.apache.org/jira/browse/OAK-2749 we provided a mechanism
> (not used yet AFAIK) to run different indexers on different threads.
> Maybe it's something you would like to experiment with as well to speed
> up the indexing.
>
> If you want ping me on chat tomorrow morning (CEST) so we can quickly
> see what we can do here. But I think we should start measuring it first :)
>
> Cheers
> Davide
>
