Has anyone experienced a significant slowdown when adding many (tens or hundreds of thousands of) documents to an Oak repository?
I'm using:

    <tika.version>1.7</tika.version>
    <jackrabbit.version>2.14.1</jackrabbit.version>
    <oak.version>1.8-SNAPSHOT</oak.version>
    <lucene.version>4.7.1</lucene.version>

and creating the repository (in a Spring Boot app) basically like this:

    // JMX export and metrics
    MBeanExporter mbe = new MBeanExporter();
    mbe.setServer(mbs);
    mbe.setNamingStrategy(new IdentityNamingStrategy());
    GCMonitor gcMonitor = new GCMonitorTracker();
    StatisticsProvider statisticsProvider =
        new MetricStatisticsProvider(mbs, Oak.defaultScheduledExecutor());

    // Segment file store
    FileStoreBuilder fsBuilder = FileStoreBuilder.fileStoreBuilder(new File(repoDirectory));
    fsBuilder.withGCMonitor(gcMonitor);
    fsBuilder.withIOMonitor(new MetricsIOMonitor(statisticsProvider));
    fsBuilder.withStatisticsProvider(statisticsProvider);
    this.fs = fsBuilder.build();

    // Node store on top of the file store
    SegmentNodeStoreBuilder nsBuilder = SegmentNodeStoreBuilders.builder(fs);
    nsBuilder.withStatisticsProvider(statisticsProvider);
    this.ns = nsBuilder.build();

    // Oak and the JCR repository, with async indexing every 5 seconds
    this.executor = Oak.defaultExecutorService();
    this.oak = new Oak(ns);
    this.oak.with(mbs);
    this.oak.withAsyncIndexing("async", 5);
    this.jcr = new Jcr(oak);
    this.repository = jcr.createRepository();

The basic problem is that I'm doing a data migration (~1 million docs) from a legacy system. When I start inserting the documents into Oak (the folder structure is very flat), it absolutely flies in the beginning, but it slows down significantly by the time I get to 75k or so documents. Watching the stats in "org.apache.jackrabbit.oak:name=oak.segment.segment-write-time,type=Metrics / OneMinuteRate" shows an 80% slowdown over the course of an hour or so.

A possibly related observation: when I first start out, I see the "async" indexer running and logging as the folders and documents are being created, but it stops logging anything within the first 20k inserts.

I should probably also add that each session doing the writes adds about 10 file nodes before syncing (i.e. I'm neither using one session per file nor doing everything in a single session).
The actual inserts are done by a thread pool with about six workers writing files into Oak simultaneously.

Has anyone else seen similar behavior? Is there anything I should be taking into account when moving so many files at once? Any thoughts would be hugely appreciated.

Thanks!

-- 
Bill Markmann
President, Counterpoint Consulting
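P.S. In case it helps to picture the write pattern described above, here is a rough, simplified sketch. It is not the actual migration code: the class name, `partition`, `migrate`, and the `writeBatch` callback are all hypothetical stand-ins, and the callback marks the spot where the real code would open a session, add the file nodes, and sync. It uses only the JDK so it runs without Oak on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch of the batching pattern; not the real migration code.
final class BatchedMigrationSketch {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Hands each batch to writeBatch on a fixed-size worker pool and waits
     * for all of them. In the real code, writeBatch would be the place that
     * adds the batch's file nodes and then syncs the session (assumption).
     * Returns the number of batches written.
     */
    static <T> int migrate(List<T> docs, int batchSize, int workers,
                           Consumer<List<T>> writeBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> writeBatch.accept(batch)));
            }
            for (Future<?> f : pending) {
                f.get(); // surface any failure from a worker
            }
            return pending.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while migrating", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docs.add("doc-" + i);
        }
        AtomicInteger written = new AtomicInteger();
        // 10 documents per batch, six workers, as described in the post.
        int batches = migrate(docs, 10, 6, b -> written.addAndGet(b.size()));
        System.out.println("batches=" + batches + " docs=" + written.get());
    }
}
```

The batch size and worker count are plain parameters here, so it is easy to try other combinations (for example a single writer) when comparing throughput.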