Hi,

Assuming the MongoDB instance is performing well and does not show any slow queries in the MongoDB logs, running the index operation on many cores, with each core handling one index writer, should parallelise the operation. IIRC this is theoretically possible, and might have been implemented in the latest versions of Oak (Chetan?). If you are in AWS then an X1 instance will give you 128 cores and up to 2TB of RAM for the duration of the re-index; other cloud vendors have equivalent VMs. Whatever the instance is, the Oak cluster leader should be allocated to it, as IIRC only the Oak cluster leader performs the index operation. The single-threaded index writer is a feature/limitation of the way Lucene works, but Oak has many independent indexes. Your deployment may not have 128 of them, so it may not be able to use all the cores of the largest instance.
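As a quick sanity check on that "no slow queries" assumption, something like the following can surface slow operations in the mongod log. This is a rough sketch: the log path and exact line format vary by deployment and MongoDB version, so treat both as assumptions to adapt.

```shell
# Sample log line for illustration only; in a real deployment point the
# grep at your actual mongod log, e.g. /var/log/mongodb/mongod.log
# (path is an assumption and varies by install).
printf '2017-06-09T07:49:00 I COMMAND [conn1] command oak.nodes planSummary: COLLSCAN 523ms\n' > /tmp/mongod-sample.log

# mongod appends the duration of slow operations (default threshold
# 100ms) to the log line, so lines ending in three or more digits
# followed by "ms" are candidates worth investigating.
grep -E '[0-9]{3,}ms$' /tmp/mongod-sample.log
```

If profiling overhead is acceptable, the same threshold can also be inspected live from the mongo shell with the database profiler, e.g. `db.setProfilingLevel(1, 100)` and then querying `db.system.profile`.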
If however the MongoDB cluster is showing any signs of slow queries in the logs (> 100ms), or any level of read IO, then however many cores over however many VMs won't speed the process up, and may in fact slow it down. To be certain of no bottleneck in MongoDB, ensure the VM has more memory than the disk size of the database. The latest version of MongoDB supported by Oak, running WiredTiger, will greatly reduce memory pressure and IO, as it doesn't use memory mapping as the primary DB-to-disk mechanism and compresses the data as it writes.

The instance running Oak must also be sized correctly. I suspect you will be running a persistent cache, which must be sized to give optimum performance and minimise IO, and which therefore also requires sufficient memory. For the period of the re-index, the largest AEM instance you can afford will minimise IO. Big VMs (in AWS at least) have more network bandwidth, which also helps.

Finally, disks. Don't use HDD, only use SSD, ensure that sufficient IOPS are available at all times, and enable all the Oak indexing optimisation switches (copyOnRead, copyOnWrite etc.). IO generally kills performance, and if the VMs have not been tuned (THP off, readahead low, XFS or noatime ext4 disks) then that IO will be amplified.

If you have done all of this, then you might have to wait for OAK-6246 (I see Chetan just responded), but if you haven't, please do check that you are running as fast as possible with no constrained resources.

HTH; if it's been said before, sorry for the noise and please ignore.

Best Regards
Ian

On 9 June 2017 at 07:49, Alvaro Cabrerizo <[email protected]> wrote: > Thanks Chetan, > > Sorry, but that part is out of my reach. There is an IT team in charge of > managing the infrastructure and make optimizations, so It is difficult to > get that information. Basically what is was looking for is the way > to parallelize the indexing process. 
On the other hand, reducing the > indexing time would be fine (it was previously reduced from 7 to 2 days), > but I think that traversing more than 100000000 nodes is a pretty tough > operation and I'm not sure if there is much we can do. Anyway, any pointer > related to indexing optimization or any advice on how to design the repo > (e.g. use different paths to isolate different groups of assets, use > different nodetypes to differentiate content type, create different > repositories [is that possible?] for different groups of uses...) is > welcome. > > Regards. > > On Thu, Jun 8, 2017 at 12:44 PM, Chetan Mehrotra < > [email protected]> > wrote: > > > On Thu, Jun 8, 2017 at 4:04 PM, Alvaro Cabrerizo <[email protected]> > > wrote: > > > It is a DocumentNodeStore based instance. We don't extract data from > > binary > > > files, just indexing metadata stored on nodes. > > > > In that case 48 hrs is a long time. Can you share some details around > > how many nodes are being indexed as part of that index and the repo > > size in terms of Mongo stats if possible? > > > > Chetan Mehrotra > > >
