Hi,
Assuming the MongoDB instance is performing well and does not show any slow
queries in the MongoDB logs, running the index operation on many cores,
with each core handling one index writer, should parallelise the operation.
IIRC this is theoretically possible, and might have been implemented in the
latest versions of Oak (Chetan?). If you are in AWS, an X1 instance will
give you 128 cores and up to 2TB of RAM for the duration of the re-index.
Other cloud vendors have equivalent VMs. Whatever the instance is, the Oak
cluster leader should be allocated to it, as IIRC only the Oak cluster
leader performs the index operation. The single-threaded index writer is a
feature/limitation of the way Lucene works, but Oak has many independent
indexes; your deployment may not have 128 of them, so it may not be able to
use all the cores of the largest instance.

If, however, the MongoDB cluster is showing any signs of slow queries in
the logs (> 100ms), or any level of read IO, then however many cores over
however many VMs won't speed the process up, and may slow it down. To be
certain of no bottleneck in MongoDB, ensure the VM has more memory than the
disk size of the database. The latest version of MongoDB supported by Oak,
running WiredTiger, will greatly reduce memory pressure and IO, as it
doesn't use memory mapping as the primary DB-to-disk mechanism and
compresses the data as it writes.
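As a quick sanity check, something like the following pulls slow-operation entries out of a mongod log. This is only a sketch: the log path and the sample lines are assumptions (mongod logs any operation over its slowms threshold, 100ms by default, with the duration at the end of the line), so adjust both to your deployment.

```shell
# Hypothetical log path -- adjust to your deployment
# (e.g. /var/log/mongodb/mongod.log).
LOG=./mongod.log

# Sample lines in the general shape mongod uses for slow operations,
# so the check below can be demonstrated end-to-end:
cat > "$LOG" <<'EOF'
2017-06-09T07:49:01.123+0000 I COMMAND [conn42] command oak.nodes command: find ... 312ms
2017-06-09T07:49:02.456+0000 I COMMAND [conn43] command oak.nodes command: find ... 8ms
EOF

# Match durations of 100ms or more at the end of a line.
# Any hit here suggests MongoDB itself is the bottleneck,
# and throwing more Oak cores at the re-index won't help.
grep -E ' [0-9]{3,}ms$' "$LOG"
```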

The instance running Oak must also be sized correctly. I suspect you will
be running a persistent cache, which must be sized to give optimum
performance and minimise IO, and which therefore also requires sufficient
memory. For the period of the re-index, the largest AEM instance you can
afford will minimise IO. Big VMs (in AWS at least) have more network
bandwidth, which also helps.
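For reference, the persistent cache is set on the DocumentNodeStore; a sketch of an OSGi config doing that is below. The size (in MB) is an assumption to tune to your memory budget, and the file is written locally here for illustration -- in AEM it would go under crx-quickstart/install/. Verify the property syntax against your Oak version.

```shell
# Sketch: OSGi config sizing the DocumentNodeStore persistent cache.
# The "\=" is the Sling .config escaping for "=" inside a quoted value.
mkdir -p install
cat > install/org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreService.config <<'EOF'
persistentCache="cache,size\=2048"
EOF
```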

Finally, disks. Don't use HDDs, only SSDs, and ensure that sufficient IOPS
are available at all times; also enable all the Oak indexing optimisation
switches (copyOnRead, copyOnWrite etc).
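Those switches live on Oak's LuceneIndexProviderService; a sketch of an OSGi config enabling them is below. The property names are from Oak's Lucene index provider but should be verified against your Oak version, and the local index directory path is an assumption (point it at your SSD). Again written locally for illustration; in AEM it would go under crx-quickstart/install/.

```shell
# Sketch: enable copy-on-read/copy-on-write for Lucene indexes via OSGi
# config. B"true" is the Sling .config syntax for a boolean value.
mkdir -p install
cat > install/org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProviderService.config <<'EOF'
enableCopyOnReadSupport=B"true"
enableCopyOnWriteSupport=B"true"
localIndexDir="/fast-ssd/oak-index"
EOF
```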

IO generally kills performance, and if the VMs have not been configured
correctly (THP off, readahead low, XFS or noatime ext4 disks) then that IO
will be amplified.
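Concretely, the host tuning above looks something like this (run as root on the MongoDB host; the device name and mount point are examples -- substitute your data volume). Treat it as a sketch of the settings, not a hardened script.

```shell
# 1. Disable transparent huge pages (MongoDB's own recommendation):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# 2. Lower readahead so random reads don't drag in whole megabytes
#    (32 x 512-byte sectors = 16KB):
blockdev --setra 32 /dev/xvdf

# 3. Mount the data volume as XFS, or ext4 with noatime, so every read
#    doesn't also trigger a metadata write:
mount -o noatime /dev/xvdf /var/lib/mongodb
```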

If you have done all of this, then you might have to wait for OAK-6246 (I
see Chetan just responded), but if you haven't, please do check that you
are running as fast as possible with no constrained resources.

HTH. If it's been said before, sorry for the noise and please ignore.
Best Regards
Ian

On 9 June 2017 at 07:49, Alvaro Cabrerizo <[email protected]> wrote:

> Thanks Chetan,
>
> Sorry, but that part is out of my reach. There is an IT team in charge of
> managing the infrastructure and making optimizations, so it is difficult
> to get that information. Basically, what I was looking for is a way
> to parallelize the indexing process. On the other hand, reducing the
> indexing time would be fine (it was previously reduced from 7 to 2 days),
> but I think that traversing more than 100000000 nodes is a pretty tough
> operation and I'm not sure if there is much we can do. Anyway, any pointer
> related to indexing optimization or any advice on how to design the repo
> (e.g. use different paths to isolate different groups of assets, use
> different nodetypes to differentiate content type, create different
> repositories [is that possible?] for different groups of uses...) is
> welcome.
>
> Regards.
>
> On Thu, Jun 8, 2017 at 12:44 PM, Chetan Mehrotra <
> [email protected]>
> wrote:
>
> > On Thu, Jun 8, 2017 at 4:04 PM, Alvaro Cabrerizo <[email protected]>
> > wrote:
> > > It is a DocumentNodeStore based instance. We don't extract data from
> > binary
> > > files, just indexing metadata stored on nodes.
> >
> > In that case 48 hrs is a long time. Can you share some details around
> > how many nodes are being indexed as part of that index and the repo
> > size in terms of Mongo stats if possible?
> >
> > Chetan Mehrotra
> >
>
