Thank you guys for your comments.

On Fri, Jun 9, 2017 at 9:50 AM, Ian Boston <[email protected]> wrote:
> Hi,
>
> Assuming the MongoDB instance is performing well and does not show any
> slow queries in the MongoDB logs, running the index operation on many
> cores, each core handling one index writer, should parallelise the
> operation. IIRC this is theoretically possible, and might have been
> implemented in the latest versions of Oak (Chetan?). If you are in AWS
> then an X1 instance will give you 128 cores and up to 2 TB of RAM for the
> duration of the re-index. Other cloud vendors have equivalent VMs.
> Whatever the instance is, the Oak cluster leader should be allocated to
> it, as IIRC only the Oak cluster leader performs the index operation. The
> single-threaded index writer is a feature/limitation of the way Lucene
> works, but Oak has many independent indexes. Your deployment may not have
> 128 indexes, so it may not be able to use all the cores of the largest
> instance.
>
> If, however, the MongoDB cluster is showing any signs of slow queries in
> the logs (> 100 ms), or any level of read IO, then however many cores
> over however many VMs won't speed the process up, and may slow it down.
> To be certain of no bottleneck in MongoDB, ensure the VM has more memory
> than the on-disk size of the database. The latest version of MongoDB
> supported by Oak, running WiredTiger, will greatly reduce memory pressure
> and IO, as it doesn't use memory mapping as the primary DB-to-disk
> mechanism and compresses the data as it writes.
>
> The instance running Oak must also be sized correctly. I suspect you will
> be running a persistent cache, which must be sized to give optimum
> performance and minimise IO, and which therefore also requires sufficient
> memory. For the period of the re-index, the largest AEM instance you can
> afford will minimise IO. Big VMs (in AWS at least) have more network
> bandwidth, which also helps.
>
> Finally, disks.
> Don't use HDD, only SSD, and ensure that sufficient IOPS are available at
> all times, and enable all the Oak indexing optimisation switches
> (copyOnRead, copyOnWrite, etc.).
>
> IO generally kills performance, and if the VMs have not been tuned (THP
> off, readahead low, XFS or noatime ext4 disks) then that IO will be
> amplified.
>
> If you have done all of this, then you might have to wait for OAK-6246 (I
> see Chetan just responded), but if you haven't, please do check that you
> are running as fast as possible with no constrained resources.
>
> HTH. If this has been said before, sorry for the noise and please ignore.
>
> Best Regards
> Ian
>
> On 9 June 2017 at 07:49, Alvaro Cabrerizo <[email protected]> wrote:
>
> > Thanks Chetan,
> >
> > Sorry, but that part is out of my reach. There is an IT team in charge
> > of managing the infrastructure and making optimizations, so it is
> > difficult to get that information. Basically, what I was looking for is
> > a way to parallelize the indexing process. On the other hand, reducing
> > the indexing time would be fine (it was previously reduced from 7 to 2
> > days), but I think that traversing more than 100000000 nodes is a
> > pretty tough operation and I'm not sure there is much we can do.
> > Anyway, any pointer related to indexing optimization, or any advice on
> > how to design the repo (e.g. use different paths to isolate different
> > groups of assets, use different nodetypes to differentiate content
> > types, create different repositories [is that possible?] for different
> > groups of uses...) is welcome.
> >
> > Regards.
> >
> > On Thu, Jun 8, 2017 at 12:44 PM, Chetan Mehrotra
> > <[email protected]> wrote:
> >
> > > On Thu, Jun 8, 2017 at 4:04 PM, Alvaro Cabrerizo
> > > <[email protected]> wrote:
> > > > It is a DocumentNodeStore based instance. We don't extract data
> > > > from binary files, just indexing metadata stored on nodes.
> > >
> > > In that case 48 hrs is a long time.
> > > Can you share some details around how many nodes are being indexed
> > > as part of that index, and the repo size in terms of Mongo stats, if
> > > possible?
> > >
> > > Chetan Mehrotra
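A note on the copyOnRead/copyOnWrite switches Ian mentions: they are exposed on Oak's Lucene index provider OSGi component. The sketch below assumes the component name `org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProviderService` and the property names `enableCopyOnReadDir`, `enableCopyOnWriteDir` and `localIndexDir` (please verify against the docs for your Oak version), with the local index directory pointed at fast SSD:

```
# org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProviderService.config
# Property names assumed from Oak docs -- check your version before relying on them.
enableCopyOnReadDir=B"true"
enableCopyOnWriteDir=B"true"
localIndexDir="/mnt/ssd/oak-index"
```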
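Ian's "> 100 ms" warning sign can be checked even without access to the MongoDB profiler, by scanning the mongod log. This is only a rough sketch under assumptions: the classic mongod log format ends each operation line with its duration (e.g. `... 245ms`), and the sample lines below are made up for illustration.

```python
import re

SLOW_MS = 100  # Ian's threshold: operations over 100 ms are a warning sign

def slow_ops(log_lines):
    """Return (duration_ms, line) pairs for operations slower than SLOW_MS.

    Assumes the classic mongod log format, where each operation line
    ends with its duration, e.g. '... query oak.nodes ... 245ms'.
    """
    hits = []
    for line in log_lines:
        m = re.search(r"(\d+)ms\s*$", line.rstrip())
        if m and int(m.group(1)) > SLOW_MS:
            hits.append((int(m.group(1)), line.strip()))
    return hits

# Fabricated sample lines, for illustration only:
sample = [
    "2017-06-09T09:50:01 I COMMAND query oak.nodes ... 38ms",
    "2017-06-09T09:50:02 I COMMAND query oak.nodes ... 245ms",
]
print(slow_ops(sample))  # only the 245 ms operation is flagged
```

If this turns up a steady stream of slow operations during the re-index, more Oak cores won't help until the MongoDB side is fixed.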
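On the OS side, the THP setting Ian calls out is easy to verify: on Linux, `/sys/kernel/mm/transparent_hugepage/enabled` marks the active mode in brackets, and for MongoDB it should read `never`. A minimal parser for that file's contents (reading the sysfs path directly is left out so the sketch stays runnable anywhere):

```python
import re

def active_thp(sysfs_text):
    """Return the active THP mode from the contents of
    /sys/kernel/mm/transparent_hugepage/enabled, where the kernel marks
    the current choice in brackets, e.g. 'always madvise [never]'.
    """
    m = re.search(r"\[(\w+)\]", sysfs_text)
    return m.group(1) if m else None

# For MongoDB under Oak, you want this to report 'never':
print(active_thp("always madvise [never]"))
```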
