*Additional notes :*

text-content-1 : 51 GB
text-content-2 : 74 GB
text-content-3 : 66 GB (hosted on node 1)
*List cache hits :*

text-content-1 43,752,916
text-content-2 49,538,336
text-content-3 31,605,528 (hosted on node 1)

*Compressed tree cache hits :*

text-content-1 17,401
text-content-2 17,443
text-content-3 11,471 (hosted on node 1)

Regards,
Danny

On Tue, Aug 21, 2012 at 7:36 AM, Danny Sinang <[email protected]> wrote:

> Hi,
>
> We have a 3 node ML 4.2-6 cluster.
>
> Since last week, we've seen CPU usage on nodes 2 and 3 skyrocket to around
> 90% each from 6 to 10 pm, while node 1 would hit only 30% at peak period.
>
> Now we've seen an influx of new customers and this could explain the
> sudden load during that period. Moreover, it looks like we need to rewrite
> some of our code to reduce CPU usage.
>
> However, what confounds me is why node 1 isn't taking on as much load as
> the other nodes. I'm thinking maybe the following events / situations
> caused it. Hope somebody here can confirm or point me in the right
> direction.
>
> 1. Expanded Tree Cache Increase / Restart / Seg Fault / Forest Failover
>
> The night before the CPU usage spike, I had to increase the Expanded Tree
> Cache for the cluster from 8 GB to 12 GB (i.e. 12288 MB). This of course
> caused the ML cluster to automatically restart. After the restart,
> everything looked ok from the application perspective. However, two hours
> later, node 1 suddenly encountered multiple "XDMP-OLDSTAMP : Timestamp too
> old for forest" and "Segmentation fault" errors, which caused multiple
> restarts. Eventually, the forests on node 1 did a local-disk failover to
> node 2. The following day, we decided to "unfailover" the forest on node 2
> back to node 1. The database status shows everything's back to normal after
> that, except of course the uneven load between the nodes.
>
> Questions :
>
> a. Could the forest "unfailover" have failed to tell the cluster that node
> 1 is back in business, thus the uneven load ?
>
> b. Could it be that, despite the database status showing that the
> "unfailover" was successful, node 2 is still serving the content that
> failed over to it ?
>
> c. Could the 12 GB (12288 MB) expanded tree cache be an "uneven" number
> causing the multiple restarts and old timestamp and segmentation fault
> errors ?
>
> d. Could the 12 GB expanded tree cache be causing the uneven load across
> the 3 nodes ? The expanded tree cache partitions setting is 4.
>
> 2. Ingestion of content done only on nodes 2 and 3.
>
> By design, we validate incoming content on node 1 and ingest (i.e.
> document-insert) it on nodes 2 and 3. Could this be causing the content
> to be saved only on nodes 2 and 3 ? I was informed months ago that ML
> automatically saves documents evenly across all the forests making up a
> database. In our case, our production database is made up of 3 forests,
> one on each of the nodes.
>
> Looking at the database status, it looks like the forests containing
> binary files have almost the same size, but the text forests have varying
> sizes as follows :
>
> text-content-1 : 51 GB
> text-content-2 : 74 GB
> text-content-3 : 66 GB
>
> Regards,
> Danny
