Hi,

We have a 3-node ML 4.2-6 cluster.
Since last week, CPU usage on nodes 2 and 3 has skyrocketed to around 90% each from 6 to 10 pm, while node 1 hits only 30% at the peak period. We've seen an influx of new customers, which could explain the sudden load during that period. It also looks like we need to rewrite some of our code to reduce CPU usage. What confounds me, however, is why node 1 isn't taking on as much load as the other nodes. I think the following events or situations may have caused it, and I hope somebody here can confirm or point me in the right direction.

1. Expanded Tree Cache Increase / Restart / Seg Fault / Forest Failover

The night before the CPU usage spike, I had to increase the Expanded Tree Cache for the cluster from 8 GB to 12 GB (i.e. 12288 MB). This of course caused the ML cluster to restart automatically. After the restart, everything looked OK from the application perspective. However, two hours later, node 1 suddenly encountered multiple "XDMP-OLDSTAMP : Timestamp too old for forest" and "Segmentation fault" errors, which caused multiple restarts. Eventually, the forests on node 1 did a local-disk failover to node 2. The following day, we decided to "unfailover" the forest on node 2 back to node 1. The database status shows everything is back to normal after that, except of course for the uneven load between the nodes.

Questions:

a. Could the forest "unfailover" have failed to tell the cluster that node 1 is back in business, hence the uneven load?
b. Could it be that, despite the database status showing that the "unfailover" was successful, node 2 is still serving the content that failed over to it?
c. Could the 12 GB (12288 MB) expanded tree cache be an "uneven" number, causing the multiple restarts and the old timestamp and segmentation fault errors?
d. Could the 12 GB expanded tree cache be causing the uneven load across the 3 nodes? The number of expanded tree cache partitions is set to 4.

2. Ingestion of content done only on nodes 2 and 3
By design, we validate incoming content on node 1 and ingest it (i.e. document-insert) on nodes 2 and 3. Could this be causing the content to be saved only on nodes 2 and 3? I was informed months ago that ML automatically distributes documents evenly across all the forests making up a database. In our case, the production database is made up of 3 forests, one stored on each of the nodes. Looking at the database status, the forests containing binary files have almost the same size, but the text forests vary in size as follows:

text-content-1 : 51 GB
text-content-2 : 74 GB
text-content-3 : 66 GB

Regards,
Danny
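
To put the text forest sizes above in proportion, here is a minimal Python sketch (not MarkLogic code) that computes each forest's share of the total and compares it with an even three-way split. The dictionary and function names are made up for illustration; the sizes are the ones listed above, and on-disk size is only a rough proxy for how many documents each forest actually holds.

```python
# Forest sizes in GB, as listed in the database status above.
# The variable and function names are hypothetical, chosen for this sketch.
forest_sizes_gb = {
    "text-content-1": 51,
    "text-content-2": 74,
    "text-content-3": 66,
}


def forest_shares(sizes):
    """Return each forest's fraction of the combined size."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


shares = forest_shares(forest_sizes_gb)
even_share = 1 / len(forest_sizes_gb)  # what a perfectly even split would give

for name, share in shares.items():
    print(f"{name}: {share:.1%} of total (even split: {even_share:.1%})")
```

If the shares stay close to the even split, the size difference alone is unlikely to account for a 90% versus 30% CPU gap, which would point back at how requests are routed to the nodes.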
