Hi Mike,

Thanks for your feedback, and sorry for taking so long to reply; somehow the email did not make it to my mailbox (corporate spam filter, I suppose). I'm copy-pasting your response, so let's hope it does not break the threading.
The OS is SUSE Linux Enterprise Server 11 SP2 (x86_64), and we did not see any hardware-related problems on the machine (I/O, network, ...).

We took your advice and increased the log level, and indeed there were quite a few things to work with:

Forest::insert: SwocUxOnlineContent-06 XDMP-INMMLISTFULL: In-memory list storage full; list: table=78%, wordsused=76%, wordsfree=0%, overhead=24%; tree: table=56%, wordsused=53%, wordsfree=47%, overhead=1%

And after that one, several client connection problems (No XDQP session):

2013-02-06 14:40:37.049 Debug: Stopping XDQPClientConnection, server=ml-c1-u2.swets.nl /data/everwisedata3/Forests/SwocUxOnlineContent-06/

We checked the documentation and bumped the in-memory-list-size from 512 to 1024. For around one hour the problem seemed to be solved, but after a while the hung messages reappeared.
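For reference, the same change can also be made with the Admin API; this is just a minimal sketch (the database name is a guess based on our forest names, and the value is in MB):

  xquery version "1.0-ml";

  import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

  (: Raise the in-memory list size from 512 to 1024 MB.
     "SwocUxOnlineContent" is a placeholder database name. :)
  let $config := admin:get-configuration()
  let $db     := xdmp:database("SwocUxOnlineContent")
  return admin:save-configuration(
    admin:database-set-in-memory-list-size($config, $db, 1024))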
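Since you mentioned the CPF setup sounds odd, here is roughly what our scheduled task does, in case it helps with the uneven-load question. This is a heavily simplified sketch; the module path, property name and delimiter are illustrative, and it assumes the URI lexicon is enabled:

  xquery version "1.0-ml";

  (: Pick up to 50k documents whose "status" property is "unprocessed"
     and spawn 50 batches of 1k URIs each. The spawned module flips the
     state property on each URI to "processing", which is what fires
     the CPF on-state-change event. :)
  let $uris :=
    cts:uris((), "limit=50000",
      cts:properties-query(
        cts:element-value-query(xs:QName("status"), "unprocessed")))
  for $batch in 1 to 50
  let $batch-uris := fn:subsequence($uris, ($batch - 1) * 1000 + 1, 1000)
  where fn:exists($batch-uris)
  return
    xdmp:spawn("/tasks/mark-processing.xqy",
      (xs:QName("uri-list"), fn:string-join($batch-uris, "|")))

Each spawn runs as its own transaction, which is how we end up with the 50 transactions per run mentioned in the original mail below.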
Now I'm seeing this in the system logs, which I think may be related (not sure if it is a symptom or the root cause):

Jan 31 17:41:46 ml-c1-u3 kernel: [17467686.201893] TCP: Possible SYN flooding on port 7999. Sending cookies.

7999 is the defined bind port for the 3 machines of our cluster, and after checking with netstat, only the other 2 are trying to connect on that port.

Any other tips?

Regards,
Miguel

> Which OS is this?
>
> The "Hung..." messages mean that the OS was not letting MarkLogic do
> anything for N seconds. Sometimes that means memory is stressed by either
> swapping or fragmentation. Sometimes it means disk I/O capacity is
> overloaded. Hardware problems are also a possibility. Look into these areas.
>
> If you don't have file-log-level=debug already, set that. It's in the group
> settings in the admin UI. You may see some interesting new information.
>
> The "Hung..." messages fit nicely into the erratic load. If the database on
> one host is blocked by the OS, the other two hosts will have to wait until
> it comes back before advancing timestamps. So any updates will have to wait
> for that host to come back. Queries that need results from forests on the
> blocked host will have to wait, too.
>
> You don't have to worry about the config files differing from host to host
> within a cluster. The cluster takes care of that.
>
> The CPF setup sounds odd to me. Normally you'd let CPF manage the state, and
> wouldn't need that scheduled task. I don't see how the scheduled task would
> reduce load, at least not over the long haul. Maybe that's the idea? You're
> trying to maintain insert performance and then run CPF in less busy times?
>
> -- Mike
>
> On 1 Feb 2013, at 05:28, Miguel Rodríguez González <mrgonzalez at nl.swets.com> wrote:
>
> > Hi all,
> >
> > we are using CPF for post-processing a set of documents that we load via
> > content-pump into a 3-node cluster (version 6.0-2). When we do, we
> > experience an uneven load on one of the servers (it hangs every now and
> > then, while the other 2 seem to be waiting for more work), and so far we
> > have not managed to get a grip on what could be wrong.
> >
> > In short, this is the process we follow:
> > - the ETL creates the XML files (around 40 million docs).
> > - content-pump pushes the documents into MarkLogic (10 threads with 100
> >   documents per transaction).
> > - a CPF pipeline adds some collections to the uploaded documents.
> >
> > These are the steps of the CPF pipeline:
> > - Creation or update of a document changes the document status to
> >   "unprocessed". This is saved in a document property.
> > - A scheduled task picks up batches of 50k documents and changes the state
> >   to "processing" every 2 minutes (here we spawn 50 batches of 1k documents
> >   to have 50 transactions).
> >   * we opted for using a scheduled task instead of relying solely on CPF,
> >     because the servers were choking on the high volume.
> > - The state change triggers CPF (on-state-change event) and the document
> >   receives its collections after a query.
> > - Once the collections are set, the status is changed to "done".
> >
> > We did verify that the 3 nodes have the same configuration. To do so, we
> > checked the following files:
> >
> > - assignments.xml
> > - clusters.xml
> > - databases.xml
> > - groups.xml
> > - hosts.xml
> > - server.xml (it has 2 obvious differences: the host-id and the SSL
> >   private key)
> >
> > The only difference between the 3 of them is the memory. These are the
> > specs:
> > - CPU: 2x X5650, 6 cores each, 12 cores in total
> > - MEM: 48 GB (64 GB on the third one)
> > - DISK: 6x 600 GB 15K in a RAID 10 config
> >
> > Attached to this email there are 6 pictures, which clearly show the
> > problem we are facing:
> > - System load (5-minute average) for each of the 3 nodes
> > - CPU usage on a 100% scale, again for the 3 boxes
> >
> > On the 3rd machine we see these warnings every time the CPU is being
> > hogged (ErrorLog.txt):
> >
> > 2013-02-01 00:02:01.327 Warning: Hung 65 sec
> > 2013-02-01 00:03:19.243 Warning: Hung 54 sec
> > 2013-02-01 00:04:00.802 Warning: Hung 41 sec
> > 2013-02-01 00:06:40.061 Warning: Hung 130 sec
> >
> > And some lost connections / timeouts on the other 2 machines of the
> > cluster:
> >
> > 2013-02-01 00:01:08.567 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> > 2013-02-01 00:02:54.634 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> > 2013-02-01 00:03:50.673 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> > 2013-02-01 00:05:01.473 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> >
> > Could you please provide advice?
> >
> > Miguel Rodríguez
> > Lead Developer
> > E mrgonzalez at nl.swets.com
> > I www.swets.com
> >
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://developer.marklogic.com/mailman/listinfo/general

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
