Hi Mike,

Thanks for your feedback, and sorry for taking so long to reply; somehow the email did not make it to my mailbox (corporate spam filter, I suppose). I'm copy-pasting your response, so let's hope it does not break the threading.
The OS is SUSE Linux Enterprise Server 11 SP2 (x86_64), and we did not see any hardware-related problems on the machine (I/O, network, ...).

We took your advice and increased the log level, and indeed there were quite a few things to work with:

Forest::insert: SwocUxOnlineContent-06 XDMP-INMMLISTFULL: In-memory list storage full; list: table=78%, wordsused=76%, wordsfree=0%, overhead=24%; tree: table=56%, wordsused=53%, wordsfree=47%, overhead=1%

And after that one, several client connection problems (No XDQP session):

2013-02-06 14:40:37.049 Debug: Stopping XDQPClientConnection, server=ml-c1-u2.swets.nl /data/everwisedata3/Forests/SwocUxOnlineContent-06/

We checked the documentation and bumped the in-memory-list-size from 512 to 1024. For around one hour the problem seemed to be solved, but after a while the hung messages reappeared.
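For reference, the same change can also be made with the Admin API; this is just a minimal sketch (the database name is a guess based on our forest names, and the value is in MB):

  xquery version "1.0-ml";

  import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

  (: Raise the in-memory list size from 512 to 1024 MB.
     "SwocUxOnlineContent" is a placeholder database name. :)
  let $config := admin:get-configuration()
  let $db     := xdmp:database("SwocUxOnlineContent")
  return admin:save-configuration(
    admin:database-set-in-memory-list-size($config, $db, 1024))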
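Since you mentioned the CPF setup sounds odd, here is roughly what our scheduled task does, in case it helps with the uneven-load question. This is a heavily simplified sketch; the module path, property name and delimiter are illustrative, and it assumes the URI lexicon is enabled:

  xquery version "1.0-ml";

  (: Pick up to 50k documents whose "status" property is "unprocessed"
     and spawn 50 batches of 1k URIs each. The spawned module flips the
     state property on each URI to "processing", which is what fires
     the CPF on-state-change event. :)
  let $uris :=
    cts:uris((), "limit=50000",
      cts:properties-query(
        cts:element-value-query(xs:QName("status"), "unprocessed")))
  for $batch in 1 to 50
  let $batch-uris := fn:subsequence($uris, ($batch - 1) * 1000 + 1, 1000)
  where fn:exists($batch-uris)
  return
    xdmp:spawn("/tasks/mark-processing.xqy",
      (xs:QName("uri-list"), fn:string-join($batch-uris, "|")))

Each spawn runs as its own transaction, which is how we end up with the 50 transactions per run mentioned in the original mail below.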
Now I'm seeing this in the system logs, which I think may be related (not sure if it is a symptom or the root cause):

Jan 31 17:41:46 ml-c1-u3 kernel: [17467686.201893] TCP: Possible SYN flooding on port 7999. Sending cookies.

7999 is the defined bind port for the 3 machines of our cluster, and after checking with netstat, only the other 2 are trying to connect on that port.

Any other tips?

Regards,
Miguel

> Which OS is this?
>
> The "Hung..." messages mean that the OS was not letting MarkLogic do
> anything for N seconds. Sometimes that means memory is stressed by either
> swapping or fragmentation. Sometimes it means disk I/O capacity is
> overloaded. Hardware problems are also a possibility. Look into these areas.
>
> If you don't have file-log-level=debug already, set that. It's in the group
> settings in the admin UI. You may see some interesting new information.
>
> The "Hung..." messages fit nicely into the erratic load. If the database on
> one host is blocked by the OS, the other two hosts will have to wait until
> it comes back before advancing timestamps. So any updates will have to wait
> for that host to come back. Queries that need results from forests on the
> blocked host will have to wait, too.
>
> You don't have to worry about the config files differing from host to host
> within a cluster. The cluster takes care of that.
>
> The CPF setup sounds odd to me. Normally you'd let CPF manage the state, and
> wouldn't need that scheduled task. I don't see how the scheduled task would
> reduce load, at least not over the long haul. Maybe that's the idea? You're
> trying to maintain insert performance and then run CPF in less busy times?
>
> -- Mike
>
> On 1 Feb 2013, at 05:28, Miguel Rodríguez González <mrgonzalez at nl.swets.com> wrote:
>
> > Hi all,
> >
> > we are using CPF for post-processing a set of documents that we load via
> > content-pump into a 3-node cluster (version 6.0-2). When we do, we
> > experience an uneven load on one of the servers (it hangs every now and
> > then, while the other 2 seem to be waiting for more work), and so far we
> > have not managed to get a grip on what could be wrong.
> >
> > In short, this is the process we follow:
> > - the ETL creates the XML files (around 40 million docs).
> > - content-pump pushes the documents into MarkLogic (10 threads with 100
> >   documents per transaction).
> > - a CPF pipeline adds some collections to the uploaded documents.
> >
> > These are the steps of the CPF pipeline:
> > - Creation or update of a document changes the document status to
> >   "unprocessed". This is saved in a document property.
> > - A scheduled task picks up batches of 50k documents and changes the state
> >   to "processing" every 2 minutes (here we spawn 50 batches of 1k documents
> >   to have 50 transactions).
> >   * we opted for using a scheduled task instead of relying solely on CPF,
> >     because the servers were choking on the high volume.
> > - The state change triggers CPF (on-state-change event) and the document
> >   receives its collections after a query.
> > - Once the collections are set, the status is changed to "done".
> >
> > We did verify that the 3 nodes have the same configuration. To do so, we
> > checked the following files:
> >
> > - assignments.xml
> > - clusters.xml
> > - databases.xml
> > - groups.xml
> > - hosts.xml
> > - server.xml (it has 2 obvious differences: the host-id and the SSL
> >   private key)
> >
> > The only difference between the 3 of them is the memory. These are the
> > specs:
> > - CPU: 2x X5650, 6 cores each, 12 cores in total
> > - MEM: 48 GB (64 GB on the third one)
> > - DISK: 6x 600 GB 15K in a RAID 10 config
> >
> > Attached to this email there are 6 pictures, which clearly show the
> > problem we are facing:
> > - System load (5-minute average) for each of the 3 nodes
> > - CPU usage on a 100% scale, again for the 3 boxes
> >
> > On the 3rd machine we see these warnings every time the CPU is being
> > hogged (ErrorLog.txt):
> >
> > 2013-02-01 00:02:01.327 Warning: Hung 65 sec
> > 2013-02-01 00:03:19.243 Warning: Hung 54 sec
> > 2013-02-01 00:04:00.802 Warning: Hung 41 sec
> > 2013-02-01 00:06:40.061 Warning: Hung 130 sec
> >
> > And some lost connections / timeouts on the other 2 machines of the
> > cluster:
> >
> > 2013-02-01 00:01:08.567 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> > 2013-02-01 00:02:54.634 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> > 2013-02-01 00:03:50.673 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> > 2013-02-01 00:05:01.473 Info: Disconnecting from domestic host ml-c1-u3.swets.nl in cluster 17671928148696225660 because it has not responded for 30 seconds.
> >
> > Could you please provide advice?
> >
> > Miguel Rodríguez
> > Lead Developer
> > E mrgonzalez at nl.swets.com
> > I www.swets.com
> >
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://developer.marklogic.com/mailman/listinfo/general

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
