Got it, thanks. I know you worked your way around it by reducing the size of the files. Another approach that might work is to load the CSV files as-is and do the splitting as a first step in the processing pipeline. There's a little housekeeping you'd need to do on the resulting documents to make sure they get propagated into the target database (see http://blog.davidcassel.net/2011/06/splitting-data-with-info-studio/).
--Colleen

________________________________________
From: [email protected] [[email protected]] On Behalf Of Steiner, David J. (LNG-DAY) [[email protected]]
Sent: Thursday, October 25, 2012 5:27 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] info studio using CPU

Colleen, it didn't. :-( I'll send you a separate note...

David

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Colleen Whitney
Sent: Wednesday, October 24, 2012 12:22 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] info studio using CPU

David,

Thanks, it would be good to know if reducing the number of docs per transaction solves the problem. If not, I can file a bug on your behalf if you like, but I think it might make more sense for you to open a support ticket so that engineering staff can reproduce and address the problem systematically.

--Colleen

________________________________________
From: [email protected] [[email protected]] On Behalf Of Steiner, David J. (LNG-DAY) [[email protected]]
Sent: Wednesday, October 24, 2012 9:24 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] info studio using CPU

Colleen,

OK. I will try changing the # of docs setting once I clean up the latest errors.

As a note, I think there is an issue when one is using a collector like a CSV collector that makes a bigger doc from the input - perhaps bigger than 64MB. I'm getting these messages in the ErrorLog.txt, but no error is appearing in Info Studio. The process in Info Studio just keeps running. I have to remove the "ticket" docs in App Services just to get control of the Flow back.

2012-10-24 09:29:49.285 Notice: TaskServer: XDMP-EXPNTREECACHEFULL: fn:doc("/14974109146499330104/13438170693114125278//csv/filename_84.x...") -- Expanded tree cache full on host ilabsmltest.legal.regn.net
2012-10-24 09:29:49.285 Notice: TaskServer: $e = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-EXPNTREECACHEFULL</error:code><error:name/><err...</error:error>
2012-10-24 09:29:50.019 Notice: TaskServer: XDMP-EXPNTREECACHEFULL: fn:doc("/14974109146499330104/13438170693114125278//csv/filename_85.x...") -- Expanded tree cache full on host ilabsmltest.legal.regn.net
2012-10-24 09:29:50.019 Notice: TaskServer: $e = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-EXPNTREECACHEFULL</error:code><error:name/><err...</error:error>
2012-10-24 10:08:18.411 Notice: TaskServer: XDMP-EXPNTREECACHEFULL: fn:doc("/14974109146499330104/13438170693114125278//csv/filename_91.x...") -- Expanded tree cache full on host ilabsmltest.legal.regn.net
2012-10-24 10:08:18.411 Notice: TaskServer: $e = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-EXPNTREECACHEFULL</error:code><error:name/><err...</error:error>
2012-10-24 10:08:20.820 Notice: TaskServer: XDMP-EXPNTREECACHEFULL: fn:doc("/14974109146499330104/13438170693114125278//csv/filename_79.x...") -- Expanded tree cache full on host ilabsmltest.legal.regn.net
2012-10-24 10:08:20.820 Notice: TaskServer: $e = <error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:error="http://marklogic.com/xdmp/error"><error:code>XDMP-EXPNTREECACHEFULL</error:code><error:name/><err...</error:error>

David

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Colleen Whitney
Sent: Wednesday, October 24, 2012 9:51 AM
To: MarkLogic Developer Discussion
Cc: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] info studio using CPU

David, this might sound counter-intuitive, but if you set the number of documents per transaction to something small (5, or even 1), you should be able to avoid the tree cache full error, and just point and press start instead of fiddling with directories. I think it's trying to work with too many large documents in memory at once.

Sent from my iPhone

On Oct 24, 2012, at 6:20 AM, "Steiner, David J. (LNG-DAY)" <[email protected]> wrote:

> Hi Geert,
>
> Thanks, but this really doesn't do anything for my current problem, which is that I have 814 50MB files of CSV data to process. At the moment, if I put too many files in the load directory, I get 'expanded tree cache' errors and the info studio process seems to be left in a state where it is unable to complete - hitting the stop button does nothing. Apparently, with 20 files, I don't get the error, while with 25 I do. (Incidentally, I actually have to clean up all of the "ticket" stuff from the App Services DB just to get the Flow to be usable again.)
>
> Processing 20 files at a time is a little less than optimal, since I'd like to just point at the directory with 814 files and let it go until it is done.
>
> The collector and transformer are doing what I want (collector transforms CSV to XML and transform reads CSV-XML and sticks a naked property into an appropriate DB for every row in the CSV-XML, then at the end, the CSV-XML document is written into the DB specified in info studio).
> I don't particularly think it will go faster if I write out my naked properties to the Fab DB and let info studio move them to the DB specified in the info studio setting (and actually, even if info studio would do that, I'd have to instead write the XML CSV documents to some other DB because their structure is different from the naked properties DB).
>
> Thanks,
> David
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
> Sent: Wednesday, October 24, 2012 3:48 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] info studio using CPU
>
> Hi David,
>
> Thought you might be interested in this blog item (and its comments)...
>
> http://blog.davidcassel.net/2011/06/splitting-data-with-info-studio/
>
> Kind regards,
> Geert
>
>> -----Original Message-----
>> From: [email protected] [mailto:general-[email protected]] On Behalf Of Steiner, David J. (LNG-DAY)
>> Sent: Tuesday, October 23, 2012 5:30 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] info studio using CPU
>>
>> Doesn't appear that the OS is swapping.
>>
>> It appears that there are 16 task server threads.
>>
>> Upon further "watching", it appears that just the collector may not utilize threads? It appears that once the transforming starts, all CPUs become engaged.
>>
>> David
>>
>> -----Original Message-----
>> From: [email protected] [mailto:general-[email protected]] On Behalf Of Michael Blakeley
>> Sent: Tuesday, October 23, 2012 11:23 AM
>> To: MarkLogic Developer Discussion
>> Cc: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] info studio using CPU
>>
>> Check the OS metrics. If RAM is maxed out, does that mean the OS is swapping? If so, it's the swap disk that is the bottleneck.
>>
>> If you can't find an OS bottleneck... How many task server threads are configured? I think the default is 4.
>> Adding more threads won't help if the system is swapping or otherwise at its limits though.
>>
>> -- Mike
>>
>> On Oct 23, 2012, at 7:55, "Steiner, David J. (LNG-DAY)" <[email protected]> wrote:
>>
>>> Using ML 6.0-1.1.
>>>
>>> In Information Studio, I'm using a CSV collector, to process hundreds of CSV files. I'm also doing a transform to pull each row out of the CSV and write it as an individual document into another DB (actually, a naked property, but I don't think that matters).
>>>
>>> The files are all under 50MB (wasn't sure if that 64MB limit still existed).
>>>
>>> It seems like only one CPU is being used and we have 8 available. RAM (24GB) is maxed out. It took 72 minutes to process 20 files.
>>>
>>> Is Info Studio specifically not utilizing more CPU because all of the RAM is already being used?
>>>
>>> Ideally, I guess, I'd like for Info Studio to be able to take advantage of all CPUs while ingesting. I'm thinking the ingestion where CSV is being translated to XML is the intense part. The "splitting" out and "document" (property) insert shouldn't be as intense?
>>>
>>> Thanks,
>>> David
>>> _______________________________________________
>>> General mailing list
>>> [email protected]
>>> http://developer.marklogic.com/mailman/listinfo/general
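
Colleen's suggestion in this thread is to keep the number of documents per transaction small (5, or even 1) so that fewer large documents are held in memory at once. The sketch below only illustrates that batching idea in plain Python. It is not Information Studio's implementation and calls no MarkLogic API; the `batches` helper and the file names are hypothetical.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical names standing in for the 814 CSV files mentioned in the thread.
files = [f"filename_{i}.csv" for i in range(814)]

# With a batch size of 5, at most 5 documents are in flight per batch.
groups = list(batches(files, 5))
```

A smaller batch size trades a larger number of batches for a lower peak memory footprint, which is the trade-off the advice above relies on.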

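
The thread describes a transform that pulls each row out of a CSV and writes it as an individual document. As a rough illustration of that per-row split, here is a minimal Python sketch using only the standard library. It assumes the first CSV row holds column names that are valid XML element names; the `split_csv_rows` function and the `row` element name are made up for this example and are not part of Information Studio or the blog post linked in the thread.

```python
import csv
import io
import xml.etree.ElementTree as ET


def split_csv_rows(csv_text, root_name="row"):
    """Yield one small XML string per CSV data row.

    The header row supplies the element names (assumed to be valid XML names).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        row = ET.Element(root_name)
        for column, value in record.items():
            ET.SubElement(row, column).text = value
        yield ET.tostring(row, encoding="unicode")


sample = "id,name\n1,alpha\n2,beta\n"
docs = list(split_csv_rows(sample))
```

Because the function is a generator, rows are converted one at a time rather than building one large document for the whole file first.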