Got it. Can you increase the heap space on your batch stub? That should take care of it.
Cheers,
Chris

P.S. Great work!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Zichuan Wang <[email protected]>
Date: Wednesday, November 5, 2014 at 11:12 PM
To: Chris Mattmann <[email protected]>
Cc: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>, Luke liu <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Subject: Re: re: Question about OODT file manager

>Dear Professor,
>
>I finally figured out how to trigger a post ingest event. However when I
>try to crawl the whole dataset, I got an OutOfMemory Error. Could you
>please take a look and maybe give some suggestions?
>
>➜ bin ./crawler_launcher \
>--operation --launchAutoCrawler \
>--filemgrUrl http://localhost:9000 \
>--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
>--productPath /Users/zichuanwang/Downloads/output \
>--mimeExtractorRepo ../policy/mime-extractor-map.xml \
>--workflowMgrUrl http://localhost:9200 \
>-ais TriggerPostIngestWorkflow
>Setting property 'AutoDetectProductCrawler.mimeExtractorRepo'
>Setting property 'StdProductCrawler.clientTransferer'
>Setting property 'MetExtractorProductCrawler.clientTransferer'
>Setting property 'AutoDetectProductCrawler.clientTransferer'
>Setting property 'StdProductCrawler.filemgrUrl'
>Setting property 'MetExtractorProductCrawler.filemgrUrl'
>Setting property 'AutoDetectProductCrawler.filemgrUrl'
>Setting property 'TriggerPostIngestWorkflow.workflowMgrUrl'
>Setting property 'StdProductCrawler.actionIds'
>Setting property 'MetExtractorProductCrawler.actionIds'
>Setting property 'AutoDetectProductCrawler.actionIds'
>Setting property 'StdProductCrawler.productPath'
>Setting property 'MetExtractorProductCrawler.productPath'
>Setting property 'AutoDetectProductCrawler.productPath'
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'StdProductCrawler.productPath' set to value [/Users/zichuanwang/Downloads/output]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'TriggerPostIngestWorkflow.workflowMgrUrl' set to value [http://localhost:9200]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'AutoDetectProductCrawler.mimeExtractorRepo' set to value [../policy/mime-extractor-map.xml]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'MetExtractorProductCrawler.clientTransferer' set to value
>[org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'AutoDetectProductCrawler.filemgrUrl' set to value [http://localhost:9000]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'AutoDetectProductCrawler.clientTransferer' set to value [org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'MetExtractorProductCrawler.actionIds' set to value [TriggerPostIngestWorkflow]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'StdProductCrawler.actionIds' set to value [TriggerPostIngestWorkflow]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'StdProductCrawler.filemgrUrl' set to value [http://localhost:9000]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'AutoDetectProductCrawler.actionIds' set to value [TriggerPostIngestWorkflow]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'AutoDetectProductCrawler.productPath' set to value [/Users/zichuanwang/Downloads/output]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'MetExtractorProductCrawler.filemgrUrl' set to value [http://localhost:9000]
>Nov 5, 2014 10:07:47 PM org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'StdProductCrawler.clientTransferer' set to value [org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory]
>Nov 5, 2014 10:07:47 PM
>org.springframework.beans.factory.config.PropertyOverrideConfigurer processKey
>: Property 'MetExtractorProductCrawler.productPath' set to value [/Users/zichuanwang/Downloads/output]
>Nov 5, 2014 10:07:47 PM org.apache.oodt.cas.crawl.ProductCrawler crawl
>INFO: Crawling /Users/zichuanwang/Downloads/output
>Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>at java.io.UnixFileSystem.list(Native Method)
>at java.io.File.list(File.java:973)
>at java.io.File.listFiles(File.java:1129)
>at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:104)
>at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
>at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
>at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
>at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
>at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)
>
>--
>Zichuan Wang
>Department of Computer Science, USC
>
>
>On Wed, Nov 5, 2014 at 6:42 PM, Christian Alan Mattmann
><[email protected]> wrote:
>
>Thanks Luke, I’ve given you permissions so you should now see an
>“edit” button on that wiki page.
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department
>University of Southern California
>Los Angeles, CA 90089 USA
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>-----Original Message-----
>From: Luke liu <[email protected]>
>Date: Wednesday, November 5, 2014 at 6:48 PM
>To: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>
>Cc: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, 'Zichuan Wang' <[email protected]>
>Subject: RE: re: Question about OODT file manager
>
>>I just signed up on the wiki (i.e. https://cwiki.apache.org ) with the
>>following account details:
>> Account name: luke
>> Full Name: Shuai Liu (Luke)
>> Email: [email protected]
>> Password: *******
>>
>>But I am not sure where I can add my notes to the following web article,
>>which I had trouble with. I also tried to create a new article, but
>>failed to do it as I cannot find a place where I can edit. Does this have
>>something to do with my account, for which the "edit" or "comments"
>>actions are not visible?
>>https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
>>
>>Thanks
>>Luke
>>-----Original Message-----
>>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>Sent: Sunday, November 2, 2014 6:59 AM
>>To: Luke liu; [email protected]
>>Cc: 'Christian Alan Mattmann'; [email protected]; [email protected]; 'Zichuan Wang'
>>Subject: Re: re: Question about OODT file manager
>>
>>Yes Luke, making the instructions better would be much appreciated!
>>
>>If you have an account on the wiki please share it, else sign up for an
>>Apache OODT wiki account and please share it with me or anyone else on
>>dev@oodt and we’ll add you.
>>
>>Cheers,
>>Chris
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: [email protected]
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>-----Original Message-----
>>From: Luke liu <[email protected]>
>>Date: Sunday, November 2, 2014 at 1:32 AM
>>To: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>
>>Cc: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, 'Zichuan Wang' <[email protected]>
>>Subject: RE: re: Question about OODT file manager
>>
>>>Thanks Professor Mattmann, not running batch_stub was the main culprit,
>>>and there were some other issues such as missing jars. Sorry for
>>>not confirming this right away; my laptop was actually crashing, and I
>>>just had time to fix it. BTW, I was able to get the cas-pge example to
>>>work (even though I saw in the log that the workflow failed to pass the
>>>pre-condition, the combined file and some metadata files (i.e. 3 files)
>>>were still successfully ingested and placed in the output directory).
>>>
>>>BTW, I think there are a lot of mistakes in the documents. Do you want
>>>us to help correct the document (i.e.
>>>https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
>>>)?
>>>If possible, I would like to share my notes on some of the problem
>>>steps mentioned there.
>>>
>>>Anyway, thanks for your help; it is appreciated.
>>>
>>>Thanks
>>>Luke
>>>-----Original Message-----
>>>From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>>Sent: Saturday, November 1, 2014 10:48 AM
>>>To: Luke; [email protected]
>>>Cc: 'Christian Alan Mattmann'; [email protected]; [email protected]; 'Zichuan Wang'
>>>Subject: Re: re: Question about OODT file manager
>>>
>>>Dear Luke, just confirming, we solved this in class right? It had to do
>>>with the batch stub not being turned on.
>>>
>>>Cheers,
>>>Chris
>>>
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>Chris Mattmann, Ph.D.
>>>Chief Architect
>>>Instrument Software and Science Data Systems Section (398)
>>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>Office: 168-519, Mailstop: 168-527
>>>Email: [email protected]
>>>WWW: http://sunset.usc.edu/~mattmann/
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>Adjunct Associate Professor, Computer Science Department
>>>University of Southern California, Los Angeles, CA 90089 USA
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>-----Original Message-----
>>>From: Luke <[email protected]>
>>>Date: Tuesday, October 28, 2014 at 12:52 PM
>>>To: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>
>>>Cc: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, 'Zichuan Wang' <[email protected]>
>>>Subject: RE: re: Question about OODT file manager
>>>
>>>>Dear Professor Mattmann,
>>>>Thanks a lot Professor Mattmann for the kind help, it is appreciated,
>>>>sorry for getting back to you with my appreciation, I have been
>>>>conducting tests with OODT based on your advice, but unfortunately I
>>>>am having another problem....
>>>>
>>>>I am following the steps
>>>>(https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
>>>>) to get a sense of how to get workflow to work.
>>>>The problem is that the File-Concatenator-PGE (started by running the
>>>>wmgr-client command line) does not seem to be invoked or executed.
>>>>Instead, I am seeing the tasks stack up in the workflow manager with
>>>>status either "RSUBMIT" or "QUEUED", but they are not getting
>>>>executed; PFA workflow_monitor.jpg. Please note that by default the
>>>>workflow min pool size is 6, so here comes another problem: I have 6
>>>>submitted tasks with status RSUBMIT, but any new incoming tasks are
>>>>forwarded to the waiting queue with status "QUEUED". Please refer to
>>>>workflow_monitor.jpg for details, where I have 3 QUEUED workflow tasks
>>>>and 6 RSUBMIT tasks.
>>>>
>>>>Question 1): not sure why the workflow is not being executed and is
>>>>hanging at the state of "RSUBMIT". After enabling the log level, I am
>>>>seeing the following entry in the log; not sure if this has anything
>>>>to do with the "hanging" problem where the workflow is not getting
>>>>executed and hangs at the state of "RSUBMIT".
>>>> Oct 28, 2014 3:35:07 AM org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread safeCheckJobComplete
>>>> WARNING: Exception checking completion status for job: [2014-10-28T01:59:32.813-07:00]: Messsage: java.lang.Exception: java.lang.NullPointerException
>>>>
>>>>Question 2): I think currently on my side any new incoming workflow
>>>>task I am sending with the following command is being directed to the
>>>>waiting queue because of the min pool size (i.e. 6) (I can increase
>>>>this to a larger number though),
>>>> ./wmgr-client --url http://localhost:9200 --operation --sendEvent --eventName fileconcatenator-pge --metaData --key RunID testNumber1
>>>> If possible, I would like to know if there is a way we can purge
>>>>the queue and get rid of those workflow tasks, either in "RSUBMIT" or
>>>>"QUEUED", that I have already sent. Please kindly help.
>>>>
>>>>Very sorry for troubling you with this. To be honest, I find OODT a bit
>>>>challenging to grasp within a short time frame, probably because there
>>>>is no book like "OODT in Action", as there is for Solr.... What I am
>>>>doing is just trial and error blended with guessing, but I don’t want
>>>>to make a blind guess. It will be appreciated if you can please also
>>>>shed some light on where I can get more logging information or other
>>>>ways I can troubleshoot. I think it might be worth tracking what is
>>>>happening when a workflow reaches the status "RSUBMIT" and how to get
>>>>logging info specific to it...
>>>>
>>>>Again, your advice and kind help will be appreciated as usual.
>>>>
>>>>
>>>>Thanks
>>>>Luke
>>>>
>>>>> -----Original Message-----
>>>>> From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>>>> Sent: October 26, 2014 22:18
>>>>> To: Luke; 'Zichuan Wang'
>>>>> Cc: 'Christian Alan Mattmann'; [email protected]; [email protected]; [email protected]
>>>>> Subject: Re: re: Question about OODT file manager
>>>>>
>>>>> Hi Luke,
>>>>>
>>>>> Thanks and sorry it’s taken me a while to reply. Here are some
>>>>> details below:
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Luke <[email protected]>
>>>>> Date: Sunday, October 26, 2014 at 6:19 PM
>>>>> To: Chris Mattmann <[email protected]>, 'Zichuan Wang' <[email protected]>
>>>>> Cc: Chris Mattmann <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
>>>>> Subject: RE: re: Question about OODT file manager
>>>>>
>>>>> >Hi Professor Mattmann and OODT DEV,
>>>>> >
>>>>> >Sorry to trouble you with this email; our team has been struggling
>>>>> >in OODT to send json files to solr.
>>>>> >One of the difficulties is still getting the OODT workflow to call
>>>>> >poster.py in etllib.
>>>>>
>>>>> Sorry that you’re having difficulty; let me try and help.
>>>>>
>>>>> >
>>>>> >I am not sure if my understanding of the OODT requirement is correct;
>>>>> >I hope you can please kindly advise and help with our confusion.
>>>>> >
>>>>> >A set of goals in my mind with OODT is as follows, please kindly
>>>>> >confirm and clarify:
>>>>> >
>>>>> >1)
>>>>> >Get the File-Manager up and running.
>>>>>
>>>>> Yep, hopefully as installed via OODT RADIX.
>>>>>
>>>>> >2)
>>>>> >send all json files with command wmgr-client to the fileManager server.
>>>>> >(I believe we can achieve it with a bash script or probably python
>>>>> >that calls the command line sequentially with each json file name
>>>>> >as an argument?!)
>>>>>
>>>>> Suggestion:
>>>>>
>>>>> 1. Use the OODT crawler and file manager to crawl/index the JSON
>>>>> files (in place data transfer).
>>>>> 2. Take a look at CAS-PGE, it will help you write a workflow task
>>>>> that will wrap ETLlib and the poster command.
>>>>> 3. Once you are confident with #2, whip up a script that pages
>>>>> through all of your indexed JSON files, and then for each one,
>>>>> submits a workflow event (you may need to look into aggregating
>>>>> them) that calls your CAS-PGE wrapped poster task from ETLlib.
>>>>>
>>>>> >3)
>>>>> >Once we have json files sent and stored in the File-Manager, we
>>>>> >need to get workflow-manager up and running, and we can create a
>>>>> >workflow that sends those json files from the file manager to solr.
>>>>>
>>>>> See above.
>>>>>
>>>>> >4)
>>>>> >Create a workflow according to the Workflow2 User Guide
>>>>> ><https://cwiki.apache.org/confluence/display/OODT/Workflow2+User+Guide>
>>>>> >
>>>>> >>>>>> here comes the problem…..
>>>>> > I am not sure how to create a workflow task which can call
>>>>> >the poster.py in python etllib. It looks like we need to create our
>>>>> >own java class that extends <TaskInstance>, which is an abstract Java
>>>>> >class with one abstract method that has the following signature:
>>>>> >
>>>>> >
>>>>> >protected abstract ResultsState performExecution(ControlMetadata crtlMetadata);
>>>>> > However, the detail of where to find the corresponding
>>>>> >libs and where to put our implementation in workflow manager is
>>>>> >neglected in that page. I am not sure if we should use
>>>>> >TaskInstance, but it seems the workflow has to have an interface
>>>>> >through which it can call the python code, i.e. poster.py, and it
>>>>> >looks like we need to implement TaskInstance::performExecution by
>>>>> >injecting the code that calls poster.py and returns the resultState.
>>>>> >
>>>>> >
>>>>> >It would be greatly appreciated if you could please shed some
>>>>> >light and advise how we can get a task instance to call
>>>>> >poster.py. BTW, I am also not sure if my understanding is correct;
>>>>> >please kindly correct it if inappropriate. Your help will be
>>>>> >appreciated as usual.
>>>>> >
>>>>> >
>>>>> >Thanks
>>>>> >Luke
>>>>>
>>>>> Thanks Luke, see above. Let me know if it helps.
>>>>>
>>>>> Cheers!
>>>>>
>>>>> Chris
>>>>>
>>>>> >
>>>>> >From: Mattmann, Chris A (3980) [mailto:[email protected]]
>>>>> >Sent: October 25, 2014 13:34
>>>>> >To: Zichuan Wang
>>>>> >Cc: Christian Alan Mattmann; Luke; [email protected]; [email protected]
>>>>> >Subject: Re: re: Question about OODT file manager
>>>>> >
>>>>> >
>>>>> >Please cc
>>>>> >[email protected] <mailto:[email protected]> I will reply in detail
>>>>> >soon
>>>>> >
>>>>> >Sent from my iPhone
>>>>>
>>>>>
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: [email protected]
>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>>>>> >
>>>>> >On Oct 25, 2014, at 1:26 PM, "Zichuan Wang" <[email protected]> wrote:
>>>>> >
>>>>> >
>>>>> >Dear Professor,
>>>>> >
>>>>> >Could you please also explain how I can crawl all JSON file names
>>>>> >under a specific directory using CAS-PGE? I’ll work through this example
>>>>> >https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example,
>>>>> >but it doesn’t mention anything about crawling; instead it
>>>>> >manually sets the input file paths...
>>>>> >
>>>>> >--
>>>>> >
>>>>> >Zichuan Wang
>>>>> >
>>>>> >University of Southern California, Department of Computer Science
>>>>> >
>>>>> >
>>>>> >On Saturday, October 25, 2014, at 12:10 PM, Zichuan Wang wrote:
>>>>> >
>>>>> >Dear Professor,
>>>>> >
>>>>> >In assignment 2 specification I noticed that you mentioned OODT
>>>>> >File Manager, but from my understanding, we are using ETLLib poster
>>>>> >which talks directly to Solr. So how can we use OODT File Manager
>>>>> >in this assignment?
>>>>> >
>>>>> >--
>>>>> >
>>>>> >Zichuan Wang
>>>>> >
>>>>> >University of Southern California, Department of Computer Science
>>>>> >
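
For reference, the per-file script suggested earlier in the thread (go through the JSON files and send one workflow event for each) could be sketched roughly as follows. This is only a sketch under stated assumptions, not a tested OODT recipe: it walks a local directory instead of paging through the file manager catalog, the metadata key "InputFile" is a hypothetical name, and the wmgr-client flags are only the ones quoted in this thread. The directory is read lazily with os.scandir so the full listing is never held in memory at once.

```python
import os
import subprocess
from typing import Iterator, List


def iter_json_files(root: str) -> Iterator[str]:
    """Lazily yield paths of *.json files under root, recursing into subdirectories."""
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_json_files(entry.path)
            elif entry.is_file() and entry.name.endswith(".json"):
                yield entry.path


def build_send_event_cmd(wmgr_client: str, wmgr_url: str, event_name: str,
                         key: str, value: str) -> List[str]:
    """Build a wmgr-client sendEvent call using only the flags quoted in this thread."""
    return [wmgr_client, "--url", wmgr_url,
            "--operation", "--sendEvent",
            "--eventName", event_name,
            "--metaData", "--key", key, value]


def submit_all(root: str, event_name: str,
               wmgr_client: str = "./wmgr-client",
               wmgr_url: str = "http://localhost:9200",
               key: str = "InputFile",  # hypothetical metadata key, not from OODT docs
               dry_run: bool = True) -> Iterator[List[str]]:
    """Yield one command per JSON file; run each command only when dry_run is False."""
    for path in iter_json_files(root):
        cmd = build_send_event_cmd(wmgr_client, wmgr_url, event_name, key, path)
        if not dry_run:
            # Assumes wmgr-client is on this relative path and the
            # workflow manager is listening at wmgr_url.
            subprocess.run(cmd, check=True)
        yield cmd
```

With the default dry_run=True, looping over submit_all("/Users/zichuanwang/Downloads/output", "fileconcatenator-pge") and printing " ".join(cmd) shows the commands without sending anything, which is a cheap way to check them before pointing the script at a running workflow manager.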
