Hi Val,

I don't think you need to run a CAS-PGE task to call crawler_launcher. If you define blocks in the <output>..</output> section of the XML file, a crawler will be forked in the job working directory of CAS-PGE and crawl your specified output.
I believe that will accomplish what you are looking for. No need to have crawling be a separate task from CAS-PGE - CAS-PGE will do the crawling for you! :)

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]

-----Original Message-----
From: "Verma, Rishi (398J)" <[email protected]>
Reply-To: <[email protected]>
Date: Thursday, October 9, 2014 at 2:44 AM
To: "[email protected]" <[email protected]>
Subject: Re: what is batch stub? Is it necessary?

>Hi Val,
>
>Yep - here's a link to the tasks.xml file:
>https://github.com/riverma/xdata-jpl-netscan/blob/master/oodt-netscan/workflow/src/main/resources/policy/tasks.xml
>
>> The problem is that the ExternScriptTaskInstance is unable to recognize the command line arguments that I want to pass to the crawler_launcher script.
>
>Hmm.. could you share your workflow manager log, or better yet, the batch_stub output? Curious to see what error is thrown.
>
>Is a script file being generated for your PGE? For example, inside your [PGE_HOME] directory, and within the particular job directory created for your execution of a workflow, you will see some files starting with "sciPgeExeScript_". You'll find one for your pgeConfig, and you can check to see what the PGE commands actually translate into, with respect to a shell script format. If that file is there, take a look at it, and validate whether the command works within the script (i.e. copy/paste and run the crawler command manually).
>
>Another suggestion is to take a step back, and build up slowly, i.e.:
>1. Do an "echo" command within your PGE first. (e.g. <cmd>echo "Hello APL." > /tmp/test.txt</cmd>)
>2. If above works, do a crawler_launcher empty command (e.g. <cmd>/path/to/oodt/crawler/bin/crawler_launcher</cmd>) and verify the batch_stub or Workflow Manager prints some kind of output when you run the workflow.
>3. Build up your crawler_launcher command piece by piece to see where it is failing
>
>Thanks,
>Rishi
>
>On Oct 8, 2014, at 4:24 PM, Mallder, Valerie <[email protected]> wrote:
>
>> Hi Rishi,
>>
>> Thank you very much for pointing me to your working example. This is very helpful. My pgeConfig looks very similar to yours. So, I commented out the resource manager like you suggested and tried running again without the resource manager. And my problem still exists. The problem is that the ExternScriptTaskInstance is unable to recognize the command line arguments that I want to pass to the crawler_launcher script. Could you send me a link to your tasks.xml file? I'm curious as to how you defined your task. My pgeConfig and tasks.xml are below.
>>
>> Thanks!
>> Val
>>
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <pgeConfig>
>>
>>   <!-- How to run the PGE -->
>>   <exe dir="[JobDir]" shell="/bin/sh" envReplace="true">
>>     <cmd>[CRAWLER_HOME]/bin/crawler_launcher --operation --launchAutoCrawler \
>>       --filemgrUrl [FILEMGR_URL] \
>>       --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
>>       --productPath [JobInputDir] \
>>       --mimeExtractorRepo [OODT_HOME]/extensions/policy/mime-extractor-map.xml \
>>       --actionIds MoveFileToLevel0Dir</cmd>
>>   </exe>
>>
>>   <!-- Files to ingest -->
>>   <output>
>>   </output>
>>
>>   <!-- Custom metadata to add to output files -->
>>   <customMetadata>
>>     <metadata key="JobDir" val="[OODT_HOME]"/>
>>     <metadata key="JobInputDir" val="[FEI_DROP_DIR]"/>
>>     <metadata key="JobOutputDir" val="[JobDir]/data/pge/jobs"/>
>>     <metadata key="JobLogDir" val="[JobDir]/data/pge/logs"/>
>>   </customMetadata>
>>
>> </pgeConfig>
>>
>>
>>
>> <!-- tasks.xml **************************************************-->
>>
>> <cas:tasks xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
>>
>>   <task id="urn:oodt:crawlerLauncherId" name="crawlerLauncherName" class="org.apache.oodt.cas.workflow.examples.ExternScriptTaskInstance">
>>     <conditions/> <!-- There are no pre execution conditions right now -->
>>     <configuration>
>>
>>       <property name="ShellType" value="/bin/sh" />
>>       <property name="PathToScript" value="[CRAWLER_HOME]/bin/crawler_launcher" envReplace="true" />
>>
>>       <property name="PGETask_Name" value="crawler_launcher PGE Task"/>
>>       <property name="PGETask_ConfigFilePath" value="[OODT_HOME]/extensions/config/crawler-pge-config.xml" envReplace="true" />
>>     </configuration>
>>   </task>
>>
>> </cas:tasks>
>>
>> Valerie A. Mallder
>> New Horizons Deputy Mission System Engineer
>> Johns Hopkins University/Applied Physics Laboratory
>>
>>
>>> -----Original Message-----
>>> From: Verma, Rishi (398J) [mailto:[email protected]]
>>> Sent: Wednesday, October 08, 2014 6:01 PM
>>> To: [email protected]
>>> Subject: Re: what is batch stub? Is it necessary?
>>>
>>> Hi Valerie,
>>>
>>>>>>> All I am trying to do is run "crawler_launcher" as a workflow task in the CAS PGE environment.
>>>
>>> Interesting. I have a working example here [1] you can look at that does this exact thing.
>>>
>>>>>>> So, if "batchstub" is necessary in this scenario, please tell me what it is, why it is necessary, and how to run it (please provide exact syntax to put in my startup shell script, because I would never be able to figure it out for myself and I don't want to have to bother everyone again.)
>>>
>>> Batchstub is only necessary if your Workflow Manager is sending jobs to Resource Manager for execution (where the default execution is to run the job in something called a "batch stub" executable). Think of batch stubs as a small wrapper program that takes a bundle of executable instructions from Resource Manager, and executes them in a shell environment within a given remote (or local) machine.
>>>
>>> Here's my suggestion:
>>> 1. Like Paul suggested, go to $OODT_HOME/resmgr/bin, and execute the following command (it'll start a batch stub in a terminal on port 2001):
>>>> ./batch_stub 2001
>>>
>>> If the above step doesn't fix your problem, you can also try having Workflow Manager NOT send jobs to Resource Manager for execution, and instead execute jobs locally through Workflow Manager itself (on localhost only!). To disable job transfer to Resource Manager, you'll need to modify the Workflow Manager properties file ($OODT_HOME/wmgr/etc/workflow.properties), and specifically comment out the "org.apache.oodt.cas.workflow.engine.resourcemgr.url" line. I've done this in my example code below, see [2] for an exact example of this. After modifying workflow.properties, make sure to restart workflow manager ($OODT_HOME/wmgr/bin/wmgr stop followed by $OODT_HOME/wmgr/bin/wmgr start).
>>>
>>> Thanks,
>>> Rishi
>>>
>>> [1] https://github.com/riverma/xdata-jpl-netscan/blob/master/oodt-netscan/pge/src/main/resources/policy/netscan-getipv4entriesrandomsample.xml
>>> [2] https://github.com/riverma/xdata-jpl-netscan/blob/master/oodt-netscan/workflow/src/main/resources/etc/workflow.properties
>>>
>>> On Oct 8, 2014, at 2:31 PM, Ramirez, Paul M (398J) <[email protected]> wrote:
>>>
>>>> Valerie,
>>>>
>>>> I would have thought it would have just not used a batch stub by default. That said if you go into the $OODT_HOME/resmgr/bin there should be a script to start a batch stub. Right now on my phone I forget the name of the script but if you more the file you will see the Java class name that corresponds to below. You should specify a port when you run the script which from the looks of the output below should be 2001.
>>>>
>>>> HTH,
>>>> Paul R
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Oct 8, 2014, at 2:04 PM, Mallder, Valerie <[email protected]> wrote:
>>>>>
>>>>> Well then, I'm proud to be a member :) (I think .... )
>>>>>
>>>>>
>>>>> Valerie A. Mallder
>>>>> New Horizons Deputy Mission System Engineer
>>>>> Johns Hopkins University/Applied Physics Laboratory
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Bruce Barkstrom [mailto:[email protected]]
>>>>>> Sent: Wednesday, October 08, 2014 4:54 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: what is batch stub? Is it necessary?
>>>>>>
>>>>>> You have every right to bother everyone.
>>>>>> You won't get what you need unless you do.
>>>>>>
>>>>>> You get one honorary membership in the Society of General Agitators - at the rank of Major Agitator.
>>>>>>
>>>>>> Bruce B.
>>>>>>
>>>>>> On Wed, Oct 8, 2014 at 4:49 PM, Mallder, Valerie <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am still having trouble getting my CAS PGE crawler task to run due to http://localhost:2001 being "down". I have spent the last 2 days tracing through the resource manager code and tracked this down to line 146 of LRUScheduler where the XmlRpcBatchMgr is failing to execute the task remotely, because line 75 of XmlRpcBatchMgrProxy (that was instantiated by XmlRpcBatchMgr on its line 74) is trying to call "isAlive" on the webservice named "batchstub" which, to my knowledge, is not running because I have not done anything explicitly to run it.
>>>>>>>
>>>>>>> All I am trying to do is run "crawler_launcher" as a workflow task in the CAS PGE environment. I had it running perfectly before I started trying to make it run as part of a workflow. I really miss my crawler and really want it to run again :(
>>>>>>>
>>>>>>> So, if "batchstub" is necessary in this scenario, please tell me what it is, why it is necessary, and how to run it (please provide exact syntax to put in my startup shell script, because I would never be able to figure it out for myself and I don't want to have to bother everyone again.)
>>>>>>>
>>>>>>> Thanks so much!
>>>>>>>
>>>>>>> Val
>>>>>>>
>>>>>>>
>>>>>>> Valerie A. Mallder
>>>>>>>
>>>>>>> New Horizons Deputy Mission System Engineer
>>>>>>> The Johns Hopkins University/Applied Physics Laboratory
>>>>>>> 11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
>>>>>>> 240-228-7846 (Office) 410-504-2233 (Blackberry)
>>>>>>>
>>>>>>>
>>>
>>> ---
>>> Rishi Verma
>>> NASA Jet Propulsion Laboratory
>>> California Institute of Technology
>>
>
>---
>Rishi Verma
>NASA Jet Propulsion Laboratory
>California Institute of Technology
>
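
The 'http://localhost:2001 being "down"' symptom from the first message in this thread can be checked quickly before tracing through Resource Manager code. Below is a minimal Python sketch, assuming the batch stub is expected on localhost port 2001 as in the messages above. The helper name is_port_open is illustrative and not part of OODT. Note that a successful TCP connection only shows that something is listening on the port; it does not prove the listener is a batch stub that answers "isAlive".

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 2001 is the batch stub port mentioned in the thread (assumption:
    # the stub was started with "./batch_stub 2001" on this machine).
    state = "reachable" if is_port_open("localhost", 2001) else "not reachable"
    print(f"localhost:2001 is {state}")
```

If the port is not reachable, starting the batch stub as Rishi describes (or disabling job transfer to Resource Manager) is the next thing to try.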

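
Rishi's alternative, commenting out the "org.apache.oodt.cas.workflow.engine.resourcemgr.url" line in workflow.properties, can also be scripted. The following is a rough Python sketch, assuming the property is written in plain key=value form on a single line. The function name comment_out_property and the sample URL value are hypothetical and used only for illustration.

```python
RESMGR_KEY = "org.apache.oodt.cas.workflow.engine.resourcemgr.url"


def comment_out_property(text: str, key: str) -> str:
    """Prefix '#' to every active 'key=value' line that sets `key`."""
    out = []
    for line in text.splitlines(keepends=True):
        # The name is everything before the first '=', ignoring whitespace.
        name = line.lstrip().split("=", 1)[0].strip()
        # Lines that are already commented have a name starting with '#',
        # so they never match and are left untouched.
        out.append("#" + line if name == key else line)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical file content; the URL value is a placeholder.
    sample = (
        "some.other.property=1\n"
        + RESMGR_KEY + "=http://localhost:9002\n"
    )
    print(comment_out_property(sample, RESMGR_KEY), end="")
```

After editing the real file this way, Workflow Manager still needs the restart described in the thread before the change takes effect.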