Hi Gautham,

Please try calling FileTokenNameMetExtractor from the Crawler
(MetExtractorProductCrawler).
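
Roughly, from the crawler bin directory, something like the following
should do it (flag names are from memory, and I believe the extractor
class is org.apache.oodt.cas.metadata.extractors.FilenameTokenMetExtractor,
so double-check both against ./crawler_launcher -h and the cas-metadata
jar for your OODT version):

./crawler_launcher --operation --launchMetCrawler \
  --filemgrUrl http://localhost:9000 \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
  --productPath /path/to/your/staging/area \
  --metExtractor org.apache.oodt.cas.metadata.extractors.FilenameTokenMetExtractor \
  --metExtractorConfig /path/to/your/config.xml

That way the config path is handed to the extractor by the crawler, so
nothing extra should be needed in filemgr.properties for it.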

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Gautham Gowrishankar <[email protected]>
Date: Monday, November 3, 2014 at 6:11 PM
To: Chris Mattmann <[email protected]>
Subject: Re: Regarding Assignment 2

>Hello Professor,
>
>
>I was able to figure out how to configure the config.xml file for
>FileTokenNameMetExtractor, so how should I include the config path?
>
>
>Should this be set in the filemanager.properties file, or as a cas external
>file extractor as below?
>
>
>Inside filemanager.properties as
>
>
><extractor class="org......FileNameTokenMetExtractor">
>
>
></extractor>
>
>
>
>
>or
>========================================================
><?xml version="1.0" encoding="UTF-8"?>
><cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
>  <exec workingDir="">
>    <extractorBinPath envReplace="true">[PWD]/extractor</extractorBinPath>
>      <args>
>         <arg isDataFile="true"/>
>         <arg isPath="true">/usr/local/etc/testExtractor.config</arg>
>      </args>
>   </exec>
></cas:externextractor>
>
>
>
>
>Also let us know how the path for the config.xml file should be
>configured.
>
>
>
>
>Regards
>Gautham
>
>
>
>
>On Sun, Nov 2, 2014 at 9:27 PM, Christian Alan Mattmann
><[email protected]> wrote:
>
>Yes a blog and better yet a wiki post on the OODT wiki
>would be much appreciated! :-)
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department
>University of Southern California
>Los Angeles, CA 90089 USA
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>-----Original Message-----
>From: Gautham Gowrishankar <[email protected]>
>Date: Sunday, November 2, 2014 at 9:51 PM
>To: Chris Mattmann <[email protected]>
>Subject: Re: Regarding Assignment 2
>
>>Professor,
>>
>>I would look into that right now.
>>
>>
>>I would probably write a blog post on this and send it to you.
>>There is so much information, yet it is very hard to find it all in a
>>single place w.r.t. OODT; that is what makes it so hard :)
>>
>>
>>Regards
>>Gautham
>>
>>
>>
>>On Sun, Nov 2, 2014 at 8:42 PM, Christian Alan Mattmann
>><[email protected]> wrote:
>>
>>Check out src/main/resources/examples in the metadata folder..
>>
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>Chris Mattmann, Ph.D.
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California
>>Los Angeles, CA 90089 USA
>>Email: [email protected]
>>WWW: http://sunset.usc.edu/~mattmann/
>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>-----Original Message-----
>>From: Gautham Gowrishankar <[email protected]>
>>Date: Sunday, November 2, 2014 at 4:39 PM
>>To: Chris Mattmann <[email protected]>
>>Subject: Re: Regarding Assignment 2
>>
>>>Hello Professor,
>>>
>>>
>>>I was trying to configure my FileTokenMetExtractor. Should it be
>>>configured as an external metadata extractor? I don't think so.
>>>
>>>
>>>
>>>
>>><?xml version="1.0" encoding="UTF-8"?>
>>><cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
>>>  <exec workingDir="">
>>>    <extractorBinPath envReplace="true">[PWD]/extractor</extractorBinPath>
>>>      <args>
>>>         <arg isDataFile="true"/>
>>>         <arg isPath="true">/usr/local/etc/testExtractor.config</arg>
>>>      </args>
>>>   </exec>
>>></cas:externextractor>
>>>Could you provide a link to an example showing how to configure
>>>new extractors and what argument names the file should be sent with?
>>>
>>>Regards
>>>Gautham
>>>
>>>
>>>On Sun, Nov 2, 2014 at 9:41 AM, Christian Alan Mattmann
>>><[email protected]> wrote:
>>>
>>>Hi Gautham,
>>>
>>>Thanks and sorry that it’s difficult. Yes, it’s one of the
>>>harder ones.
>>>
>>>As for the metadata, don’t worry about getting it perfect,
>>>just get going and then you can easily iterate (that’s the
>>>point of using OODT).
>>>
>>>Spanish doesn’t really matter (it’s per job type and the
>>>spanish fields are equivalent to the English ones). Also
>>>there is a program in ETLlib that may help you (translatejson).
>>>
>>>I told you how to do the InPlaceDataTransfer - use the
>>>data transferer and check the docs in file manager.
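>>>
>>>(Roughly: in etc/filemgr.properties, I believe the relevant line is the
>>>one below; property name and factory class are from memory, so verify
>>>against the properties file that ships with the File Manager.)
>>>
>>>filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory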
>>>
>>>Cheers,
>>>Chris
>>>
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>Chris Mattmann, Ph.D.
>>>Adjunct Associate Professor, Computer Science Department
>>>University of Southern California
>>>Los Angeles, CA 90089 USA
>>>Email: [email protected]
>>>WWW: http://sunset.usc.edu/~mattmann/
>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Gautham Gowrishankar <[email protected]>
>>>Date: Sunday, November 2, 2014 at 10:26 AM
>>>To: Chris Mattmann <[email protected]>
>>>Subject: Re: Regarding Assignment 2
>>>
>>>>Hello Professor,
>>>>
>>>>
>>>>This is actually a hard assignment; just trying to figure out what to
>>>>do is tough :(
>>>>
>>>>
>>>>I am really trying to think of what else we can add to the metadata
>>>>already present (language is one thing I can think of at the moment).
>>>>
>>>>
>>>>
>>>>
>>>>Another question: since the data is in Spanish, won't it be inconvenient
>>>>to query on such terms, and won't it return unwanted results, without
>>>>performing the actual translation?
>>>>
>>>>
>>>>Also, will disabling the path for the data archive in the File Manager
>>>>properties be enough to do the in-place data ingestion?
>>>>
>>>>
>>>>Regards
>>>>Gautham
>>>>
>>>>
>>>>On Sun, Nov 2, 2014 at 9:16 AM, Christian Alan Mattmann
>>>><[email protected]> wrote:
>>>>
>>>>Hi Gautham,
>>>>
>>>>Answers below:
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Gautham Gowrishankar <[email protected]>
>>>>Date: Sunday, November 2, 2014 at 10:13 AM
>>>>To: Chris Mattmann <[email protected]>
>>>>Subject: Re: Regarding Assignment 2
>>>>
>>>>>Hello Professor,
>>>>>
>>>>>
>>>>>As recommended, I have gone through a number of links apart from the one
>>>>>you suggested, and here is the conclusion I have drawn before I actually
>>>>>start implementing it today :P
>>>>>
>>>>>
>>>>>1. File Manager would extract metadata (id would be one, using the
>>>>>FileNameTokenMetaData extractor). Kindly suggest if I need to use
>>>>>anything else that would be necessary, like the Copy and Rewrite extractor.
>>>>
>>>>Yep, and other metadata too.
>>>>
>>>>>
>>>>>
>>>>>2. I guess, like you suggested in class, it would be nice to just do the
>>>>>above task in place without ingesting the actual files. Can this be done
>>>>>by disabling the path for the data archive in the File Manager properties?
>>>>
>>>>Use the InPlaceDataTransferer
>>>>
>>>>>
>>>>>
>>>>>3. Shell script to do the above task (extract metadata via the File
>>>>>Manager by iterating over all the files).
>>>>
>>>>Yep.
>>>>
>>>>>
>>>>>
>>>>>4. Write a CAS-PGE task to combine the extracted metadata with the JSON
>>>>>files and use poster.py to post it into solar.
>>>>
>>>>s/solar/Solr/
>>>>
>>>>Yep.
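>>>>
>>>>(If poster.py gives you trouble, plain curl against the Solr update
>>>>handler does the same job. A rough sketch, assuming the default core name
>>>>and that merged-jobs.json is a JSON array of the documents your PGE task
>>>>produced:
>>>>
>>>>curl 'http://localhost:8983/solr/collection1/update?commit=true' \
>>>>  -H 'Content-Type: application/json' \
>>>>  --data-binary @merged-jobs.json
>>>>
>>>>Either way the documents still have to match your Solr schema.)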
>>>>
>>>>>
>>>>>
>>>>>5. Start the Workflow Manager with the above events configured.
>>>>
>>>>Yep.
>>>>
>>>>>
>>>>>
>>>>>6. Pre-configure the Solr schema to recognize the above fields along with
>>>>>the id field.
>>>>
>>>>Yep.
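>>>>
>>>>(For the schema step, plain Solr field declarations in schema.xml are
>>>>enough to start with. The field names below are just placeholders for
>>>>whatever you actually extract, and the "date" type assumes the stock
>>>>TrieDateField definition from the example schema:
>>>>
>>>><field name="id" type="string" indexed="true" stored="true" required="true"/>
>>>><field name="JobType" type="string" indexed="true" stored="true"/>
>>>><field name="Location" type="string" indexed="true" stored="true"/>
>>>><field name="PostedDate" type="date" indexed="true" stored="true"/>
>>>>)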
>>>>
>>>>>
>>>>>
>>>>>7. Write function queries to test the above.
>>>>
>>>>Yep.
>>>>
>>>>>===============================================================
>>>>>Kindly suggest if we are missing anything in the tasks planned, and please
>>>>>answer the questions below:
>>>>>
>>>>>
>>>>>Any other metadata extractor that needs to be used?
>>>>
>>>>Can’t say - up to you on this.
>>>>
>>>>>Hints on the link analysis example and where it can be done in OODT?
>>>>
>>>>Link Analysis should be a piece of custom code that you implement (after
>>>>indexing, say, in the FM, or during it) in which you use the built-up
>>>>information to construct a "linkRank" score before indexing in Solr (via
>>>>CAS-PGE and ETLlib/poster).
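>>>>
>>>>(As a concrete, if naive, stand-in for that score: count how many postings
>>>>each company has and stamp the count on every document as a linkRank field
>>>>before you post to Solr. A jq sketch, assuming one JSON object per staged
>>>>file and a "Company" field; swap in whatever link signal you actually
>>>>build up:
>>>>
>>>>jq -s '[ group_by(.Company)[] | (length) as $n | .[] | . + {linkRank: $n} ]' \
>>>>   staging/*.json > ranked-jobs.json
>>>>
>>>>The real score can be anything you compute from the corpus; the point is
>>>>just that it ends up as a field Solr can sort or boost on.)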
>>>>
>>>>>Are function queries like recip() and date boosting a good enough trick
>>>>>to do the queries?
>>>>
>>>>These are the types of things to look at for the Content based ranking.
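>>>>
>>>>(The classic Solr date-boost recipe is a good starting point, e.g. as a
>>>>boost function with edismax; PostedDate here is a placeholder for whatever
>>>>you call your date field:
>>>>
>>>>http://localhost:8983/solr/collection1/select?q=*:*&defType=edismax&bf=recip(ms(NOW,PostedDate),3.16e-11,1,1)
>>>>
>>>>recip(ms(NOW,field),3.16e-11,1,1) decays the score with document age, which
>>>>is one simple form of content-based ranking.)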
>>>>
>>>>Cheers,
>>>>Chris
>>>>
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>Chris Mattmann, Ph.D.
>>>>Adjunct Associate Professor, Computer Science Department
>>>>University of Southern California
>>>>Los Angeles, CA 90089 USA
>>>>Email: [email protected]
>>>>WWW: http://sunset.usc.edu/~mattmann/
>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On Sat, Nov 1, 2014 at 2:41 PM, Christian Alan Mattmann
>>>>><[email protected]> wrote:
>>>>>
>>>>>Already done.
>>>>>
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>Chris Mattmann, Ph.D.
>>>>>Adjunct Associate Professor, Computer Science Department
>>>>>University of Southern California
>>>>>Los Angeles, CA 90089 USA
>>>>>Email: [email protected]
>>>>>WWW: http://sunset.usc.edu/~mattmann/
>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>-----Original Message-----
>>>>>From: Gautham Gowrishankar <[email protected]>
>>>>>Date: Saturday, November 1, 2014 at 11:02 AM
>>>>>To: Chris Mattmann <[email protected]>
>>>>>Subject: Re: Regarding Assignment 2
>>>>>
>>>>>>Hello Professor,
>>>>>>
>>>>>>
>>>>>>Kindly reply to my earlier mail .
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Regards
>>>>>>Gautham
>>>>>>
>>>>>>
>>>>>>On Fri, Oct 31, 2014 at 5:05 PM, Gautham Gowrishankar
>>>>>><[email protected]> wrote:
>>>>>>
>>>>>>Hello Professor,
>>>>>>
>>>>>>
>>>>>>Looking at the queries you have asked, we have concluded that only
>>>>>>certain fields of the JSON dataset would need to be extracted, like:
>>>>>>Posted Date
>>>>>>Title
>>>>>>Start
>>>>>>Duration
>>>>>>Job Type
>>>>>>Company
>>>>>>
>>>>>>First Seen Date
>>>>>>Location
>>>>>>Last Seen
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Query 1
>>>>>>Predict which geospatial areas will have which job types in the future.
>>>>>>====================================
>>>>>>Arrange by descending dates with a count for each job type, and provide
>>>>>>proper weights to rank them.
>>>>>>
>>>>>>
>>>>>>Query 2
>>>>>>Compare jobs in terms of how quickly they're filled, specifically with
>>>>>>regard to region.
>>>>>>=====================================
>>>>>>For each given region, provide a comparison statistic of diff = (first
>>>>>>seen date - last seen date) for each job.
>>>>>>
>>>>>>
>>>>>>Query 3
>>>>>>Can you classify and zone cities based on the jobs data (e.g. commercial
>>>>>>shopping region, industrial, residential, business offices, medical, etc.)?
>>>>>>=====================================
>>>>>>
>>>>>>
>>>>>>Query 4
>>>>>>What are the trends as they relate to full-time vs. part-time employment
>>>>>>in South America?
>>>>>>======================================
>>>>>>For each time interval, compare the part-time vs. full-time (job type)
>>>>>>stats according to location.
>>>>>>
>>>>>>
>>>>>>
>>>>>>Kindly answer the questions below based on the above conclusions:
>>>>>>1. Do we need to extract only the above fields as metadata from the JSON
>>>>>>dataset? In case we need to extract only certain fields, should this be
>>>>>>done through a script or a Java program, and where can we find the
>>>>>>necessary material?
>>>>>>
>>>>>>
>>>>>>2. Kindly point us to some material where we can find a way to ingest
>>>>>>our (ranking) algorithms into Solr.
>>>>>>
>>>>>>
>>>>>>3. Give us hints as to where we need to look for querying through Solr.
>>>>>>
>>>>>>
>>>>>>Regards
>>>>>>Gautham
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Mon, Oct 27, 2014 at 6:31 PM, Gautham Gowrishankar
>>>>>><[email protected]> wrote:
>>>>>>
>>>>>>Hi Prof,
>>>>>>Below are the Resource Manager batch stub logs, and attached is the status
>>>>>>seen on the GUI.
>>>>>>============================================================
>>>>>>java.lang.Exception: batchstub.executeJob returned false
>>>>>>        at org.apache.oodt.cas.resource.batchmgr.XmlRpcBatchMgrProxy.run(XmlRpcBatchMgrProxy.java:125)
>>>>>>
>>>>>>
>>>>>>and below are the Workflow Manager logs:
>>>>>>----------------------------------------------------------------------
>>>>>>
>>>>>>FINEST: [{job.queueName=high,
>>>>>>job.instanceClassName=org.apache.oodt.cas.workflow.structs.TaskJob,
>>>>>>job.name=urn:oodt:FileConcatenator, job.id=, job.status=, job.load=2,
>>>>>>job.inputClassName=org.apache.oodt.cas.workflow.structs.TaskJobInput},
>>>>>>{task.instance.class=org.apache.oodt.pge.examples.fileconcatenator.FileConcatenatorPGETask,
>>>>>>task.config={PGETask_ConfigFilePath=null/file_concatenator/pge-configs/PGEConfig.xml,
>>>>>>PCS_ClientTransferServiceFactory=org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory,
>>>>>>PCS_ActionRepoFile=file:/Users/Adarsh/oodt-deploy/crawler/policy/crawler-config.xml,
>>>>>>PCS_MetFileExtension=met, PGETask_DumpMetadata=true,
>>>>>>PCS_WorkflowManagerUrl=http://localhost:9200,
>>>>>>PCS_FileManagerUrl=http://localhost:9000,
>>>>>>PGETask_Name=FileConcatenator},
>>>>>>task.metadata={TaskId=[urn:oodt:FileConcatenator],
>>>>>>WorkflowManagerUrl=[http://Adarshs-MacBook-Pro.local:9200],
>>>>>>JobId=[a551fd81-5e3c-11e4-b229-73fd473a7137], RunID=[testNumber2],
>>>>>>ProcessingNode=[Adarshs-MacBook-Pro.local],
>>>>>>WorkflowInstId=[a551fd81-5e3c-11e4-b229-73fd473a7137]}}]
>>>>>>Kindly let us know, since the exception is a single line and we are not
>>>>>>able to figure out the source.
>>>>>>Regards
>>>>>>Gautham
>>>>>>
>>>>>>
>>>>>>On Mon, Oct 27, 2014 at 4:11 PM, Gautham Gowrishankar
>>>>>><[email protected]> wrote:
>>>>>>
>>>>>>Hi Prof,
>>>>>>
>>>>>>
>>>>>>We were trying out the CAS-PGE learn-by-example tutorial, but after
>>>>>>starting the workflow it has taken 15 minutes and only 33% has been
>>>>>>completed.
>>>>>>
>>>>>>
>>>>>>Attached is the screenshot. Kindly let us know whether we are on the
>>>>>>right track.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Regards
>>>>>>Gautham
>>>>>>
>>>>>>
>>>>>>On Mon, Oct 27, 2014 at 11:51 AM, Gautham Gowrishankar
>>>>>><[email protected]> wrote:
>>>>>>
>>>>>>Hi Professor,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>I have issues running ./querytool; the following lines are what it
>>>>>>seems to be pointing to:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>==================
>>>>>>"$_RUNJAVA" $JAVA_OPTS $OODT_OPTS \
>>>>>>  -Djava.endorsed.dirs=../lib \
>>>>>>  org.apache.oodt.cas.filemgr.tools.QueryTool "$@"
>>>>>>
>>>>>>
>>>>>>Any idea what the issue might be?
>>>>>>
>>>>>>
>>>>>>Regards
>>>>>>Gautham
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Mon, Oct 20, 2014 at 7:45 PM, Christian Alan Mattmann
>>>>>><[email protected]> wrote:
>>>>>>
>>>>>>Hi Gautham,
>>>>>>
>>>>>>Thanks for your question - one of the main reasons is that
>>>>>>you can keep track of the upstream provenance using OODT
>>>>>>which may or may not aid you in your ranking computation
>>>>>>later on. There are some other things (e.g., automated
>>>>>>benchmarking and so forth) that OODT provides.
>>>>>>
>>>>>>ETLlib also has an easy poster that I'd like you guys to try; it's
>>>>>>just as easy as (if not easier than) post.jar.
>>>>>>
>>>>>>Cheers,
>>>>>>Chris
>>>>>>
>>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>Chris Mattmann, Ph.D.
>>>>>>Adjunct Associate Professor, Computer Science Department
>>>>>>University of Southern California
>>>>>>Los Angeles, CA 90089 USA
>>>>>>Email: [email protected]
>>>>>>WWW: http://sunset.usc.edu/~mattmann/
>>>>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Gautham Gowrishankar <[email protected]>
>>>>>>Date: Monday, October 20, 2014 at 5:07 PM
>>>>>>To: Chris Mattmann <[email protected]>
>>>>>>Subject: Regarding Assignment 2
>>>>>>
>>>>>>>Hello Professor,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>I had a question regarding the index,
>>>>>>>Why were we asked to index the JSON files using OODT and ETLlib when Solr
>>>>>>>has the capability to perform automatic indexing, e.g. using post.jar?
>>>>>>>
>>>>>>>
>>>>>>>Regards
>>>>>>>Gautham
>>>>>>>
>>>>>>>
