---------- Forwarded message ----------
From: MengYing Wang <mengyingwa...@gmail.com>
Date: Thu, Oct 23, 2014 at 10:00 PM
Subject: Re: Directed Research Weekly Report from 2014/09/29 - 2014/10/05
To: "Verma, Rishi (398M)" <rishi.ve...@jpl.nasa.gov>
Cc: Christian Alan Mattmann <mattm...@usc.edu>, "Mcgibbney, Lewis J (398M)"
<lewis.j.mcgibb...@jpl.nasa.gov>, "Bryant, Ann C (398G-Affiliate)" <
anniebry...@gmail.com>, "Ramirez, Paul M (398M)" <
paul.m.rami...@jpl.nasa.gov>, "Mattmann, Chris A (3980)" <
chris.a.mattm...@jpl.nasa.gov>, Tyler Palsulich <tpalsul...@gmail.com>, "
u...@oodt.apache.org" <u...@oodt.apache.org>


Dear Rishi,

I followed the new steps to use OODT RADiX. Unfortunately, I got the
same "Profile with id: 'fm-solr-catalog' has not been activated" error.
Below are my commands and some of the terminal output. Could you check
whether I have made a mistake, or whether something might be wrong with
the source code? I really appreciate your help!

Step 1: $svn co http://svn.apache.org/repos/asf/oodt/trunk/ oodt_radix
A    oodt_radix/curator
A    oodt_radix/curator/pom.xml
A    oodt_radix/curator/src
A    oodt_radix/curator/src/test
......
Checked out revision 1633738.

Step 2: $cd oodt_radix/

Step 3: $mvn clean install
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO]   OODT Core
[INFO]   Common Utilities
[INFO]   CAS Command Line Interface
......
[INFO] BUILD SUCCESSFUL
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 5 minutes 29 seconds
[INFO] Finished at: Wed Oct 22 20:40:38 PDT 2014
[INFO] Final Memory: 133M/254M

Step 4: $mvn archetype:generate
[INFO] Scanning for projects...
[INFO] Reactor build order:
[INFO]   OODT Core
[INFO]   Common Utilities
......
[INFO] project created from Old (1.x) Archetype in dir:
/Users/AngelaWang/Downloads/oodt_radix/radix-archetype
[INFO]
------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO]
------------------------------------------------------------------------
[INFO] Total time: 1 minute 40 seconds
[INFO] Finished at: Wed Oct 22 20:52:46 PDT 2014
[INFO] Final Memory: 36M/84M
[INFO]
------------------------------------------------------------------------

Step 5: $cd radix-archetype/
Step 6: $mvn clean package -Pfm-solr-catalog
[INFO] Scanning for projects...
[WARNING]
Profile with id: 'fm-solr-catalog' has not been activated.

[INFO]
------------------------------------------------------------------------
[INFO] Building radix-archetype
[INFO]    task-segment: [clean, package]
[INFO]
------------------------------------------------------------------------
.....
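One way to narrow this down is to ask Maven which profiles are visible from the directory where the build is invoked. A diagnostic sketch using the standard maven-help-plugin (run it from the same directory as Step 6; this is a suggestion, not a command from the thread):

```shell
# List every profile the POM in the current directory (and its parents)
# defines. If 'fm-solr-catalog' does not appear, the build in this
# directory cannot activate it, which matches the warning above.
mvn help:all-profiles

# Show which profiles would actually be active for the failing invocation.
mvn help:active-profiles -Pfm-solr-catalog
```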

Best,
Mengying (Angela) Wang

On Sat, Oct 18, 2014 at 3:49 PM, Verma, Rishi (398M) <
rishi.ve...@jpl.nasa.gov> wrote:

>  Hi MengYing,
>
>  Your CMD1 should not have the ‘-Pfm-solr-catalog’ argument. The reason
> is that that command *generates* a new project for you, whereas the
> ‘-Pfm-solr-catalog’ profile should only be used to *build* the project
> once it has already been generated. You might want to read up a bit on
> Maven archetypes, which is what OODT RADiX is.
> http://maven.apache.org/guides/introduction/introduction-to-archetypes.html
>
>  Let me explain it this way; here are the steps for using OODT RADiX:
> 1. Get a hold of the latest OODT RADiX Maven Archetype (you might have
> already done this if you have the full OODT source)
>     i.e. download the full OODT source and invoke ‘mvn install’ so that
> you can use the latest RADiX archetype
>     http://svn.apache.org/repos/asf/oodt/trunk/
> 2. Use the OODT RADiX Maven Archetype to *generate* a new OODT project
> source folder structure for you (this is the source for your new project!)
>     i.e. invoke the command:
>     > mvn archetype:generate
> (select RADiX from the list of archetypes you see, and follow the prompts)
> 3. Change into the newly generated directory from above, and *build* a
> tar-ball distribution of OODT that you can run from the source folder
> structure you generated earlier
> > mvn clean package -Pfm-solr-catalog
> 4. Take the built tar-ball distribution, and extract it somewhere else for
> launching OODT
> > tar zxf distribution/target/oodt-*.tar.gz -C /usr/local/my-oodt-project
> 5. Run OODT
> > cd /usr/local/my-oodt-project/bin
> > ./oodt start
>
>  That’s the typical workflow for using RADiX. So the key here is, only
> use the ‘-Pfm-solr-catalog’ argument when *building* OODT, not when
> *generating* the folder structure.
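The five steps above can be condensed into one shell session. This is a sketch, not part of the original message: the generated directory name depends on the artifactId you enter at the prompts (`my-oodt-project` below is a placeholder), and the exact tar-ball name under distribution/target varies by version:

```shell
# Sketch of the RADiX workflow; directory and artifact names are assumptions.
svn co http://svn.apache.org/repos/asf/oodt/trunk/ oodt
cd oodt
mvn install                          # 1. install the RADiX archetype locally

mvn archetype:generate               # 2. pick RADiX from the list, follow prompts
cd my-oodt-project                   # 3. cd into the *generated* project directory

mvn clean package -Pfm-solr-catalog  #    build the tar-ball distribution
mkdir -p /usr/local/my-oodt-project
tar zxf distribution/target/oodt-*.tar.gz -C /usr/local/my-oodt-project  # 4. extract

cd /usr/local/my-oodt-project/bin    # 5. launch OODT
./oodt start
```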
>
>   *If you’re starting from scratch:*
> 1. Use Vagrant Virtual Machine technology to get a pre-built OODT
> deployment connected to Solr in one command:
> https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT
>
>  [ I didn't try this approach ]
>
>
>  You should try this, because all five steps above are automated for
> you by the Vagrant machine.
>
>  Thanks,
> rishi
>
>  On Oct 17, 2014, at 11:29 AM, MengYing Wang <mengyingwa...@gmail.com>
> wrote:
>
>  Dear Rishi,
>
>  Actually, in the first command of the tutorial
> <https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands>:
> curl -s http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/bin/radix | bash
> the "default" profile, rather than "fm-solr-catalog", is the one that gets
> activated, so the Solr component is not built.
>
>  guest-wireless-207-151-035-005:Downloads AngelaWang$ curl -s
> http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/bin/radix
> | bash
> [INFO] Scanning for projects...
> [INFO] Searching repository for plugin with prefix: 'archetype'.
> [INFO]
> ------------------------------------------------------------------------
> [INFO] Building Maven Default Project
> [INFO]    task-segment: [archetype:generate] (aggregator-style)
> [INFO]
> ------------------------------------------------------------------------
> [INFO] Preparing archetype:generate
> [INFO] No goals needed for project - skipping
>  ......
>
>
>  BTW, the "fm-solr-catalog" profile is defined in the filemgr pom.xml
> <http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/pom.xml>
> and distribution pom.xml
> <http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/distribution/pom.xml>
> .
>
>  Best,
> Mengying (Angela) Wang
>
> On Thu, Oct 16, 2014 at 7:42 PM, MengYing Wang <mengyingwa...@gmail.com>
> wrote:
>
>> Dear Rishi,
>>
>>  Thank you for your help.
>>
>>  Yes, I am using OODT 0.7, and I am running the 'mvn package
>> -Pfm-solr-catalog' command from the top-level directory.
>>
>>  Following are the commands and logs:
>>
>>  Cmd 1:
>>
>> guest-wireless-207-151-035-005:Downloads AngelaWang$ mvn
>> archetype:generate -Pfm-solr-catalog -DarchetypeGroupId=org.apache.oodt
>> -DarchetypeArtifactId=radix-archetype -DarchetypeVersion=0.6 -Doodt=0.7
>> -DgroupId=com.mycompany -DartifactId=oodt -Dversion=0.1
>>
>> [INFO] Scanning for projects...
>>
>> [WARNING]
>>
>> Profile with id: 'fm-solr-catalog' has not been activated.
>>
>> ......
>>
>> [INFO] BUILD SUCCESSFUL
>>
>> ......
>>
>> Cmd 2:
>>
>>
>>  guest-wireless-207-151-035-005:Downloads AngelaWang$ cd oodt
>>
>> Cmd 3:
>>
>> guest-wireless-207-151-035-005:oodt AngelaWang$ mvn clean package
>> -Pfm-solr-catalog
>>
>> [INFO] Scanning for projects...
>>
>> [WARNING]
>>
>> Profile with id: 'fm-solr-catalog' has not been activated.
>>
>> [INFO] Reactor build order:
>>
>> [INFO]   Data Management System
>>
>>  [INFO]   Extensions
>>
>> ......
>>
>>
>>  Thank you.
>>
>> Best,
>>
>> Mengying Wang
>>
>>
>>
>>
>>
>>
>> On Thu, Oct 16, 2014 at 6:09 PM, Verma, Rishi (398M) <
>> rishi.ve...@jpl.nasa.gov> wrote:
>>
>>> Hey Mengying,
>>>
>>>  That error usually gets thrown if you invoke a Maven build from a
>>> subdirectory not containing the profile definition.
>>>
>>>  Two things to check for:
>>> * Are you calling ‘mvn clean package -Pfm-solr-catalog’ from the
>>> top-level directory of your RADiX installation? i.e. the directory
>>> containing a pom.xml file and folders like ‘crawler’, ‘distribution’,
>>> ‘extensions’, etc.
>>> * Are you running OODT version 0.7 or later?
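The two checks above can be scripted as a quick sanity test (a sketch; the directory layout is assumed from a standard RADiX project):

```shell
# Run from the directory where 'mvn clean package -Pfm-solr-catalog' fails.
# Check 1: is this the top-level RADiX directory? It should contain a
# pom.xml plus module folders such as crawler/ and distribution/.
if [ -f pom.xml ] && [ -d distribution ]; then
  echo "top-level directory: OK"
else
  echo "not the top-level directory - cd into the generated project first"
fi

# Check 2: is the profile defined anywhere under this tree? If grep prints
# nothing, Maven cannot activate 'fm-solr-catalog' from here.
grep -rl "<id>fm-solr-catalog</id>" --include=pom.xml . || true
```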
>>>
>>>  Thanks,
>>>  rishi
>>>
>>>  On Oct 16, 2014, at 4:45 PM, MengYing Wang <mengyingwa...@gmail.com>
>>> wrote:
>>>
>>>  Dear Rishi,
>>>
>>>  When I try to use OODT RADiX with the command "mvn clean package
>>> -Pfm-solr-catalog", I get the "profile with id: 'fm-solr-catalog' has
>>> not been activated" error. Have you by any chance seen this error before?
>>> Thank you! Also, after the installation, no solr directory is found on my
>>> machine.
>>>
>>> $ mvn clean package -Pfm-solr-catalog
>>>
>>> [INFO] Scanning for projects...
>>>
>>> [WARNING]
>>>
>>> Profile with id: 'fm-solr-catalog' has not been activated.
>>>
>>> [INFO] Reactor build order:
>>>
>>> [INFO]   Data Management System
>>>
>>> ......
>>>  Best,
>>> Mengying Wang
>>>
>>> On Sun, Oct 5, 2014 at 3:05 PM, Verma, Rishi (398J) <
>>> rishi.ve...@jpl.nasa.gov> wrote:
>>>
>>>> Hi Mengying,
>>>>
>>>>  For integrating the OODT File Manager with Solr, you have a couple of
>>>> options, depending on the type of deployment you are doing and what
>>>> stage your software is at:
>>>>
>>>>  *If you’re starting from scratch:*
>>>> 1. Use Vagrant Virtual Machine technology to get a pre-built OODT
>>>> deployment connected to Solr in one command:
>>>> https://cwiki.apache.org/confluence/display/OODT/Vagrant+Powered+OODT
>>>> 2. Use OODT RADiX for a pre-built deployment directory containing the
>>>> OODT File Manager, Workflow, Resource, etc., with Solr pre-integrated.
>>>> RADiX provides pre-configured OODT deployments, freeing you from checking
>>>> out and building individual OODT modules from source.
>>>>    See:
>>>> https://cwiki.apache.org/confluence/display/OODT/RADiX+Powered+By+OODT#RADiXPoweredByOODT-TheCommands
>>>>    Make sure to build with the command: mvn -Pfm-solr-catalog package
>>>>  (see the README:
>>>> http://svn.apache.org/repos/asf/oodt/trunk/mvn/archetypes/radix/src/main/resources/archetype-resources/README.txt
>>>> )
>>>> 3. Connect OODT FM with Solr manually, see:
>>>> https://cwiki.apache.org/confluence/display/OODT/Integrating+Solr+with+OODT+RADiX
>>>>
>>>>  *If you already have a deployed OODT FM:*
>>>> 1. Follow these directions:
>>>> https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide
>>>> 2. If the above doesn’t work, then use OODT RADiX to create a FM and
>>>> Solr deployment that works, and copy those directories to your currently
>>>> deployed production directory.
>>>>
>>>>  Thanks - hope that helps!
>>>> Rishi
>>>>
>>>>  On Oct 5, 2014, at 10:56 AM, MengYing Wang <mengyingwa...@gmail.com>
>>>> wrote:
>>>>
>>>>  Dear Prof. Mattmann and Rishi,
>>>>
>>>>  Attached are the nutch and solr directories.
>>>>   nutch_solr.zip
>>>> <https://docs.google.com/file/d/0B7PYVKDpy0jlSnI3U1lFcGY0WnM/edit?usp=drive_web>
>>>>  As for problem (6), I could use the SolrIndexer instead. The following
>>>> is my File Manager directory.
>>>>
>>>>
>>>> https://drive.google.com/file/d/0B7PYVKDpy0jlVTk2NWFFY2sycW8/view?usp=sharing
>>>>
>>>>  Thank you!
>>>>
>>>>  Best,
>>>> Mengying Wang
>>>>
>>>>
>>>>
>>>> On Sun, Oct 5, 2014 at 9:25 AM, Christian Alan Mattmann <
>>>> mattm...@usc.edu> wrote:
>>>>
>>>>> Thanks Angela. Great work!
>>>>>
>>>>> Some comments/feedback:
>>>>>
>>>>> (1) According to
>>>>> https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide,
>>>>> use the Apache OODT Pushpull to crawl data files from a remote server
>>>>> to the local machine [Failed, no data files downloaded at all].
>>>>> - This problem is not so urgent. Maybe I should use some FTP client
>>>>> tools, e.g., FileZilla, to download the data files from the remote FTP
>>>>> servers.
>>>>>
>>>>> MY COMMENT: Please send me your PushPull directory zipped up. I will
>>>>> take a look - Tyler can you also look?
>>>>>
>>>>> (3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch,
>>>>> use
>>>>> the Apache Nutch and Solr to crawl and index local data files [Failed,
>>>>> No data is indexed in Solr].
>>>>> - This problem is not so urgent. Maybe this feature only works with
>>>>> Nutch 2.x; my Nutch version is 1.9. Also, I could use the OODT Crawler
>>>>> to ingest local files.
>>>>>
>>>>>
>>>>> MY COMMENT: Please send me your nutch + solr directories, zipped up.
>>>>> I will take a look.
>>>>>
>>>>> (6) According to
>>>>> https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide,
>>>>> integrate the Apache OODT File Manager with the Apache Solr [Failed, no
>>>>> product information available in the Solr].
>>>>> - It didn't work out. However, I could use (5) to integrate the OODT
>>>>> File Manager with Solr.
>>>>>
>>>>>
>>>>> MY COMMENT: Rishi, can you guys help Angela with OODT + Solr + FM?
>>>>> It’s not
>>>>> working for her.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California
>>>>> Los Angeles, CA 90089 USA
>>>>> Email: mattm...@usc.edu
>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: MengYing Wang <mengyingwa...@gmail.com>
>>>>> Date: Saturday, October 4, 2014 at 9:12 PM
>>>>> To: Chris Mattmann <mattm...@usc.edu>, "Mcgibbney, Lewis J (398J)"
>>>>> <lewis.j.mcgibb...@jpl.nasa.gov>
>>>>> Cc: Annie Bryant <anniebry...@gmail.com>, "Ramirez, Paul M (398J)"
>>>>> <paul.m.rami...@jpl.nasa.gov>, Chris Mattmann
>>>>> <chris.a.mattm...@jpl.nasa.gov>
>>>>> Subject: Directed Research Weekly Report from 2014/09/29 - 2014/10/05
>>>>>
>>>>> >Dear Prof. Mattmann,
>>>>> >
>>>>> >
>>>>> >New status of the previous failed problems:
>>>>> >
>>>>> >
>>>>> >(1) According to
>>>>> >
>>>>> https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>>>>> >, use the Apache OODT Pushpull to crawl data files from
>>>>> > a remote server to the local machine [Failed, no data files
>>>>> downloaded
>>>>> >at all].
>>>>> >
>>>>> >
>>>>> >- This problem is not so urgent. Maybe I should use some FTP client
>>>>> >tools, e.g., FileZilla, to download the data files from the remote FTP
>>>>> >servers.
>>>>> >
>>>>> >
>>>>> >(2) Use the Apache OODT Pushpull to crawl webpages [Succeed].
>>>>> >
>>>>> >
>>>>> >(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch
>>>>> ,
>>>>> >use the Apache Nutch and Solr to crawl and index local data files
>>>>> [Failed,
>>>>> > No data is indexed in Solr].
>>>>> >
>>>>> >
>>>>> >- This problem is not so urgent. Maybe this feature only works with
>>>>> >Nutch 2.x; my Nutch version is 1.9. Also, I could use the OODT Crawler
>>>>> >to ingest local files.
>>>>> >
>>>>> >
>>>>> >(4) Integrate the Tika parser with the Apache Nutch [Failed, no Tika
>>>>> >fields available in the Solr].
>>>>> >
>>>>> >
>>>>> >- Still in progress.
>>>>> >
>>>>> >
>>>>> >(5) According to
>>>>> >https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+dump+a+File+Manager+Catalog,
>>>>> >use the SolrIndexer to dump all product information from the Apache
>>>>> >OODT File Manager to the Apache Solr [Succeed].
>>>>> >
>>>>> >
>>>>> >(6) According to
>>>>> >https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide,
>>>>> >integrate the Apache OODT File Manager with the Apache Solr [Failed, no
>>>>> >product information available in the Solr].
>>>>> >
>>>>> >
>>>>> >- It didn't work out. However, I could use (5) to integrate the OODT
>>>>> >File Manager with Solr.
>>>>> >
>>>>> >
>>>>> >So far, I have two ways to crawl remote data and construct indexes in
>>>>> the
>>>>> >Solr:
>>>>> >
>>>>> >
>>>>> >(1) moving data to the local machine using FileZilla -> developing a
>>>>> >metadata extractor using Tika -> crawling the data directory using
>>>>> >the OODT Crawler -> migrating product information to Solr using the
>>>>> >SolrIndexer
>>>>> >
>>>>> >
>>>>> >(2) crawling websites using Nutch -> indexing some basic metadata in
>>>>> >Solr
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >Thanks.
>>>>> >
>>>>> >
>>>>> >Best,
>>>>> >Mengying (Angela) Wang
>>>>> >
>>>>> >On Mon, Sep 29, 2014 at 12:22 PM, MengYing Wang
>>>>> ><mengyingwa...@gmail.com> wrote:
>>>>> >
>>>>> >Dear Prof. Mattmann,
>>>>> >
>>>>> >
>>>>> >In the previous two weeks, I was trying to solve the following
>>>>> problems:
>>>>> >
>>>>> >
>>>>> >(1) According to
>>>>> >
>>>>> https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide
>>>>> >, use the Apache OODT Pushpull to crawl data files from
>>>>> > a remote server to the local machine [Failed, couldn't find the data
>>>>> >files].
>>>>> >
>>>>> >
>>>>> >(2) Use the Apache OODT Pushpull to crawl webpages [Failed, HttpClient
>>>>> >ClassNotFoundException].
>>>>> >
>>>>> >
>>>>> >(3) According to https://wiki.apache.org/nutch/IntranetDocumentSearch
>>>>> ,
>>>>> >use the Apache Nutch and Solr to crawl and index local data files
>>>>> [Failed,
>>>>> > No data files found in Solr].
>>>>> >
>>>>> >
>>>>> >(4) According to
>>>>> >https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide,
>>>>> >search and delete redundant products in the Apache OODT File Manager
>>>>> >[Succeed].
>>>>> >
>>>>> >
>>>>> >(5) According to
>>>>> >https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help,
>>>>> use
>>>>> >the Apache OODT Crawler and Tika to extract metadata and then query
>>>>> > the metadata in the Apache OODT File Manager [Succeed].
>>>>> >
>>>>> >
>>>>> >(6) According to https://wiki.apache.org/nutch/IndexMetatags, use the
>>>>> >plugins to parse HTML meta tags into separate fields in the Solr index
>>>>> >[Succeed].
>>>>> >
>>>>> >
>>>>> >(7) Integrate the Tika parser with the Apache Nutch to extract
>>>>> >metadata information, which would be indexed in the Solr [Failed, no
>>>>> >Tika fields available in the Solr].
>>>>> >
>>>>> >
>>>>> >(8) According to
>>>>> >https://cwiki.apache.org/confluence/display/OODT/Using+the+SolrIndexer+to+dump+a+File+Manager+Catalog,
>>>>> >use the SolrIndexer to dump all product information from the Apache
>>>>> >OODT File Manager to the Apache Solr [Failed, no product information
>>>>> >available in the Solr].
>>>>> >
>>>>> >
>>>>> >(9) According to
>>>>> >https://cwiki.apache.org/confluence/display/OODT/Solr+File+Manager+Quick+Start+Guide,
>>>>> >integrate the Apache OODT File Manager with the Apache Solr [Failed, no
>>>>> >product information available in the Solr].
>>>>> >
>>>>> >
>>>>> >(10) According to https://lucene.apache.org/solr/4_10_0/tutorial.html
>>>>> ,
>>>>> >explore a simple command line tool for posting, deleting, updating and
>>>>> >querying
>>>>> > raw XMLs to the solr server [Succeed].
>>>>> >
>>>>> >
>>>>> >Thank you.
>>>>> >
>>>>> >
>>>>> >
>>>>> >Best,
>>>>> >Mengying Wang
>>>>> >
>>>>> >
>>>>> >On Wed, Sep 17, 2014 at 11:44 AM, MengYing Wang
>>>>> ><mengyingwa...@gmail.com> wrote:
>>>>> >
>>>>> >Dear Prof. Mattmann,
>>>>> >
>>>>> >
>>>>> >For the last week, I was learning the various Apache tool tutorials
>>>>> >and trying to figure out how to crawl data files on the web and then
>>>>> >build a metadata index for future queries. So far, I have found the
>>>>> >following two approaches:
>>>>> >
>>>>> >
>>>>> >1: Use the Apache OODT Pushpull to crawl a bunch of data files from
>>>>> some
>>>>> >remote server to localhost ->  Use the Apache Tika to extract the
>>>>> >metadata information for each data file ->  Use the Apache OODT File
>>>>> >Manager to ingest the metadata files ->  Use
>>>>> > the query_tool script to query the metadata information stored in the
>>>>> >Apache OODT File Manager
>>>>> >
>>>>> >
>>>>> >We could also achieve the above process by employing the Apache OODT
>>>>> >CAS-Curator to automatically call Apache Tika and the Apache File
>>>>> >Manager; for the details, you can refer to
>>>>> >http://oodt.apache.org/components/maven/curator/user/basic.html
>>>>> >
>>>>> >
>>>>> >2: Use the Apache Nutch to crawl a number of webpages -> Use the
>>>>> Apache
>>>>> >Solr to do the text queries.
>>>>> >
>>>>> >
>>>>> >However, there are some problems that I am still trying to solve:
>>>>> >
>>>>> >
>>>>> >(1) According to the Apache OODT Pushpull user guide
>>>>> >(https://cwiki.apache.org/confluence/display/OODT/OODT+Push-Pull+User+Guide),
>>>>> >data files should be downloaded to the staging area. However, when I
>>>>> >started the pushpull script, I waited at least 15 minutes and nothing
>>>>> >was downloaded. I checked the remote FTP server, and there are indeed
>>>>> >some data files. -_-!
>>>>> >
>>>>> >
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >guest-wireless-207-151-035-013:bin AngelaWang$ ./pushpull
>>>>> >TRANSFER:
>>>>> >org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>>>>> >^C
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >
>>>>> >Also, the url-downloader script does not work, because of a Java
>>>>> >NoClassDefFoundError.
>>>>> >
>>>>> >
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >guest-wireless-207-151-035-013:bin AngelaWang$ ./url-downloader
>>>>> >
>>>>> >
>>>>> http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT
>>>>> ><
>>>>> http://pds-imaging.jpl.nasa.gov/data/msl/MSLHAZ_0XXX/CATALOG/CATINFO.TXT
>>>>> >
>>>>> > .
>>>>> >Exception in thread "main" java.lang.NoClassDefFoundError:
>>>>> >org/apache/oodt/cas/pushpull/protocol/http/HttpClient
>>>>> >Caused by: java.lang.ClassNotFoundException:
>>>>> >org.apache.oodt.cas.pushpull.protocol.http.HttpClient
>>>>> >at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>>>>> >at java.security.AccessController.doPrivileged(Native Method)
>>>>> >at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>>>> >at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>>>> >at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>>>> >at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>>>> >
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >
>>>>> >
>>>>> >2: According to the Apache OODT Crawler Help
>>>>> >(https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help),
>>>>> the
>>>>> >Apache OODT Crawler could be integrated
>>>>> > with the Apache Tika. However, there is no
>>>>> >org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor class
>>>>> in
>>>>> >my Apache OODT Crawler package.
>>>>> >
>>>>> >
>>>>> >3: How to dump the metadata from the Apache OODT File Manager to
>>>>> >Apache Solr using the Apache OODT Workflow Manager? I have no clear
>>>>> >answer yet.
>>>>> >
>>>>> >
>>>>> >4: According to the Apache Solr Tutorial
>>>>> >(https://lucene.apache.org/solr/4_10_0/tutorial.html), users should be
>>>>> >able to add/delete/update documents using the post.jar script.
>>>>> >However, it doesn't work on my machine.
>>>>> >
>>>>> >
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >guest-wireless-207-151-035-013:exampledocs AngelaWang$ java -jar
>>>>> post.jar
>>>>> >solr.xml
>>>>> >SimplePostTool version 1.5
>>>>> >Posting files to base url
>>>>>  >http://localhost:8983/solr/update <http://localhost:8983/solr/update
>>>>> >
>>>>>  >using content-type application/xml..
>>>>> >POSTing file solr.xml
>>>>> >SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for
>>>>> >url:
>>>>> >http://localhost:8983/solr/update
>>>>> >SimplePostTool: WARNING: Response: <?xml version="1.0"
>>>>> encoding="UTF-8"?>
>>>>> ><response>
>>>>> ><lst name="responseHeader"><int name="status">400</int><int
>>>>> >name="QTime">1</int></lst><lst name="error"><str name="msg">ERROR:
>>>>> >[doc=SOLR1000] unknown field 'name'</str><int
>>>>> name="code">400</int></lst>
>>>>> ></response>
>>>>> >SimplePostTool: WARNING: IOException while reading response:
>>>>> >java.io.IOException: Server returned HTTP response code: 400 for URL:
>>>>> >http://localhost:8983/solr/update
>>>>> >1 files indexed.
>>>>> >COMMITting Solr index changes to
>>>>> >http://localhost:8983/solr/update..
>>>>> >Time spent: 0:00:00.032
>>>>> >
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >
>>>>> >
>>>>> >Solr logs:
>>>>> >
>>>>> >
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >6506114 [qtp1314570047-14] ERROR org.apache.solr.core.SolrCore  ­
>>>>> >org.apache.solr.common.SolrException: ERROR: [doc=SOLR1000] unknown
>>>>> field
>>>>> >'name'
>>>>> >at
>>>>>
>>>>> >org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185
>>>>> >)
>>>>> >at
>>>>>
>>>>> >org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand
>>>>> >.java:78)
>>>>> >at
>>>>>
>>>>> >org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.j
>>>>> >ava:238)
>>>>> >at
>>>>>
>>>>> >org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.ja
>>>>> >va:164)
>>>>> >at
>>>>>
>>>>> >org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdatePr
>>>>> >ocessorFactory.java:69)
>>>>> >.......
>>>>>
>>>>> >**************************************************************************
>>>>> >***********
>>>>> >
>>>>> >
>>>>> >
>>>>> >I will continue to solve the above problems this week, and could we
>>>>> >discuss the two approaches this Thursday after the class? Many thanks!
>>>>> >Have a good day!
>>>>> >
>>>>> >
>>>>> >Best,
>>>>> >Mengying (Angela) Wang
>>>>> >
>>>>> >
>>>>> >
>>>>> >On Mon, Sep 8, 2014 at 10:32 PM, MengYing Wang
>>>>> ><mengyingwa...@gmail.com> wrote:
>>>>> >
>>>>> >Dear Prof. Mattmann,
>>>>> >
>>>>> >
>>>>> >For the previous week, I successfully installed the following
>>>>> >software on my personal computer:
>>>>> >
>>>>> >
>>>>> >1: Apache OODT Catalog and Archive File Management Component:
>>>>> >http://oodt.apache.org/components/maven/filemgr/user/basic.html
>>>>> >2: Apache OODT Catalog and Archive Crawling Framework:
>>>>> >http://oodt.apache.org/components/maven/crawler/user/
>>>>> >
>>>>> >3: Apache OODT Catalog and Archive Workflow Management Component:
>>>>> >http://oodt.apache.org/components/maven/workflow/user/basic.html
>>>>> >
>>>>> >4: Apache Solr:
>>>>> >https://cwiki.apache.org/confluence/display/solr/Installing+Solr
>>>>> >5: Apache Nutch:
>>>>> >
>>>>> http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
>>>>> >6: Apache Tika: http://tika.apache.org/0.9/gettingstarted.html
>>>>> >
>>>>> >
>>>>> >This week I will continue playing with this software to figure out
>>>>> >the following three questions:
>>>>> >(1) how to get the metadata using Apache OODT or Apache Nutch?
>>>>> >(2) how to dump the metadata from Apache OODT to Apache Solr?
>>>>> >(3) how to query the metadata stored in Solr?
>>>>> >
>>>>> >Best,
>>>>> >Mengying (Angela) Wang
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >--
>>>>> >Best,
>>>>> >Mengying (Angela) Wang
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >--
>>>>> >Best,
>>>>> >Mengying (Angela) Wang
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >--
>>>>> >Best,
>>>>> >Mengying (Angela) Wang
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>>  Best,
>>>> Mengying (Angela) Wang
>>>>
>>>>
>>>>   ---
>>>> Rishi Verma
>>>> NASA Jet Propulsion Laboratory
>>>> California Institute of Technology
>>>> 4800 Oak Grove Drive, M/S 158-248
>>>> Pasadena, CA 91109
>>>> Tel: 1-818-393-5826
>>>>
>>>>
>>>
>>>
>>>  --
>>>  Best,
>>> Mengying (Angela) Wang
>>>
>>>
>>>  ---
>>> Rishi Verma
>>> NASA Jet Propulsion Laboratory
>>> California Institute of Technology
>>>
>>>
>>
>>
>>  --
>>  Best,
>> Mengying (Angela) Wang
>>
>
>
>
>  --
>  Best,
> Mengying (Angela) Wang
>
>
>  ---
> Rishi Verma
> NASA Jet Propulsion Laboratory
> California Institute of Technology
>
>


-- 
Best,
Mengying (Angela) Wang



-- 
Best,
Mengying (Angela) Wang
