Hi Matthias,

Several years ago, when I did crawling/parsing/indexing of full-page content for 
Simpy.com, I used Nutch in exactly that manner.

For example (this is outdated code, but you'll get the idea):

        System.out.println("Urls to fetch: " + _urls.size());

        if (_urls.size() == 0)
            return;

        // clean up and prepare the FS
        prepareFS();

        // create the URL file
        String urlFile = createURLFile();

        // create the fetch list from the URL file
        createFetchList(urlFile);

        // start the fetcher
        _segmentDir = getLastSegmentDirectory(_rootDir);
        // THIS IS WHAT YOU ARE AFTER
        String[] params = new String[] {
            "-local",
            _segmentDir
        };
        // THIS IS WHAT YOU ARE AFTER
        org.apache.nutch.fetcher.Fetcher.main(params);


If you look at the bin/nutch script, you will see it really just calls Nutch's 
Java classes, so you just have to figure out what parameters those classes take 
and then call them as above, or even more directly, using their constructors and 
methods other than main.
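
To make the "call the class instead of the script" idea concrete, here is a rough sketch (not Nutch's own code). It looks up a class by name and calls its main(String[]) through reflection, which is roughly what a launcher script does when it maps a command to a Java class. StepLauncher, runMain and FakeFetcher are hypothetical names; FakeFetcher is only a stand-in so the example runs without Nutch on the classpath. With Nutch available you would pass "org.apache.nutch.fetcher.Fetcher" and its arguments instead, assuming that class and its flags match your Nutch version.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class StepLauncher {

    /** Hypothetical stand-in for a Nutch tool class; it only records its arguments. */
    public static class FakeFetcher {
        public static String lastArgs = null;

        public static void main(String[] args) {
            lastArgs = String.join(" ", args);
            System.out.println("FakeFetcher called with: " + lastArgs);
        }
    }

    /**
     * Calls the public static main(String[]) of the named class with the given arguments.
     * Reflection failures and exceptions thrown by the tool are wrapped in IllegalStateException.
     */
    public static void runMain(String className, String... args) {
        try {
            Class<?> tool = Class.forName(className);
            Method main = tool.getMethod("main", String[].class);
            // Cast to Object so the array is passed as ONE argument, not spread as varargs.
            main.invoke(null, (Object) args);
        } catch (InvocationTargetException e) {
            throw new IllegalStateException(className + ".main failed", e.getCause());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + className + ".main", e);
        }
    }

    public static void main(String[] args) {
        // With Nutch on the classpath, the class name would be the real tool class
        // (for example the Fetcher used above) and the arguments its own parameters.
        runMain(FakeFetcher.class.getName(), "-local", "/tmp/segments/20090113");
    }
}
```

When the class name is known at compile time, a direct call such as Fetcher.main(params) is simpler. The reflective form is mainly useful if you want to drive several steps from a list of class names and arguments.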

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Matthias W. <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 13, 2009 7:17:50 AM
> Subject: nutch crawling with java (not shellscript)
> 
> 
> Hi,
> is there a tutorial or can anyone explain if and how I can run the nutch
> crawler via java and not with the shellscript?
> Furthermore I don't need to crawl, because I've got a list of URLs (PDF,
> Word, Excel, ... Documents) which I have to index
> -> In my case nutch only has to create the index from the urls list.
> 
> Till now I've got a shellscript which calls "bin/nutch crawl ..."
> 
> But if it is possible, I want to use java code instead of the "bin/nutch"
> crawlscript.
> 
> Are there Java classes and methods to do this?
> 
> For better understanding, my association to start the crawl respectively the
> index process:
>     "java Crawl"
> That I'm able to set options for crawling in the java code and not in a
> shellscript.
> 
> Is this possible?
> 
> Thanks!
> Matthias
> -- 
> View this message in context: 
> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
