Hi Matthias,
Several years ago, when I was crawling, parsing, and indexing full-page content
for Simpy.com, I used Nutch in exactly that manner.
For example (this code is outdated, but you'll get the idea):
    System.out.println("Urls to fetch: " + _urls.size());
    if (_urls.size() == 0)
        return;

    // clean up and prepare the FS
    prepareFS();
    // create the URL file
    String urlFile = createURLFile();
    // create the fetch list from the URL file
    createFetchList(urlFile);
    // start the fetcher
    _segmentDir = getLastSegmentDirectory(_rootDir);
    String[] params = new String[] {    // THIS IS WHAT YOU ARE AFTER
        "-local",
        _segmentDir
    };
    org.apache.nutch.fetcher.Fetcher.main(params);    // THIS IS WHAT YOU ARE AFTER
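
The createURLFile() helper is not shown in the snippet above. As a rough sketch
only, assuming the URL file is a plain-text list with one URL per line, it could
look something like the following. The class name UrlFileWriter, the method
writeUrlFile, and the example.com URLs are made up for illustration and are not
part of Nutch; the sketch also returns a Path rather than the String used above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: writes a list of URLs to a text file.
class UrlFileWriter {

    // Assumption: the URL file is plain text with one URL per line.
    // Creates a new temporary file, writes the URLs, and returns the file's path.
    public static Path writeUrlFile(List<String> urls) throws IOException {
        Path file = Files.createTempFile("urls", ".txt");
        Files.write(file, urls, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URLs; replace with the documents you want indexed.
        List<String> urls = Arrays.asList(
            "http://example.com/report.pdf",
            "http://example.com/sheet.xls");
        Path file = writeUrlFile(urls);
        System.out.println("Wrote " + urls.size() + " URLs to " + file);
    }
}
```

The returned path (converted with toString() if a String is needed) would then
be handed to whatever step builds the fetch list.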
If you look at the bin/nutch script, you will see that it really just calls
Nutch's Java classes, so you only have to figure out which parameters those
classes take and then call them as above, or even more directly through their
constructors and methods other than main.
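
To make the "call main with a parameter array" idea concrete without needing
Nutch on the classpath, here is a minimal sketch that invokes a class's static
main(String[]) by name through reflection. MainInvoker, runMain, and EchoTool
are hypothetical names used only for this illustration, and the segment path is
a placeholder. When the Nutch jars are on your classpath, the direct static
call shown earlier is simpler; reflection is used here only so the sketch
compiles on its own.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical illustration, not part of Nutch.
class MainInvoker {

    // Stand-in for a command-line class such as a fetcher.
    // It only records and prints the arguments it receives.
    public static class EchoTool {
        public static String[] lastArgs = new String[0];

        public static void main(String[] args) {
            lastArgs = args.clone();
            System.out.println("EchoTool called with " + Arrays.toString(args));
        }
    }

    // Looks up the public static main(String[]) of the named class and calls it
    // with the given arguments, much like a shell script running "java ClassName args".
    public static void runMain(String className, String... args)
            throws ReflectiveOperationException {
        Class<?> tool = Class.forName(className);
        Method main = tool.getMethod("main", String[].class);
        // Cast to Object so the array is passed as ONE argument, not spread as varargs.
        main.invoke(null, (Object) args);
    }

    public static void main(String[] unused) throws Exception {
        // Placeholder arguments; real classes define their own parameters.
        runMain(EchoTool.class.getName(), "-local", "path/to/segment");
    }
}
```

One thing to watch for with this approach: a command-line class may call
System.exit() from its main method, which would also terminate your own
program, so calling constructors and other methods directly can be the safer
route.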
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Matthias W. <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 13, 2009 7:17:50 AM
> Subject: nutch crawling with java (not shellscript)
>
>
> Hi,
> is there a tutorial or can anyone explain if and how I can run the nutch
> crawler via java and not with the shellscript?
> Furthermore, I don't need to crawl, because I already have a list of URLs (PDF,
> Word, Excel, ... documents) that I have to index.
> -> In my case Nutch only has to create the index from the URL list.
>
> Till now I've got a shellscript which calls "bin/nutch crawl ..."
>
> But if it is possible, I want to use java code instead of the "bin/nutch"
> crawlscript.
>
> Are there Java classes and methods to do this?
>
> To illustrate, this is how I imagine starting the crawl or the indexing
> process:
> "java Crawl"
> so that I can set the crawl options in Java code rather than in a
> shell script.
>
> Is this possible?
>
> Thanks!
> Matthias
> --
> View this message in context:
> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
> Sent from the Nutch - User mailing list archive at Nabble.com.