Chris,
After running "bin/nutch crawl ...." I would like
to access the contents of the fetched pages programmatically
for further processing. I would then like to update the
database and the index with the post-processed pages.
Have you already taken a look at the new Creative Commons plugin?
  public Document filter(Document doc, Parse parse, FetcherOutput fo)
    throws IndexingException {
    // get the license that was extracted by the parse filter
    String licenseUrl = parse.getData().get("License-Url");
    if (licenseUrl != null) {
      // add it as stored and indexed, so it's both searchable and returned
      doc.add(Field.Text("license", licenseUrl));
      LOG.info("CC: indexing " + licenseUrl + " for: " + fo.getUrl());
    }
    return doc;
  }
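For context, an indexing filter like this gets wired into Nutch through the plugin's plugin.xml descriptor. Roughly like the following (the ids, class names and extension-point name here are illustrative; check the actual creativecommons plugin sources for the real values):

```xml
<plugin id="creativecommons" name="Creative Commons Plugins"
        version="1.0.0" provider-name="nutch.org">
   <runtime>
      <library name="creativecommons.jar">
         <export name="*"/>
      </library>
   </runtime>
   <!-- hook the filter class into the indexing-filter extension point -->
   <extension id="org.creativecommons.nutch.CCIndexingFilter"
              name="CC Indexing Filter"
              point="net.nutch.indexer.IndexingFilter">
      <implementation id="CCIndexingFilter"
                      class="org.creativecommons.nutch.CCIndexingFilter"/>
   </extension>
</plugin>
```

The plugin also has to be activated via the plugin.includes property in your nutch configuration before the indexer will pick it up.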
I have looked through some of the source code as
well as the java docs, however I am unable to determine
which classes will help me access the page contents from
the database.
http://www.nutch.org/cgi-bin/twiki/view/Main/WebDB
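One thing to keep in mind: the WebDB itself only stores page and link metadata; the fetched bytes live in the segment directories. A rough sketch of reading them back, assuming the net.nutch.* package layout of this era (class and method names are from memory and may differ in your version, so treat this as a starting point rather than working code):

```java
import java.io.File;
import net.nutch.fs.NutchFileSystem;
import net.nutch.io.ArrayFile;
import net.nutch.protocol.Content;   // assumed location of the Content class

public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    NutchFileSystem nfs = NutchFileSystem.get();   // local filesystem by default
    File segmentDir = new File(args[0]);           // a single segment directory
    // the raw fetched pages are kept in the segment's "content" ArrayFile
    ArrayFile.Reader reader =
        new ArrayFile.Reader(nfs, new File(segmentDir, "content").toString());
    Content content = new Content();
    while (reader.next(content) != null) {
      byte[] bytes = content.getContent();         // raw fetched bytes
      System.out.println(content.getUrl() + " (" + bytes.length + " bytes)");
      // ... post-process the page content here ...
    }
    reader.close();
  }
}
```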
Also, is it possible to update the database and index
after processing the fetched pages? If yes, what may
this require?
The Creative Commons code shows you how you can add custom metadata to the index.
The fresh language-identifier plugin illustrates this as well.
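As for updating the database and index afterwards: the crawl command is just a wrapper around the individual tools, so you can re-run the relevant steps yourself after post-processing a segment. Something along these lines (tool names as I remember them, and the segment name is only an example; check the usage output of bin/nutch for your version):

```shell
# fold the segment's fetch results back into the WebDB
bin/nutch updatedb db segments/20050101120000

# (re)build the Lucene index for that segment
bin/nutch index segments/20050101120000
```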
HTH
Stefan
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
