Chris,

After running "bin/nutch crawl ...." I would like

to access the contents of the fetched pages programmaticaly

for further processing. I would then like to update the

database and the index with the post-processed pages.

Have you already taken a look at the new creative commons plugin?

  public Document filter(Document doc, Parse parse, FetcherOutput fo)
    throws IndexingException {

    // get the license that was extracted by the parse filter
    String licenseUrl = parse.getData().get("License-Url");

    if (licenseUrl != null) {
      // add it as stored and indexed, so it's both searchable and returned
      doc.add(Field.Text("license", licenseUrl));
      LOG.info("CC: indexing " + licenseUrl + " for: " + fo.getUrl());
    }

    return doc;
  }




I have looked through some of the source code as well as the javadocs;
however, I am unable to determine which classes will help me access the
page contents from the database.



http://www.nutch.org/cgi-bin/twiki/view/Main/WebDB



Also, is it possible to update the database and index after processing
the fetched pages? If so, what would this require?

The creative commons code shows how you can add custom metadata to the
index. The new language identifier plugin illustrates this as well.
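To make the pattern easier to see in isolation, here is a minimal, self-contained sketch of what the CC filter above does, with plain Java collections standing in for the Nutch/Lucene types (the Map-based "document" and the class name below are purely illustrative, not real Nutch classes): look up a value that the parse step extracted, and copy it into the index document only when it is present.

```java
import java.util.HashMap;
import java.util.Map;

public class LicenseFilterSketch {

    // Stand-in for an indexing filter: if the parse metadata contains a
    // license URL, add it to the document so it is searchable; otherwise
    // leave the document untouched.
    static Map<String, String> filter(Map<String, String> doc,
                                      Map<String, String> parseMeta) {
        String licenseUrl = parseMeta.get("License-Url");
        if (licenseUrl != null) {
            doc.put("license", licenseUrl);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("License-Url", "http://creativecommons.org/licenses/by/2.0/");

        Map<String, String> doc = filter(new HashMap<>(), meta);
        System.out.println(doc.get("license"));
    }
}
```

In the real plugin the document is a Lucene `Document` and the field is added as stored and indexed, but the control flow is the same: extract in a parse filter, then conditionally index in an indexing filter.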


HTH
Stefan




_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
