First, a question about the current behavior: does Nutch honor the <a rel="nofollow"...> convention? If so, where is that coded?

On a related note, it seems it would be beneficial for Outlink to carry metadata, not just anchor text and a URL. For example, my application will crawl HTML sites whose HEAD contains a <link> to RDF data. In an HtmlParseFilter, I'd like to add ParseData metadata so that an indexer (a custom one currently, not the Nutch one) can get at the RDF data that has been fetched from the URL stored in the metadata. Make sense?
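
To make that concrete, here is a rough sketch of the extraction step in plain JDK Java. It is not Nutch code: RdfLinkFinder, findRdfLinks, and parse are hypothetical names, and it assumes the RDF link is marked with type="application/rdf+xml" and that the page is well-formed XHTML with lowercase tag names. A real HtmlParseFilter would receive the DOM from Nutch's HTML parser instead of building one itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
class RdfLinkFinder {

    /** Returns the href of every <link> element typed application/rdf+xml. */
    static List<String> findRdfLinks(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList links = doc.getElementsByTagName("link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            // getAttribute returns "" when the attribute is absent.
            String type = link.getAttribute("type");
            String href = link.getAttribute("href");
            if ("application/rdf+xml".equalsIgnoreCase(type) && !href.isEmpty()) {
                hrefs.add(href);
            }
        }
        return hrefs;
    }

    /** Parses a well-formed XHTML string; stands in for the DOM a parser would supply. */
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The URLs returned here are what I would want to store in the ParseData metadata, or carry on the Outlink itself if it supported metadata.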

Does my use case suggest that Outlink should carry metadata along with it, or is there another way to achieve this (besides writing a custom HTML parser)?

Thanks,
    Erik
