First, a question about the current behavior: does Nutch honor the
<a rel="nofollow"...> convention? If so, where is that coded?
On a related note, it seems it would be beneficial for Outlink to
carry metadata, not just anchor text and a URL. For example, my
application will crawl HTML sites whose <head> contains a <link> to
RDF data. In an HtmlParseFilter, I'd like to add the URL of that RDF
data to the ParseData metadata so that an indexer (a custom one
currently, not the Nutch one) can get at the RDF data once it has
been fetched from that URL.
Make sense?
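
To make the idea concrete, here is a rough, standalone sketch of the
extraction step I have in mind. It deliberately uses no Nutch classes:
the Map stands in for the ParseData metadata, and the class name, the
"rdf.url" key, and the choice to recognize the link by
type="application/rdf+xml" are all my own assumptions for illustration.
A real filter would presumably work from the parsed document rather
than a regex.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not a Nutch class and not an HtmlParseFilter implementation.
class RdfLinkSketch {
    // Hypothetical metadata key under which the RDF URL is stored.
    static final String RDF_URL_KEY = "rdf.url";

    // Matches a whole <link ...> tag, case-insensitively.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches quoted name="value" (or name='value') attribute pairs inside a tag.
    private static final Pattern ATTR =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*[\"']([^\"']*)[\"']");

    // Returns the href of the first <link> whose type is application/rdf+xml,
    // or null if no such link is present.
    static String findRdfLink(String html) {
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new LinkedHashMap<>();
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), attr.group(2));
            }
            if ("application/rdf+xml".equalsIgnoreCase(attrs.get("type"))
                    && attrs.containsKey("href")) {
                return attrs.get("href");
            }
        }
        return null;
    }

    // Stores the RDF URL in the metadata map (a stand-in for ParseData metadata).
    // Leaves the map untouched when the page has no RDF link.
    static void addRdfMetadata(String html, Map<String, String> metadata) {
        String rdfUrl = findRdfLink(html);
        if (rdfUrl != null) {
            metadata.put(RDF_URL_KEY, rdfUrl);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<link rel=\"stylesheet\" type=\"text/css\" href=\"style.css\">"
            + "<link rel=\"meta\" type=\"application/rdf+xml\""
            + " href=\"http://example.org/data.rdf\">"
            + "</head><body>Hello</body></html>";
        Map<String, String> metadata = new LinkedHashMap<>();
        addRdfMetadata(html, metadata);
        System.out.println(metadata);
    }
}
```

The indexer would then look up that key to find where the RDF came
from. What the sketch cannot show is the part I am asking about: how
the URL would travel with the Outlink so that the RDF gets fetched and
tied back to the page.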
Does my use case indicate that Outlink should carry metadata along,
or is there another way to achieve this (besides writing a custom
HTML parser)?
Thanks,
Erik