Hi again,
Dennis Kubes wrote:
> Nutch gets outlinks from the pages it parses. This is either during the
> fetch process with parsing enabled or during just a parse process (see
> org.apache.nutch.parse.ParseSegment). The content is parsed via plugins
> configured in parse-plugins.xml in the conf directory. During the parse
> links are created as Outlink objects that are added to a ParseData
> object that is itself added to a Parse object. During the writing out
> of the parse object (ParseOutputFormat) the outlinks are saved as
> CrawlDatums in the crawl_parse directory under the segment. Then during
> the UpdateDb job (see CrawlDb) this crawl_parse is merged into the
> master Crawl Database. That is the long answer.
>
> Short answer is: when you parse, get the Outlinks and add them to the
> ParseData -> Parse object. They will then be added automatically to the
> CrawlDb when the UpdateDb job is run, and will be fetched when the
> next Fetch job is run.
I was attempting to do this from an HtmlParseFilter plugin, at which
point the data is already parsed and the Outlinks have already been
created. I thought there might be a way to modify the Outlinks at this
point, but I haven't found one.
It looks like the work that I'm interested in is being done in
DOMContentUtils.getOutlinks, the relevant bit of code from HtmlParser being:

    utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
    outlinks = (Outlink[])l.toArray(new Outlink[l.size()]);

Soon after, the outlinks are assigned to the ParseData object, and
there's no method to modify that array.
Is there a plugin type that would allow me to extend this without
altering the HtmlParser plugin, or at least DOMContentUtils?
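
To make the question concrete, here is a rough sketch of the kind of hook
I have in mind. It is not Nutch code: Link, ParsedPage and LinkFilter are
hypothetical stand-ins for Outlink, ParseData and the extension point I am
asking about, so none of these names or signatures should be read as an
existing Nutch API. The idea is only that, since the array can't be modified
in place, a filter would build a new object carrying the filtered array.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkRewriteSketch {

    /** Hypothetical stand-in for Outlink: a target URL plus its anchor text. */
    public static final class Link {
        public final String toUrl;
        public final String anchor;

        public Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Hypothetical stand-in for ParseData: holds the links but has no setter. */
    public static final class ParsedPage {
        private final String title;
        private final Link[] links;

        public ParsedPage(String title, Link[] links) {
            this.title = title;
            this.links = links.clone();
        }

        public String getTitle() {
            return title;
        }

        public Link[] getLinks() {
            return links.clone();
        }
    }

    /** Hypothetical extension point: return the link (possibly changed), or null to drop it. */
    public interface LinkFilter {
        Link filter(Link link);
    }

    /** Builds a new ParsedPage with the filtered links instead of mutating the old one. */
    public static ParsedPage rewrite(ParsedPage page, LinkFilter filter) {
        List<Link> kept = new ArrayList<>();
        for (Link link : page.getLinks()) {
            Link result = filter.filter(link);
            if (result != null) {
                kept.add(result);
            }
        }
        return new ParsedPage(page.getTitle(), kept.toArray(new Link[0]));
    }

    public static void main(String[] args) {
        ParsedPage page = new ParsedPage("Example", new Link[] {
            new Link("http://example.com/a", "A"),
            new Link("http://example.com/logout", "Log out"),
        });

        // Drop any link that points at a logout URL, keep the rest untouched.
        ParsedPage filtered = rewrite(page, l -> l.toUrl.endsWith("/logout") ? null : l);

        for (Link link : filtered.getLinks()) {
            System.out.println(link.toUrl); // prints http://example.com/a
        }
    }
}
```

Whether something equivalent can be done from a plugin, without touching
the parser itself, is exactly what I don't know.
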
I'm just getting acquainted with how Nutch is organized, so please be
patient if I ask an obvious question. Thanks in advance,
Ricardo J. Méndez
http://ricardo.strangevistas.net/