Hi again,
Dennis Kubes wrote:
> Nutch gets outlinks from the pages it parses. This is either during the
> fetch process with parsing enabled or during just a parse process (see
> org.apache.nutch.parse.ParseSegment). The content is parsed via plugins
> configured in parse-plugins.xml in the conf directory. During the parse
> links are created as Outlink objects that are added to a ParseData
> object that is itself added to a Parse object. During the writing out
> of the parse object (ParseOutputFormat) the outlinks are saved as
> CrawlDatums in the crawl_parse directory under the segment. Then during
> the UpdateDb job (see CrawlDb) this crawl_parse is merged into the
> master Crawl Database. That is the long answer.
>
> Short answer: when you parse, get the Outlinks and add them to the
> ParseData -> Parse object. They will then be merged automatically into
> the CrawlDb when the UpdateDb job is run, and fetched when the next
> Fetch job is run.
I was attempting to do this from an HtmlParseFilter plugin, at which
point the data is already parsed and the Outlinks have already been
created. I thought there might be a way to modify the Outlinks at this
point, but I haven't found one.
It looks like the work that I'm interested in is being done in
DOMContentUtils.getOutlinks, the relevant bit of code from HtmlParser being:
    utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
    outlinks = (Outlink[]) l.toArray(new Outlink[l.size()]);
Soon after, the outlinks are assigned to the ParseData object, and
I can't find a method to modify that array.
Is there a plugin type that would allow me to extend this without
altering the HtmlParse plugin, or at least DOMContentUtils?
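
To make the question concrete, here is a rough sketch of the kind of thing I
would like to do. It uses stand-in classes, not the real Nutch types: Link,
ParsedPage and withFilteredOutlinks are hypothetical names I made up for
illustration, and I am only assuming that the parse data could be rebuilt
with a new outlink array. Whether an HtmlParseFilter can actually swap in
such a rebuilt object is exactly what I don't know.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkFilterSketch {

    /** Hypothetical stand-in for an outlink: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Hypothetical stand-in for parse data: outlinks are fixed at construction. */
    static final class ParsedPage {
        private final Link[] outlinks;

        ParsedPage(Link[] outlinks) {
            this.outlinks = outlinks.clone();
        }

        /** Returns a copy, so callers cannot modify the stored array. */
        Link[] getOutlinks() {
            return outlinks.clone();
        }
    }

    /**
     * Since the stored array cannot be changed, build a new ParsedPage
     * that contains only the outlinks accepted by the predicate.
     */
    static ParsedPage withFilteredOutlinks(ParsedPage page, Predicate<Link> keep) {
        List<Link> kept = new ArrayList<>();
        for (Link link : page.getOutlinks()) {
            if (keep.test(link)) {
                kept.add(link);
            }
        }
        return new ParsedPage(kept.toArray(new Link[0]));
    }

    public static void main(String[] args) {
        ParsedPage page = new ParsedPage(new Link[] {
            new Link("http://example.org/a", "A"),
            new Link("http://ads.example.com/x", "ad")
        });

        // Keep only links that stay on example.org.
        ParsedPage filtered = withFilteredOutlinks(
            page, link -> link.toUrl.startsWith("http://example.org/"));

        for (Link link : filtered.getOutlinks()) {
            System.out.println(link.toUrl);
        }
    }
}
```

The point of the sketch is only that the filtering itself is trivial. What I
am missing is the hook where the rebuilt object would replace the original.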
I'm just getting acquainted with how Nutch is organized, so please be
patient if I ask an obvious question. Thanks in advance,
Ricardo J. Méndez
http://ricardo.strangevistas.net/
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general