Avoid parsing unnecessary links and get a more relevant outlink list
--------------------------------------------------------------------

                 Key: NUTCH-488
                 URL: https://issues.apache.org/jira/browse/NUTCH-488
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
         Environment: Windows, Java 1.5
            Reporter: Emmanuel Joke


The NekoHTML parser uses a method to extract all outlinks from an HTML page. It 
extracts them from the HTML content based on the list of parameters defined in 
the method setConf(). This list of links is then truncated to the maximum 
number of outlinks processed per page, as defined in nutch-default.xml 
(db.max.outlinks.per.page = 100 by default), and finally it is passed through 
all the configured URL filters.

Unfortunately, a page can contain more than 100 outlinks. In that case the 
list is truncated before the URL filters run, which can remove relevant links.
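
To make the problem concrete, here is a minimal sketch of the 
truncate-then-filter order described above. It is not the Nutch code: the 
class name, the method name, and the example URLs are made up for 
illustration, and it is written for a modern JDK (9+) rather than Java 1.5.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkTruncationSketch {

    // Hypothetical stand-in for the described pipeline:
    // 1. cut the extracted list down to 'max' entries,
    // 2. run the URL filter over what is left.
    static List<String> truncateThenFilter(List<String> links, int max,
                                           Predicate<String> accept) {
        List<String> kept = new ArrayList<>();
        for (String link : links.subList(0, Math.min(max, links.size()))) {
            if (accept.test(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two image links come first in the page, the relevant link last.
        List<String> links = List.of(
            "http://example.com/logo.png",
            "http://example.com/banner.png",
            "http://example.com/article.html");
        Predicate<String> notImage = url -> !url.endsWith(".png");

        // With a limit of 2, the article link is cut before filtering.
        System.out.println(truncateThenFilter(links, 2, notImage)); // []
        // With a limit of 3, it survives.
        System.out.println(truncateThenFilter(links, 3, notImage));
        // [http://example.com/article.html]
    }
}
```

With the limit at 2, both slots are used up by links the filter rejects 
anyway, so the page ends up with no outlinks at all.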

So I've added a few options to nutch-default.xml to enable or disable the 
extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM, 
LINK).
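
The effect of such options can be sketched as follows. This is only an 
illustration under assumptions, not the attached patch: the Candidate class, 
the extract() method, and the set of disabled tags are hypothetical, and the 
real option names in nutch-default.xml are not given in this message. Written 
for a modern JDK (9+).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TagToggleSketch {

    // Hypothetical link candidate: the HTML tag it came from and its URL.
    static final class Candidate {
        final String tag;
        final String url;

        Candidate(String tag, String url) {
            this.tag = tag;
            this.url = url;
        }
    }

    // Skip candidates from disabled tags while collecting, so that they
    // never count against the per-page outlink limit.
    static List<String> extract(List<Candidate> candidates,
                                Set<String> disabledTags, int max) {
        List<String> outlinks = new ArrayList<>();
        for (Candidate c : candidates) {
            if (outlinks.size() >= max) {
                break;
            }
            if (disabledTags.contains(c.tag.toUpperCase(Locale.ROOT))) {
                continue;
            }
            outlinks.add(c.url);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<Candidate> page = List.of(
            new Candidate("img", "http://example.com/logo.png"),
            new Candidate("script", "http://example.com/app.js"),
            new Candidate("a", "http://example.com/article.html"));

        // Nothing disabled: the limit of 2 is used up by IMG and SCRIPT.
        System.out.println(extract(page, Set.of(), 2));
        // [http://example.com/logo.png, http://example.com/app.js]

        // IMG and SCRIPT disabled: the article link is kept.
        System.out.println(extract(page, Set.of("IMG", "SCRIPT"), 2));
        // [http://example.com/article.html]
    }
}
```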



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers