[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
     
Doug Cutting resolved NUTCH-150:
--------------------------------

    Fix Version: 0.7.2-dev
     Resolution: Fixed

I just committed this.  Thanks, Paul!

> OutlinkExtractor extremely slow on some non-plain text
> ------------------------------------------------------
>
>          Key: NUTCH-150
>          URL: http://issues.apache.org/jira/browse/NUTCH-150
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>  Environment: All
>     Reporter: Paul Baclace
>     Priority: Minor
>      Fix For: 0.7.2-dev
>  Attachments: OutlinkExtractor.java.patch
>
> With MIME settings that aggressively parse everything by default, rather 
> than having conf/parse-plugins.xml associate parse-default with *, some 
> parse tasks took an extremely long time to finish.  For instance, a single 
> PostScript file took 9 hours to parse.  Stack traces showed the time being 
> spent in OutlinkExtractor.getOutlinks(...), during the call to the 
> regular-expression match().
> Analysis:  The regular expression matching in 
> OutlinkExtractor.getOutlinks(...) hits pathological cases with extremely 
> long runtimes when non-plain-text content is processed.
> Workaround 1:  Avoid treating non-plain-text, especially PostScript files, as 
> text or HTML.
> Workaround 2:  Send SIGQUIT (kill -SIGQUIT) to the child TaskRunner process; 
> this interrupts the match() and the process continues.  This may need to be 
> done multiple times.  (In theory, SIGQUIT is not supposed to do this, but in 
> practice it does.)
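
The analysis above comes down to a regular expression match with no upper
bound on its runtime.  One general way to bound such a match in Java is
sketched below.  This is not the attached OutlinkExtractor.java.patch and
not Nutch code: the class name BoundedRegexSketch, the exception type, and
the findAll helper are all hypothetical, and the pattern passed in is
whatever the caller supplies.  The sketch relies on java.util.regex reading
its input through CharSequence.charAt, so a wrapper that checks a deadline
on every read can abort a runaway match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: bound the wall-clock time of a regex scan.
final class BoundedRegexSketch {

    /** Thrown when a match runs past its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) { super(message); }
    }

    /** CharSequence view that checks a deadline on every character read. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex time budget exceeded");
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    /**
     * Returns every match of pattern in text, giving up once budgetMillis
     * has elapsed. Matches found before the timeout are kept.
     */
    static List<String> findAll(Pattern pattern, CharSequence text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(text, deadline));
        List<String> matches = new ArrayList<>();
        try {
            while (matcher.find()) {
                matches.add(matcher.group());
            }
        } catch (RegexTimeoutException e) {
            // Budget exhausted: stop scanning and return the partial result.
        }
        return matches;
    }
}
```

A caller would invoke, for example, BoundedRegexSketch.findAll(pattern,
text, 1000) and treat the result as possibly incomplete.  Checking the
clock on every character read is simple but adds overhead; checking every
few thousand reads would be a reasonable refinement.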

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
