[jira] Closed: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

Sami Siren (JIRA) Tue, 24 Oct 2006 08:35:20 -0700

     [ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]


Sami Siren closed NUTCH-150.
----------------------------


> OutlinkExtractor extremely slow on some non-plain text
> ------------------------------------------------------
>
>                 Key: NUTCH-150
>                 URL: http://issues.apache.org/jira/browse/NUTCH-150
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: All
>            Reporter: Paul Baclace
>            Priority: Minor
>             Fix For: 0.7.2
>
>         Attachments: OutlinkExtractor.java.patch
>
>
> With mime settings that aggressively parsed everything by default, 
> rather than having conf/parse-plugins.xml associate parse-default with *, 
> some parse tasks took an extremely long time to finish.  For instance, a 
> single PostScript file took 9 hours to parse.  Stack traces indicated that 
> the time was spent in OutlinkExtractor.getOutlinks(...), inside the regular 
> expression match() call.
> Analysis:  The regular expression matching in 
> OutlinkExtractor.getOutlinks(...) hits pathological cases with 
> extremely long runtimes when non-plain-text content is processed.
> Workaround 1:  Avoid treating non-plain-text content, especially PostScript 
> files, as text or HTML.
> Workaround 2:  Send SIGQUIT (kill -SIGQUIT) to the child TaskRunner process.  
> This interrupts the match() and the process continues.  This might need to 
> be done multiple times.  (In theory, SIGQUIT is not supposed to do this, but 
> in practice it does.)
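
The analysis above points at unbounded regular expression matching on binary-like input. One general way to bound such a match in Java is to wrap the input in a CharSequence that checks a deadline on every character access, so that a runaway Matcher.find() is aborted instead of running for hours. The sketch below illustrates that technique only; it is not the attached OutlinkExtractor.java.patch. The class name BoundedOutlinkScan, the exception type, the time budget, and the simplified URL pattern are all assumptions made for illustration and do not come from Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedOutlinkScan {

    /** Thrown when a scan exceeds its time budget (hypothetical type). */
    static class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String msg) { super(msg); }
    }

    /** CharSequence wrapper that checks a deadline on every charAt() call. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // The regex engine reads its input through charAt(), so this
            // check runs continuously while a match is being attempted.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new MatchTimeoutException("regex match exceeded time budget");
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    // Simplified URL pattern for illustration only; NOT the pattern used by Nutch.
    private static final Pattern URL =
        Pattern.compile("https?://[\\w.-]+(?:/[\\w./?%&=-]*)?");

    /** Returns the links found before the budget ran out; may be partial. */
    static List<String> extract(String text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(new DeadlineCharSequence(text, deadline));
        try {
            while (m.find()) {
                links.add(m.group());
            }
        } catch (MatchTimeoutException e) {
            // Budget exhausted: keep whatever was found so far and stop.
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extract("see http://example.com/a and https://apache.org", 1000));
    }
}
```

With a wrapper like this, a pathological input costs at most roughly the configured budget per document, at the price of possibly returning an incomplete list of outlinks for that document.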

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
