[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Doug Cutting resolved NUTCH-150:
--------------------------------
Fix Version: 0.7.2-dev
Resolution: Fixed
I just committed this. Thanks, Paul!
> OutlinkExtractor extremely slow on some non-plain text
> ------------------------------------------------------
>
> Key: NUTCH-150
> URL: http://issues.apache.org/jira/browse/NUTCH-150
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Environment: All
> Reporter: Paul Baclace
> Priority: Minor
> Fix For: 0.7.2-dev
> Attachments: OutlinkExtractor.java.patch
>
> While using MIME settings that aggressively parse everything by default,
> rather than having conf/parse-plugins.xml associate parse-default with *,
> some parse tasks took an incredibly long time to finish. For instance, a
> single PostScript file took 9 hours to parse. Stack traces indicated this to
> be a problem with OutlinkExtractor.getOutlinks(...) during the call to the
> regular expression match().
> Analysis: The regular expression matching in
> OutlinkExtractor.getOutlinks(...) encounters pathological cases which have
> extremely long runtimes when non-plain-text is processed.
> Workaround 1: Avoid treating non-plain-text, especially PostScript files, as
> text or html.
> Workaround 2: kill -SIGQUIT the child TaskRunner process. This will
> interrupt the match() and the process will continue. This might need to be
> done multiple times. (In theory, SIGQUIT is not supposed to do this, but in
> practice it does.)
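
For illustration only, one general way to keep a runaway regular expression match from stalling a task is to give the matcher a time budget. The sketch below is not the content of the attached patch and does not use the OutlinkExtractor code. The class and method names (BoundedOutlinkScan, DeadlineCharSequence, MatchTimeoutException, findAll) and the sample URL pattern are hypothetical. It wraps the input in a CharSequence that checks a deadline on every character access, so a match that backtracks for too long is abandoned and the links found so far are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex scanning with a wall-clock budget. */
final class BoundedOutlinkScan {

    /** Thrown from charAt() once the time budget is used up. */
    static final class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String message) { super(message); }
    }

    /** CharSequence view that checks a deadline on each character read. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // Subtraction keeps the comparison safe if nanoTime wraps around.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new MatchTimeoutException("regex time budget exceeded");
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    /** Deliberately simple sample pattern; an assumption, not Nutch's regex. */
    static final Pattern SAMPLE_URL = Pattern.compile("https?://[^\\s\"'<>]+");

    /**
     * Returns every match of pattern in text, giving up once budgetMillis
     * has elapsed. Matches found before the deadline are still returned.
     */
    static List<String> findAll(Pattern pattern, CharSequence text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(new DeadlineCharSequence(text, deadline));
        try {
            while (m.find()) {
                matches.add(m.group());
            }
        } catch (MatchTimeoutException e) {
            // Budget exhausted: stop scanning and keep the partial result.
        }
        return matches;
    }
}
```

A call such as BoundedOutlinkScan.findAll(BoundedOutlinkScan.SAMPLE_URL, text, 60000L) would cap the scan of one document at roughly a minute. The deadline check on every character read adds some overhead, so this trades a little speed on ordinary input for a bound on the worst case.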