[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Sami Siren closed NUTCH-150.
----------------------------

> OutlinkExtractor extremely slow on some non-plain text
> ------------------------------------------------------
>
>                 Key: NUTCH-150
>                 URL: http://issues.apache.org/jira/browse/NUTCH-150
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: All
>            Reporter: Paul Baclace
>            Priority: Minor
>             Fix For: 0.7.2
>
>         Attachments: OutlinkExtractor.java.patch
>
>
> While using MIME settings that aggressively parsed everything by default,
> rather than having conf/parse-plugins.xml associate parse-default with *,
> some parse tasks took an incredibly long time to finish. For instance, a
> single PostScript file took 9 hours to parse. Stack traces indicated that
> the time was being spent in OutlinkExtractor.getOutlinks(...), during the
> call to the regular expression match().
>
> Analysis: The regular expression matching in
> OutlinkExtractor.getOutlinks(...) encounters pathological cases that have
> extremely long runtimes when non-plain-text content is processed.
>
> Workaround 1: Avoid treating non-plain-text content, especially PostScript
> files, as text or HTML.
>
> Workaround 2: Send SIGQUIT (kill -SIGQUIT) to the child TaskRunner process.
> This interrupts the match() and the process continues. It might need to be
> done multiple times. (In theory, SIGQUIT is not supposed to do this, but in
> practice it does.)
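
The analysis above describes a regular expression match that runs for hours on input it was never meant for. One common way to bound such a match in Java is to hand the matcher a CharSequence that checks a deadline on every character read. This is only a minimal sketch of that general technique, not the contents of OutlinkExtractor.java.patch (which are not shown in this message). The names BoundedLinkScan, DeadlineCharSequence, ScanTimeoutException and findLinks are hypothetical, and the URL pattern is a simplified stand-in, not the expression Nutch uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: scans text for links under a time budget.
final class BoundedLinkScan {

    /** Thrown from charAt() once the time budget is used up. */
    static final class ScanTimeoutException extends RuntimeException {
        ScanTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that checks a deadline before every character read. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new ScanTimeoutException("regex scan exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Simplified illustrative pattern (assumption), not the one used by OutlinkExtractor.
    private static final Pattern URL = Pattern.compile("https?://[\\w./%-]+");

    /** Returns the links found before the budget ran out; never blocks much longer than that. */
    static List<String> findLinks(String text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher matcher = URL.matcher(new DeadlineCharSequence(text, deadline));
        List<String> links = new ArrayList<>();
        try {
            while (matcher.find()) {
                links.add(matcher.group());
            }
        } catch (ScanTimeoutException e) {
            // Budget exhausted: stop scanning and keep whatever was found so far.
        }
        return links;
    }
}
```

A caller would use something like BoundedLinkScan.findLinks(text, 1000) to allow at most about one second per document. Because the matcher reads its input through charAt(), a runaway match is abandoned from inside the matching loop, without sending signals to the TaskRunner process.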
