[jira] Created: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Paul Baclace (JIRA) Mon, 26 Dec 2005 19:29:55 -0800

TextParser is only supposed to parse plain text, but if given postscript, it 
can take hours and then fail
---------------------------------------------------------------------------------------------------------


         Key: NUTCH-153
         URL: http://issues.apache.org/jira/browse/NUTCH-153
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.8-dev    
 Environment: all
    Reporter: Paul Baclace


If TextParser is given PostScript, it can run for hours and then fail.  Careful 
configuration avoids this, but if the server reports the wrong MIME type and 
the basename of the URL has no "file extension", then this parser will take 
a long time and fail every time.

Analysis: The real problem is in OutlinkExtractor.java, as reported in bug 
NUTCH-150, but that patch cannot entirely address the problem, since the 
first call to the regular expression match() can take a long time despite 
quantifier limits.

Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of the 
file.
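
A minimal sketch of the suggested check follows. It is not taken from the
Nutch code base: the class name PostScriptSniffer and the method
looksLikePostScript are hypothetical, and it assumes the parser has the raw
content available as a byte array. It also assumes the whole marker must fall
within the first 40 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: detects a PostScript header
// near the start of content that was handed to a plain-text parser.
final class PostScriptSniffer {
    private static final String PS_MAGIC = "%!PS-Adobe";
    private static final int SNIFF_LENGTH = 40; // "first 40 characters" from the report

    private PostScriptSniffer() {}

    static boolean looksLikePostScript(byte[] content) {
        if (content == null) {
            return false;
        }
        int n = Math.min(content.length, SNIFF_LENGTH);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail
        // and the ASCII marker is preserved as-is.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MAGIC);
    }
}
```

A text parser could call looksLikePostScript(content) before doing any other
work and return a failed parse status immediately when it is true.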

Experience has shown that, for safety, it is worth protecting against GIGO 
(garbage in, garbage out) directly in TextParser for this case, even though 
the suggested fix is not a general solution.  (A general solution would be a 
timeout on match().)
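
One way such a timeout could be approximated is sketched below. This is an
assumption about how it might be done, not the approach used in Nutch:
java.util.regex has no built-in timeout, so the sketch wraps the input in a
CharSequence whose charAt() throws once a deadline has passed. The names
TimeLimitedCharSequence, withTimeout and MatchTimeoutException are
hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper: the regex engine reads its input through charAt(),
// so checking the clock there aborts a runaway match.
final class TimeLimitedCharSequence implements CharSequence {

    // Unchecked exception thrown when the deadline has passed.
    static final class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String message) {
            super(message);
        }
    }

    private final CharSequence inner;
    private final long deadlineNanos;

    TimeLimitedCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    static TimeLimitedCharSequence withTimeout(CharSequence inner, long timeoutMillis) {
        return new TimeLimitedCharSequence(
                inner, System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    @Override
    public char charAt(int index) {
        // Compare by subtraction so the check stays correct if nanoTime() wraps.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new MatchTimeoutException("regex match exceeded its time limit");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Sub-sequences share the same deadline as the original input.
        return new TimeLimitedCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }

    // Usage example: returns false instead of hanging when the match times out.
    static boolean findWithTimeout(Pattern pattern, CharSequence text, long timeoutMillis) {
        Matcher m = pattern.matcher(withTimeout(text, timeoutMillis));
        try {
            return m.find();
        } catch (MatchTimeoutException e) {
            return false;
        }
    }
}
```

Calling System.nanoTime() on every character adds overhead; a variant could
check the clock only every few thousand calls.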



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
