[Nutch-dev] [jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Paul Baclace (JIRA) Mon, 09 Jan 2006 16:57:09 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362272 ]


Paul Baclace commented on NUTCH-153:
------------------------------------

> NUTCH-160?

There is slowness and then there is continental drift.  Quantifier limits 
should be used with any regex package unless the limit itself adds a 
significant cost during match().  

The general solution is non-fatal per-file time limits on parsers, at least 
when regular expressions (OutlinkExtractor) are used.  That is, spawn a daemon 
thread as an alarm to interrupt() the thread doing match().  
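
The alarm idea above can be sketched as follows.  This is only an illustration 
under stated assumptions, not the proposed patch: TimedRegex, 
MatchTimeoutException and findWithTimeout are hypothetical names, and it 
targets java.util.regex, whose matcher does not itself react to interrupt().  
The sketch therefore wraps the input in a CharSequence that checks the 
interrupt flag on every charAt() call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
final class TimedRegex {

    /** Thrown from charAt() once the alarm has interrupted the matching thread. */
    static final class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String msg) { super(msg); }
    }

    /** View of the input that aborts matching when the thread is interrupted. */
    private static final class InterruptibleChars implements CharSequence {
        private final CharSequence inner;
        InterruptibleChars(CharSequence inner) { this.inner = inner; }
        @Override public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new MatchTimeoutException("regex match interrupted");
            }
            return inner.charAt(index);
        }
        @Override public int length() { return inner.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new InterruptibleChars(inner.subSequence(start, end));
        }
        @Override public String toString() { return inner.toString(); }
    }

    // One daemon thread acts as the alarm for all callers.
    private static final ScheduledExecutorService ALARM =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "regex-alarm");
            t.setDaemon(true);
            return t;
        });

    /** Returns true if the pattern is found; false if not found or on timeout. */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        final Thread worker = Thread.currentThread();
        ScheduledFuture<?> alarm =
            ALARM.schedule(worker::interrupt, timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return pattern.matcher(new InterruptibleChars(input)).find();
        } catch (MatchTimeoutException e) {
            return false; // non-fatal: treat a timeout as "no match"
        } finally {
            alarm.cancel(false);
            // Clear the flag the alarm may have set. This also swallows
            // unrelated interrupts, and a late alarm could still fire after
            // this line; a real patch would have to close that race.
            Thread.interrupted();
        }
    }
}
```

A caller would use something like findWithTimeout(pattern, text, 5000) in 
place of pattern.matcher(text).find(); the 5000 ms limit is an arbitrary 
example value.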

I could make a match() timeout patch, but I have also seen a case where tagsoup 
spent a huge amount of time parsing files of type text/vnd.viewcvs-markup; I 
don't know what causes the problem, but this MIME type must be high in 
tortuosity since Chandler's mime-torture tests include many examples.  Thus, a 
general solution of non-fatal per-file time limits on parsing files would be 
better placed to take care of present and future problems of this type.
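
A per-file limit of this kind could look roughly like the sketch below.  It is 
an assumption-laden illustration, not Nutch code: TimeLimitedParse and 
parseWithLimit are hypothetical names, and the parser is represented by a plain 
Callable.  Note that cancel(true) only interrupts the worker; a parser that 
never checks for interruption keeps running in the background, which is why the 
workers are daemon threads here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of Nutch.
final class TimeLimitedParse {

    // Daemon workers so a runaway parse cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs parseTask and returns its result, or fallback if the task exceeds
     * limitMillis or throws. Neither case is fatal to the caller.
     */
    static <T> T parseWithLimit(Callable<T> parseTask, long limitMillis, T fallback) {
        Future<T> future = POOL.submit(parseTask);
        try {
            return future.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; it may ignore this
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the parser itself failed
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            return fallback;
        }
    }
}
```

The fallback value stands in for whatever "parse failed" result the caller 
already handles, so a slow file is skipped rather than stalling the run.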



> TextParser is only supposed to parse plain text, but if given postscript, it 
> can take hours and then fail
> ---------------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-153
>          URL: http://issues.apache.org/jira/browse/NUTCH-153
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: all
>     Reporter: Paul Baclace
>  Attachments: TextParser.java.patch
>
> If TextParser is given postscript, it can take hours and then fail.  This can 
> be avoided with careful configuration, but if the server MIME type is wrong 
> and the basename of the URL has no "file extension", then this parser 
> will take a long time and fail every time.
> Analysis: The real problem is OutlinkExtractor.java as reported with bug 
> NUTCH-150, but the problem cannot be entirely addressed with that patch since 
> the first call to reg expr match() can take a long time, despite quantifier 
> limits.  
> Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of 
> the file.
> Actual experience has shown that for safety and fail-safe reasons, it is 
> worth protecting against GIGO directly in TextParser for this case, even 
> though the suggested fix is not a general solution.  (A general solution 
> would be a timeout on match().)
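
The suggested fix quoted above amounts to a short content sniff before parsing.  
The following is a minimal sketch of that check, not the attached 
TextParser.java.patch: PostScriptSniffer and looksLikePostScript are 
hypothetical names, and it assumes the raw content is available as a byte 
array and that the marker must lie entirely within the first 40 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the attached patch.
final class PostScriptSniffer {

    private static final String PS_MAGIC = "%!PS-Adobe";
    private static final int SNIFF_LENGTH = 40;

    /** Returns true if the marker appears wholly within the first 40 bytes. */
    static boolean looksLikePostScript(byte[] content) {
        int n = Math.min(content.length, SNIFF_LENGTH);
        // ISO-8859-1 maps each byte to one char, so decoding cannot fail
        // or shift offsets on arbitrary binary input.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MAGIC);
    }
}
```

A text parser could call looksLikePostScript(content) first and return a 
failed parse immediately when it is true, before any regular expression runs.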

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers