[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Paul Baclace (JIRA) Fri, 06 Jan 2006 12:00:39 -0800

    [ 
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362000 ]


Paul Baclace commented on NUTCH-153:
------------------------------------


> mime.type.magic?

The particular run that had problems was using mime.type.magic=true.  It turns 
out that the magic "%!PS-Adobe" was preceded by some spaces, so it was not 
recognized.
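
To illustrate the leading-space case, here is a minimal sketch of a check that 
tolerates padding before the magic by searching a short window instead of 
requiring it at offset 0.  The class and method names are hypothetical and are 
not taken from the attached patch; the 40-character window is the one proposed 
in the issue description.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: detects a PostScript header even
// when it is preceded by spaces or other padding.
class PostScriptSniffer {
    // Magic string quoted in the issue.
    static final String PS_MAGIC = "%!PS-Adobe";
    // Window size from the issue's suggested fix ("first 40 characters").
    static final int WINDOW = 40;

    /** Returns true if the magic lies entirely within the first WINDOW bytes. */
    static boolean looksLikePostScript(byte[] content) {
        int len = Math.min(content.length, WINDOW);
        // ISO-8859-1 maps every byte to one char, so no decoding can fail.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MAGIC);
    }

    public static void main(String[] args) {
        byte[] padded = "   %!PS-Adobe-3.0\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikePostScript(padded));
    }
}
```

A parser could call such a check before doing any other work and decline the 
content when it returns true.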

The intent of this bug is that no matter why some content is passed to 
TextParser, there should not be pathological cases that take too long to 
process.  (Parsing one file for hours is equivalent to a fatal error.)  There 
are per-file space limits on parsing (first N bytes), but the only time limit 
is at the Task level (an hour of inactivity), and it is fatal on the third 
(default) attempt.

It makes sense to have non-fatal per-file time limits on parsers when regular 
expressions (OutlinkExtractor) are used, since some regexps are prone to 
pathological cases that take a long time instead of blowing up a stack.
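
One way such a non-fatal time limit might be sketched in Java is to wrap the 
input in a CharSequence that checks a deadline on every character access, so a 
runaway match is abandoned with an exception that the caller can catch.  This 
is only an illustration: all names below are hypothetical, none of it comes 
from OutlinkExtractor or the attached patch, and it assumes the regex engine 
reads its input through charAt(), which is observed behavior of 
java.util.regex rather than a documented guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a per-file time limit on regex matching.
class TimeLimitedMatch {
    /** Thrown when the deadline passes while the matcher is reading input. */
    static class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException(String msg) { super(msg); }
    }

    /** CharSequence wrapper that checks a deadline on every charAt() call. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new MatchTimeoutException("regex match exceeded time limit");
            }
            return inner.charAt(index);
        }
        @Override public int length() { return inner.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }
        @Override public String toString() { return inner.toString(); }
    }

    /** Counts matches, or returns -1 if the time limit was hit (non-fatal). */
    static int countMatches(Pattern pattern, CharSequence text, long limitMillis) {
        long deadline = System.nanoTime() + limitMillis * 1_000_000L;
        Matcher m = pattern.matcher(new DeadlineCharSequence(text, deadline));
        int count = 0;
        try {
            while (m.find()) {
                count++;
            }
        } catch (MatchTimeoutException e) {
            return -1; // caller can skip this file instead of failing the task
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern url = Pattern.compile("http://\\S+");
        System.out.println(countMatches(url, "see http://a.example for details", 1000));
    }
}
```

The deadline check costs one clock read per character access; a real 
implementation would probably check only every so many calls.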

> strings command line like parser [filter]

This is a related and good idea, but a different beast.  The idea is to improve 
recall by grabbing marginal shreds of tokens out of files with unknown formats.  
For this to be effective and not annoying, it needs a threshold for the minimal 
percentage of content found, or minimal density, before accepting any tokens 
from a particular file, in order to reject binary files that just happen to 
hit upon reasonable strings.

(Reasonableness depends on charset/language, as pointed out by KuroSaka 
TeruHiko, but minimal ASCII, a.k.a. romaji, would be the most effective 
worldwide.)

It also should have a way to set the weight of the tokens found that would take 
into account the density of reasonable tokens.  That is, a similarly sized 
f.txt would rank higher than a mystery-format f.huh with the same number of 
token matches plus 70% binary.
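
The density threshold and the density-based weight could be sketched roughly 
as follows.  Everything here is an assumption made for illustration: the class 
name, the ASCII-letters-only notion of a "reasonable" token, the minimum token 
length of 4 (mirroring the default of the Unix strings tool), the 10% 
threshold, and the choice to use the density itself as the weight.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a strings-like filter with a density threshold.
class StringsLikeFilter {
    static final int MIN_TOKEN_LEN = 4;     // assumed minimum token length
    static final double MIN_DENSITY = 0.10; // assumed acceptance threshold

    /** Accepted tokens plus a weight derived from token density. */
    static final class Result {
        final List<String> tokens;
        final double weight;
        Result(List<String> tokens, double weight) {
            this.tokens = tokens;
            this.weight = weight;
        }
    }

    static Result extract(byte[] content) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int covered = 0; // bytes that fall inside accepted tokens
        // Iterate one step past the end so the final run is flushed.
        for (int i = 0; i <= content.length; i++) {
            int b = (i < content.length) ? (content[i] & 0xFF) : -1;
            boolean letter = (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
            if (letter) {
                run.append((char) b);
            } else {
                if (run.length() >= MIN_TOKEN_LEN) {
                    tokens.add(run.toString());
                    covered += run.length();
                }
                run.setLength(0);
            }
        }
        double density = content.length == 0 ? 0.0 : (double) covered / content.length;
        if (density < MIN_DENSITY) {
            // Too sparse: treat the hits as accidental and reject the file.
            return new Result(Collections.<String>emptyList(), 0.0);
        }
        // Same tokens in a mostly binary file yield a lower weight.
        return new Result(tokens, density);
    }

    public static void main(String[] args) {
        Result r = extract("hello world from nutch".getBytes());
        System.out.println(r.tokens + " weight=" + r.weight);
    }
}
```

With this weighting, a plain-text file and a mostly binary file containing the 
same tokens both pass the threshold, but the plain-text file gets the higher 
weight, which is the f.txt versus f.huh ordering described above.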



> TextParser is only supposed to parse plain text, but if given postscript, it 
> can take hours and then fail
> ---------------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-153
>          URL: http://issues.apache.org/jira/browse/NUTCH-153
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: all
>     Reporter: Paul Baclace
>  Attachments: TextParser.java.patch
>
> If TextParser is given postscript, it can take hours and then fail.  This can 
> be avoided with careful configuration, but if the server MIME type is wrong 
> and the basename of the URL has no "file extension", then this parser 
> will take a long time and fail every time.
> Analysis: The real problem is OutlinkExtractor.java as reported with bug 
> NUTCH-150, but the problem cannot be entirely addressed with that patch since 
> the first call to reg expr match() can take a long time, despite quantifier 
> limits.  
> Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of 
> the file.
> Actual experience has shown that for safety and fail-safe reasons, it is 
> worth protecting against GIGO directly in TextParser for this case, even 
> though the suggested fix is not a general solution.  (A general solution 
> would be a timeout on match().)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
