[
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexis updated NUTCH-965:
-------------------------
Attachment: parserJob.patch
In the parser mapper, compare the Content-Length HTTP header with the size of
the content buffer to see whether they match.
If the header is present and the file was truncated, skip the parsing step so
that the parser does not get stuck in an infinite loop that consumes all the
CPU resources.
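The check described above can be sketched roughly as follows. This is not the code from parserJob.patch: the class name TruncationCheck, the method isTruncated, and the use of a plain Map for the headers are assumptions made for illustration only. The sketch also treats content as truncated only when the declared length is larger than the buffer, which is a design choice rather than something the patch description states.

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch or of the attached patch.
class TruncationCheck {

    /**
     * Returns true when a Content-Length header is present, parses as a
     * number, and declares more bytes than were actually stored.
     */
    static boolean isTruncated(Map<String, String> headers, byte[] content) {
        // Exact-key lookup; a real header container would be case-insensitive.
        String lengthHeader = headers.get("Content-Length");
        if (lengthHeader == null) {
            return false; // header missing: cannot tell, so let the parser run
        }
        long declared;
        try {
            declared = Long.parseLong(lengthHeader.trim());
        } catch (NumberFormatException e) {
            return false; // malformed header: treat the length as unknown
        }
        // Only a larger declared size counts as truncation (assumption).
        return declared > content.length;
    }
}
```

With the values from the first log entry below (Content-Length 4527822, buffer of 63980 bytes), isTruncated returns true, and the mapper would skip the parse instead of handing the partial file to the parser.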
Before, in the logs, we would see:
{noformat}
2011-02-07 14:03:34,693 WARN parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat}
After:
{noformat}
2011-02-08 09:06:54,482 INFO parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat}
> Parsing takes up 100% CPU
> -------------------------
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Alexis
> Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is
> described here:
> http://www.mail-archive.com/[email protected]/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted
> data caused, for example, by truncating large binary files at fetch time.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira