[ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-965:
---------------------------------------

    Attachment: NUTCH-965-v2.patch

Hi Guys,

I would ask you to comment, as this patch is not finished yet. Although I've 
made the functionality configurable via a boolean, I have intentionally not 
yet addressed the second of your points, Julien, regarding 
FetcherJob.java.
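
As a rough illustration of the boolean switch described above (a sketch only, 
not the code in the attached patch): the idea is to compare the declared 
Content-Length with the number of bytes actually stored, and to skip parsing 
when the document is truncated and the switch is on. The class name, method 
names and the property key used here are assumptions made for the example, 
and a plain Map stands in for the real configuration object.

```java
import java.util.HashMap;
import java.util.Map;

public class TruncationCheck {

    // Assumed property key, for illustration only; the patch may use another name.
    public static final String SKIP_TRUNCATED_KEY = "parser.skip.truncated";

    /** True when the declared Content-Length is larger than the bytes we kept. */
    public static boolean isTruncated(String contentLengthHeader, int fetchedBytes) {
        if (contentLengthHeader == null) {
            return false; // no declared length, so truncation cannot be detected
        }
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return declared > fetchedBytes;
        } catch (NumberFormatException e) {
            return false; // malformed header: treat the length as unknown
        }
    }

    /** Decides whether the parser should run, honouring the boolean switch. */
    public static boolean shouldParse(Map<String, String> conf,
                                      String contentLengthHeader,
                                      int fetchedBytes) {
        // Defaulting to "true" (skip truncated content) is a choice made for this sketch.
        boolean skipTruncated =
            Boolean.parseBoolean(conf.getOrDefault(SKIP_TRUNCATED_KEY, "true"));
        return !(skipTruncated && isTruncated(contentLengthHeader, fetchedBytes));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SKIP_TRUNCATED_KEY, "true");

        // 1 MB declared, only 64 KB stored: truncated, so parsing is skipped.
        System.out.println(shouldParse(conf, "1048576", 65536)); // false
        // Declared length matches what was stored: parse as usual.
        System.out.println(shouldParse(conf, "65536", 65536));   // true
    }
}
```

With the switch set to false, shouldParse returns true in both cases, which 
keeps the current behaviour of always attempting to parse.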

I see that the boolean parsing value is set in this class [1], but I would like 
you to confirm whether the code I'm writing should live under the public 
Collection object on line 138.

Once this is addressed it would be great to get a patch for trunk.

Thanks to anyone who can comment on this. 

[1] 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java?view=markup
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is 
> described here:
> http://www.mail-archive.com/[email protected]/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted 
> data caused by, for example, truncating large binary files at fetch time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
