[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630674#comment-13630674 ]
Lewis John McGibbney commented on NUTCH-1557:
---------------------------------------------
Hi Chao,
Do you have any patch proposal for this?
What is your requirement behind this issue?
> File extraction and classification for any MIME types from segments
> -------------------------------------------------------------------
>
> Key: NUTCH-1557
> URL: https://issues.apache.org/jira/browse/NUTCH-1557
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.6
> Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software:
> Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
> Reporter: Chao Yan
> Priority: Minor
>
> The basic idea is to implement a file dumper as a plugin to extract files from
> Nutch SequenceFiles. The file dumper should detect each file's content type and
> dump the files into different directories based on that type. Each extracted file
> will be renamed based on information from the URL, metadata, and even content.
> File names should be globally unique and carry the correct file extension. The
> file dumper should also allow users to specify the formats of the files they want,
> and should be extensible to support any criteria on the extracted files. A more
> advanced goal is to implement it with MapReduce.
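
For illustration only, the behaviour described above could be sketched roughly as follows. This is a minimal Python sketch, not the proposed Nutch plugin: it assumes the segment records have already been read into (url, content_type, bytes) tuples, it trusts the supplied content type instead of detecting it, and the names `dump`, `unique_name`, and `EXTENSIONS` are hypothetical.

```python
import hashlib
import mimetypes
import re
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Hypothetical, deliberately small MIME-to-extension table; anything missing
# falls back to the standard library's guess, then to ".bin".
EXTENSIONS = {
    "application/pdf": ".pdf",
    "text/html": ".html",
    "text/plain": ".txt",
    "image/png": ".png",
}


def unique_name(url, mime, content):
    """Build a file name from the URL path plus a hash of URL and content."""
    stem = PurePosixPath(urlparse(url).path).stem or "index"
    stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem)
    digest = hashlib.sha1(url.encode("utf-8") + content).hexdigest()[:12]
    ext = EXTENSIONS.get(mime) or mimetypes.guess_extension(mime) or ".bin"
    return f"{stem}-{digest}{ext}"


def dump(records, out_dir, wanted=None):
    """Write each record under out_dir/<mime with '/' replaced by '_'>/.

    records: iterable of (url, content_type, content_bytes) tuples.
    wanted:  optional set of MIME types to keep; None keeps everything.
    Returns the list of paths written.
    """
    written = []
    for url, content_type, content in records:
        # Drop parameters such as "; charset=utf-8" before classifying.
        mime = content_type.split(";")[0].strip().lower()
        if wanted is not None and mime not in wanted:
            continue
        target_dir = Path(out_dir) / mime.replace("/", "_")
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / unique_name(url, mime, content)
        target.write_bytes(content)
        written.append(target)
    return written
```

Calling `dump(records, "out", wanted={"application/pdf"})` would keep only PDF records. Hashing the URL together with the content keeps names distinct even when two hosts serve a file with the same base name.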