[ 
https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630674#comment-13630674
 ] 

Lewis John McGibbney commented on NUTCH-1557:
---------------------------------------------

Hi Chao,
Do you have any patch proposal for this?
What is your requirement behind this issue?
                
> File extraction and classification for any MIME types from segments
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1557
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>         Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software: 
> Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
>            Reporter: Chao Yan
>            Priority: Minor
>
> The basic idea is to implement a file dumper as a plugin to extract files from 
> Nutch SequenceFiles. The file dumper should detect each file's content type and 
> dump the files into different directories based on that type. Each extracted 
> file will be renamed based on information from its URL, metadata, and even 
> content. File names should be globally unique and carry the correct file 
> extension. The file dumper should also allow users to specify the formats of 
> the files they want, and should be extensible to support arbitrary criteria on 
> the extracted files. A more advanced goal is to implement it with MapReduce.
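
To make the naming and directory scheme described above concrete, here is a minimal sketch in plain Java. It is not taken from any patch and uses no Nutch APIs: the class DumpPathSketch, its methods, the extension table, and the choice of a SHA-256 digest of the URL as the file name are all assumptions made for illustration. It also assumes the MIME type has already been detected and carries no parameters such as "; charset=utf-8".

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: maps (URL, MIME type) to an output path.
final class DumpPathSketch {

    // Illustrative MIME-type-to-extension table; a real dumper would need a fuller one.
    private static final Map<String, String> EXTENSIONS = new HashMap<String, String>();
    static {
        EXTENSIONS.put("text/html", "html");
        EXTENSIONS.put("application/pdf", "pdf");
        EXTENSIONS.put("image/jpeg", "jpg");
        EXTENSIONS.put("image/png", "png");
    }

    private DumpPathSketch() { }

    // True if the user asked for this type; an empty set is treated as "dump everything".
    static boolean accept(String mimeType, Set<String> wanted) {
        return wanted.isEmpty() || wanted.contains(mimeType);
    }

    // Hex-encoded SHA-256 of the URL, so distinct URLs map to distinct names
    // (barring hash collisions).
    static String urlDigest(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(url.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds <outDir>/<type>/<subtype>/<digest>.<ext>; unknown types fall back to ".bin".
    static Path targetPath(Path outDir, String url, String mimeType) {
        String ext = EXTENSIONS.containsKey(mimeType) ? EXTENSIONS.get(mimeType) : "bin";
        String[] parts = mimeType.split("/", 2);
        Path dir = outDir.resolve(parts[0]);
        if (parts.length == 2) {
            dir = dir.resolve(parts[1]);
        }
        return dir.resolve(urlDigest(url) + "." + ext);
    }
}
```

A caller would check accept(...) for each record read from a segment and, if it passes, write the content bytes to targetPath(...). Naming by URL digest is only one way to get unique names; the description also mentions metadata and content as possible inputs.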

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
