[ 
https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852
 ] 

Chao Yan edited comment on NUTCH-1557 at 4/13/13 1:22 AM:
----------------------------------------------------------

Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a 
plugin for Nutch to dump files from SequenceFiles, but it is still not clear 
to me which extension point it should be mounted on.
The dumper requires a mimes.type file, which contains the mapping from MIME 
types to file extensions, and it also requires a third-party library.
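
The layout of the mapping file is not spelled out in this comment. As a minimal sketch, assuming each line has the form "type/subtype ext1 ext2 ..." with "#" comments (the convention used by Apache httpd's mime.types), the lookup table could be loaded as shown here. The class name MimeExtensionMap and the method parse are hypothetical and are not taken from the attached FileDumper.java.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: builds a MIME type to file extension lookup table. */
public class MimeExtensionMap {

    /**
     * Parses lines of the ASSUMED form "type/subtype ext1 ext2 ...".
     * Blank lines, "#" comments, and types without an extension are skipped.
     * Only the first extension listed for a type is kept.
     */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> extensionByType = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                continue; // a type with no extension gives nothing to map
            }
            extensionByType.put(fields[0].toLowerCase(Locale.ROOT), fields[1]);
        }
        return extensionByType;
    }
}
```

Keeping only the first extension per type is a simplification; a real mapping may want a preferred extension chosen some other way.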
                
      was (Author: aceyan):
    Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a 
plugin for Nutch to dump files from SequenceFiles, but it is still not clear 
to me which extension point it should be mounted on.
The dumper requires a mimes.type file, which contains the mapping from MIME 
types to file extensions, and a third-party library.
                  
> File extraction and classification for any MIME types from segments
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1557
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>         Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software: 
> Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
>            Reporter: Chao Yan
>            Priority: Minor
>         Attachments: FileDumper.java, readme.txt
>
>
> The basic idea is to implement a file dumper as a plugin to extract files 
> from Nutch SequenceFiles. The file dumper should detect the content type of 
> each file and dump it into a directory chosen by that content type. Each 
> extracted file will be renamed based on information from the URL, metadata, 
> and even content. File names should be globally unique and carry the correct 
> file extension. The file dumper should also allow users to specify the 
> formats of the files they want, and it can be extended to support other 
> criteria on the extracted files. A more advanced goal is to implement it 
> with MapReduce.
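
The naming scheme described above (one directory per content type, a unique file name, the correct extension) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class DumpPathSketch, the method targetPath, the use of a SHA-1 hash of the URL as the unique name, and the "bin" fallback extension are all hypothetical and are not taken from the attached FileDumper.java.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

/** Hypothetical sketch: chooses where a dumped file would be written. */
public class DumpPathSketch {

    /**
     * Returns outputRoot/<sanitized MIME type>/<sha1 of url>.<extension>.
     * ASSUMPTION: hashing the URL is unique enough for file naming.
     */
    public static Path targetPath(Path outputRoot, String url, String mimeType,
                                  Map<String, String> extensionByType) {
        // "text/html" becomes "text_html" so it is a single safe directory name.
        String dirName = mimeType.replaceAll("[^A-Za-z0-9.+-]", "_");
        String extension = extensionByType.get(mimeType);
        if (extension == null) {
            extension = "bin"; // assumed fallback for unmapped types
        }
        return outputRoot.resolve(dirName).resolve(sha1Hex(url) + "." + extension);
    }

    /** Hex-encodes the SHA-1 digest of the text, zero-padded to 40 characters. */
    static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%040x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

A hash keeps names unique but unreadable; building the name from the URL path or metadata, as the description suggests, would need extra handling for collisions and illegal characters.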

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
