[
https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852
]
Chao Yan commented on NUTCH-1557:
---------------------------------
Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a
plugin for Nutch that dumps files from SequenceFiles, but it is still not clear
to me which extension point it should be mounted on.
The dumper requires two things: a mimes.type file, which contains the mapping
from MIME types to file extensions, and a third-party library.
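
To illustrate the mapping file mentioned above, here is a minimal sketch of how
such a file could be loaded. The actual layout of the mimes.type file is not
shown in this message, so the sketch assumes the common Apache-style layout: one
MIME type per line, followed by one or more extensions, with '#' starting a
comment line. The class name MimeExtensionMap and its parse method are
hypothetical and are not taken from the attached FileDumper.java.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of Nutch or of the attached FileDumper.java.
public class MimeExtensionMap {

    // Parses lines of the ASSUMED form "<mime-type> <ext1> [<ext2> ...]".
    // Blank lines and lines starting with '#' are skipped.
    // Only the first extension listed for a type is kept.
    public static Map<String, String> parse(Reader source) throws IOException {
        Map<String, String> extensions = new HashMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length < 2) {
                continue; // a type with no extension listed cannot be mapped
            }
            extensions.put(fields[0].toLowerCase(Locale.ROOT), fields[1]);
        }
        return extensions;
    }

    public static void main(String[] args) throws IOException {
        String sample = "# sample mapping\n"
            + "application/pdf pdf\n"
            + "image/jpeg jpeg jpg jpe\n";
        Map<String, String> map = parse(new StringReader(sample));
        System.out.println(map.get("application/pdf"));
        System.out.println(map.get("image/jpeg"));
    }
}
```

Keeping only the first extension per type is a simplification for the sketch; a
real mapping file may need a different rule for choosing the preferred
extension.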
> File extraction and classification for any MIME types from segments
> -------------------------------------------------------------------
>
> Key: NUTCH-1557
> URL: https://issues.apache.org/jira/browse/NUTCH-1557
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.6
> Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software:
> Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
> Reporter: Chao Yan
> Priority: Minor
> Attachments: FileDumper.java, readme.txt
>
>
> The basic idea is to implement a file dumper as a plugin to extract files
> from Nutch SequenceFiles. The file dumper should detect the content type of
> each file and dump the files into different directories based on content
> type. Each extracted file will be renamed based on information from the URL,
> metadata, and even content. File names should be globally unique and carry
> the correct file extension. The file dumper should also allow users to
> specify the formats of the files they want, and it can be extended to
> support any criteria on the extracted files. A more advanced goal is to
> implement it with MapReduce.
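
The directory and naming scheme described in the issue above could be sketched
roughly as follows. This is only one possible approach under stated
assumptions: the output directory is derived from the MIME type, and the file
name is the SHA-1 hex digest of the URL, which is unique per URL for practical
purposes. The issue does not say how names are actually built, and the class
DumpPath and its methods are hypothetical. The sketch uses only the JDK and no
Nutch APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper; not part of Nutch or of the attached FileDumper.java.
public class DumpPath {

    // Builds "<type-directory>/<sha1-of-url>.<extension>".
    // The directory name is the lower-cased MIME type with every character
    // outside [a-z0-9.+-] (including '/') replaced by '_'.
    public static String forRecord(String url, String mimeType, String extension) {
        String dir = mimeType.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9.+-]", "_");
        return dir + "/" + sha1Hex(url) + "." + extension;
    }

    // Returns the SHA-1 digest of the text as 40 lower-case hex characters.
    static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Example URL and type are illustrative only.
        System.out.println(forRecord("http://example.com/a.pdf", "application/pdf", "pdf"));
    }
}
```

Hashing only the URL means the same URL fetched in two segments maps to the same
name; adding the fetch time or a content digest to the hashed text would be one
way to keep both copies.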