Chao Yan created NUTCH-1557:
-------------------------------
Summary: File extraction and classification for any MIME types
from segments
Key: NUTCH-1557
URL: https://issues.apache.org/jira/browse/NUTCH-1557
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software:
Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
Reporter: Chao Yan
Priority: Minor
The basic idea is to implement a file dumper as a plugin that extracts files
from Nutch SequenceFiles. The file dumper should detect each file's content
type and write the file into a directory chosen by that type. Each extracted
file will be renamed using information from its URL, metadata, and even its
content. File names should be globally unique and carry the correct file
extension. The file dumper should also let users specify which file formats
they want, and should be extensible to support arbitrary criteria on the
extracted files. A more advanced goal is to implement it with MapReduce.
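
As a rough illustration of the classification and naming steps described
above, here is a minimal Java sketch. It is not Nutch code: the class
FileDumperSketch, its method names, the small extension table, and the choice
of a truncated SHA-256 hash of the URL as the uniqueness suffix are all
assumptions made for this example. Reading records from SequenceFiles and the
MIME detection itself are left out; the sketch starts from a URL and an
already-detected content type.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class FileDumperSketch {
    // Hypothetical MIME type -> extension table; a real tool would need a fuller registry.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("text/html", "html");
        EXTENSIONS.put("application/pdf", "pdf");
        EXTENSIONS.put("image/jpeg", "jpg");
        EXTENSIONS.put("image/png", "png");
    }

    /** Drops parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String normalize(String contentType) {
        if (contentType == null) return "application/octet-stream";
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? "application/octet-stream" : base;
    }

    /** One directory per content type, e.g. "application/pdf" becomes "application_pdf". */
    static String directoryFor(String contentType) {
        return normalize(contentType).replaceAll("[^a-z0-9.+-]", "_");
    }

    /** User filter: an empty set of wanted types means "accept everything". */
    static boolean accepts(String contentType, Set<String> wantedTypes) {
        return wantedTypes.isEmpty() || wantedTypes.contains(normalize(contentType));
    }

    /** Readable stem from the URL + hash of the full URL + extension from the type. */
    static String fileNameFor(String url, String contentType) {
        String path = url.split("[?#]", 2)[0];                 // ignore query and fragment
        String stem = path.substring(path.lastIndexOf('/') + 1);
        int dot = stem.lastIndexOf('.');
        if (dot > 0) stem = stem.substring(0, dot);            // drop any existing extension
        stem = stem.replaceAll("[^A-Za-z0-9_-]", "_");
        if (stem.isEmpty()) stem = "index";
        String ext = EXTENSIONS.getOrDefault(normalize(contentType), "bin");
        // Truncated hash: distinct URLs collide only with very small probability.
        return stem + "-" + sha256Hex(url).substring(0, 16) + "." + ext;
    }

    static Path targetPath(String outDir, String url, String contentType) {
        return Paths.get(outDir, directoryFor(contentType), fileNameFor(url, contentType));
    }

    private static String sha256Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>();
        wanted.add("application/pdf");
        String url = "http://example.com/docs/report.pdf?x=1";
        String type = "application/pdf; charset=UTF-8";
        if (accepts(type, wanted)) {
            System.out.println(targetPath("dump", url, type));
        }
    }
}
```

Deriving the suffix from the URL keeps names stable across repeated dumps of
the same segment. Hashing the content bytes instead would be another option if
identical files fetched from different URLs should collapse into one name.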
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira