Chao Yan created NUTCH-1557:
-------------------------------

             Summary: File extraction and classification for any MIME types from segments
                 Key: NUTCH-1557
                 URL: https://issues.apache.org/jira/browse/NUTCH-1557
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.6
         Environment: Hardware: Intel Core i5 2.5 GHz, 4 GB memory. Software: Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
            Reporter: Chao Yan
            Priority: Minor


The basic idea is to implement a file dumper as a plugin that extracts files 
from Nutch SequenceFiles. The file dumper should detect each file's content 
type and dump the files into different directories based on that type. Each 
extracted file will be renamed based on information from its URL, metadata, 
and even content. File names should be globally unique and carry the correct 
file extension. The file dumper should also allow users to specify the formats 
of the files they want, and should be extensible so that arbitrary criteria 
can be applied to the extracted files. A more advanced goal is to implement it 
with MapReduce.
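
The naming and classification rules described above could be sketched roughly 
as follows. This is only an illustration under stated assumptions, not the 
proposed plugin: the class DumpPathSketch, its methods, and the small 
MIME-to-extension table are all hypothetical, and the sketch uses no Nutch or 
Hadoop API. It assumes that a SHA-1 hash of the URL is unique enough to serve 
as the file name, and that an empty filter set means "accept every type".

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides where a dumped file would go.
final class DumpPathSketch {

    // Tiny illustrative MIME-to-extension table; a real dumper would need a fuller one.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("text/html", "html");
        EXTENSIONS.put("application/pdf", "pdf");
        EXTENSIONS.put("image/jpeg", "jpg");
        EXTENSIONS.put("image/png", "png");
    }

    // Strip parameters such as "; charset=UTF-8" and lower-case the type.
    // Missing or empty types fall back to a generic binary type.
    static String normalize(String mimeType) {
        String base = (mimeType == null) ? "" : mimeType.split(";", 2)[0].trim();
        return base.isEmpty() ? "application/octet-stream" : base.toLowerCase(Locale.ROOT);
    }

    // User-specified format filter; an empty set is assumed to mean "accept all".
    static boolean accepts(Set<String> wantedTypes, String mimeType) {
        return wantedTypes.isEmpty() || wantedTypes.contains(normalize(mimeType));
    }

    // Hex-encoded SHA-1 of the URL, used here as the unique part of the file name.
    static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // <outDir>/<type with '/' replaced by '_'>/<sha1 of url>.<extension>
    static Path targetPath(Path outDir, String url, String mimeType) {
        String type = normalize(mimeType);
        String ext = EXTENSIONS.containsKey(type) ? EXTENSIONS.get(type) : "bin";
        return outDir.resolve(type.replace('/', '_')).resolve(sha1Hex(url) + "." + ext);
    }

    public static void main(String[] args) {
        Path p = targetPath(Paths.get("dump"), "http://example.com/report.pdf",
                "application/pdf");
        System.out.println(p);
    }
}
```

Hashing the URL keeps names unique without any shared state, which would 
also suit a later MapReduce version; the cost is that the names are no longer 
human-readable, so a real implementation might combine the hash with part of 
the original file name.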

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
