[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ]
Jerome Charron updated NUTCH-33:
--------------------------------
    Attachment: NUTCH-33.patch
                mime-types.tar.gz

The attachments contain:

1. mime-types.tar.gz
This file contains:
* the Java source code for a MIME content type / extension / magic bytes mapper.
* the Java source code for the unit tests, plus some test files (gif, png, html, ...) for the magic detection tests.
* the conf file (mime-types.xml) that contains all these mappings (it includes those that were in the "old" mime.types file).
* a DTD that describes the schema of mime-types.xml.

2. NUTCH-33.patch
This file contains:
* Modifications to conf/nutch-default.xml (removal of mime.magic.file, modification of mime.types.file to use the new mime-types.xml file).
* Patch for plugin protocol-ftp: it now uses the code provided in mime-types.tar.gz instead of the activation code.
* Patch for plugin protocol-file: it now uses the code provided in mime-types.tar.gz instead of the activation code.
* Patch for plugin index-more: it now uses the code provided in mime-types.tar.gz but still uses the activation code too.
* Patch for the plugins build.xml to activate the build of protocol-ftp and protocol-file (there is no longer a license issue with activation).

Status:
I successfully built the whole contribution from the nutch trunk.
I successfully ran the unit tests.
I successfully ran functional tests.

Todo:
* Since activation is used in the index-more plugin only for extracting the primary type and subtype from a full content type string, I plan to provide this feature in the mime-types code.
* For other TODOs, see the TODO comments in the code.

> MIME content type detector (using magic char sequences)
> -------------------------------------------------------
>
>          Key: NUTCH-33
>          URL: http://issues.apache.org/jira/browse/NUTCH-33
>      Project: Nutch
>         Type: New Feature
>     Reporter: Jerome Charron
>     Priority: Minor
>  Attachments: NUTCH-33.patch, mime-types.tar.gz
>
> Extension based content-type detector is not sufficient in some cases.
> The solution is to add a content type detector based on some magic char
> sequences like in apache httpd for instance.
> (Note: I created this issue only to keep a trace, but I'm currently working
> on it)
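
The approach described in the issue (look at magic char sequences at the start of the content, because the file extension alone is not always enough) can be sketched roughly as below. This is only an illustration and is not the code shipped in mime-types.tar.gz: the class name MimeSniffSketch, its method names, the small hard-coded tables, and the choice to try magic bytes before the extension are all assumptions made for the example. The real contribution reads its mappings from mime-types.xml.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and tables are assumptions, not the NUTCH-33 API.
final class MimeSniffSketch {

    // Magic byte prefixes, checked in insertion order.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Extension fallback table (lower-case extension -> content type).
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();

    static {
        MAGIC.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'});
        MAGIC.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        EXTENSIONS.put("html", "text/html");
        EXTENSIONS.put("txt", "text/plain");
        EXTENSIONS.put("gif", "image/gif");
    }

    // True if data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Assumed precedence: magic bytes first, then extension, then a default.
    static String detect(String fileName, byte[] head) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0 && dot < fileName.length() - 1) {
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String type = EXTENSIONS.get(ext);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] gifHead = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        // A GIF saved under a misleading name: the magic bytes decide.
        System.out.println(detect("picture.txt", gifHead));
        // No usable content: the extension decides.
        System.out.println(detect("index.html", new byte[0]));
        // Neither matches: generic binary type.
        System.out.println(detect("unknown.bin", new byte[0]));
    }
}
```

Only the first few bytes of the content are needed for this kind of check, so a caller could pass a short prefix of the fetched data rather than the whole document.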