[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ]
Jerome Charron updated NUTCH-33:
--------------------------------
Attachment: NUTCH-33.patch
mime-types.tar.gz
The attachments contain:
1. mime-types.tar.gz
This file contains:
* the Java source code for a MIME content type / extension / magic bytes mapper.
* the Java source code for unit tests, plus some test files (gif, png, html, ...)
for the magic detection tests
* the conf file (mime-types.xml) that contains all these mappings (it includes
those that were in the "old" mime.types file)
* a DTD that describes the structure of mime-types.xml
2. NUTCH-33.patch
This file contains:
* Modifications to conf/nutch-default.xml (removal of mime.magic.file,
modification of mime.types.file to use the new mime-types.xml file)
* Patch for plugin protocol-ftp: It now uses the code provided in
mime-types.tar.gz instead of activation code.
* Patch for plugin protocol-file: It now uses the code provided in
mime-types.tar.gz instead of activation code.
* Patch for plugin index-more: It now uses the code provided in
mime-types.tar.gz but still uses activation code too.
* Patch for the plugins build.xml to activate the build of protocol-ftp and
protocol-file (there is no longer a license issue with activation).
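To illustrate the idea behind the mapper, here is a minimal Java sketch of a detector that checks magic bytes first and falls back to the file extension. It is not the code shipped in mime-types.tar.gz: the class name MimeSketch, the method detect, the small hard-coded tables, and the "magic first, extension second" order are all assumptions made for this example. The real mappings live in mime-types.xml.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the classes from mime-types.tar.gz. */
final class MimeSketch {

    /** One magic rule: content starting with 'prefix' has type 'type'. */
    private static final class Magic {
        final byte[] prefix;
        final String type;
        Magic(byte[] prefix, String type) {
            this.prefix = prefix;
            this.type = type;
        }
    }

    // Illustrative subset of rules, hard-coded instead of read from a conf file.
    private static final List<Magic> MAGICS = new ArrayList<>();
    private static final Map<String, String> EXTENSIONS = new HashMap<>();

    static {
        MAGICS.add(new Magic("GIF87a".getBytes(StandardCharsets.US_ASCII), "image/gif"));
        MAGICS.add(new Magic("GIF89a".getBytes(StandardCharsets.US_ASCII), "image/gif"));
        MAGICS.add(new Magic(
            new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}, "image/png"));
        EXTENSIONS.put("gif", "image/gif");
        EXTENSIONS.put("png", "image/png");
        EXTENSIONS.put("html", "text/html");
    }

    /** Returns true if 'data' begins with all bytes of 'prefix'. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Tries magic bytes first, then the extension, then a generic default. */
    static String detect(String fileName, byte[] head) {
        byte[] data = (head == null) ? new byte[0] : head;
        for (Magic m : MAGICS) {
            if (startsWith(data, m.prefix)) {
                return m.type;
            }
        }
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0 && dot < fileName.length() - 1) {
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String type = EXTENSIONS.get(ext);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        // A GIF stored under a misleading name is still recognized by its content.
        System.out.println(detect("picture.html", gif));
        System.out.println(detect("index.html", new byte[0]));
    }
}
```

Giving the content priority over the name is one possible policy; a detector could just as well use magic bytes only when the extension is missing or unknown.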
Status:
I successfully built this whole contribution from the nutch trunk.
I successfully ran the unit tests.
I successfully ran functional tests.
Todo:
* Since activation is used in the index-more plugin only for extracting the
primary type and subtype from a full content type string, I plan to provide this
feature in the mime-types code.
* For other TODOs, see the TODO comments in the code.
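As an illustration of the first Todo item, here is a minimal Java sketch that splits a full content type string into its primary type and subtype. The class name ContentTypeSketch and its parse method are hypothetical; this is not the planned mime-types code, and it simply ignores any parameters after the first ';'.

```java
import java.util.Locale;

/** Hypothetical sketch; not the planned mime-types implementation. */
final class ContentTypeSketch {
    final String primary;
    final String sub;

    private ContentTypeSketch(String primary, String sub) {
        this.primary = primary;
        this.sub = sub;
    }

    /** Parses strings such as "text/html; charset=UTF-8" into ("text", "html"). */
    static ContentTypeSketch parse(String contentType) {
        if (contentType == null) {
            throw new IllegalArgumentException("Content type is null");
        }
        // Drop parameters such as "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String pair = (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();
        int slash = pair.indexOf('/');
        // Reject input with no slash, or with nothing before or after it.
        if (slash <= 0 || slash == pair.length() - 1) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + contentType);
        }
        String primary = pair.substring(0, slash).trim().toLowerCase(Locale.ROOT);
        String sub = pair.substring(slash + 1).trim().toLowerCase(Locale.ROOT);
        return new ContentTypeSketch(primary, sub);
    }

    public static void main(String[] args) {
        ContentTypeSketch t = parse("text/html; charset=UTF-8");
        System.out.println(t.primary + " " + t.sub);
    }
}
```

Types are lower-cased here because MIME type names are case-insensitive; whether to reject malformed input or fall back to a default type is a design choice left open in this sketch.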
> MIME content type detector (using magic char sequences)
> -------------------------------------------------------
>
> Key: NUTCH-33
> URL: http://issues.apache.org/jira/browse/NUTCH-33
> Project: Nutch
> Type: New Feature
> Reporter: Jerome Charron
> Priority: Minor
> Attachments: NUTCH-33.patch, mime-types.tar.gz
>
> An extension-based content-type detector is not sufficient in some cases.
> The solution is to add a content type detector based on magic char
> sequences, as Apache httpd does, for instance.
> (Note: I created this issue only to keep a record, but I'm currently working
> on it)