[ http://issues.apache.org/jira/browse/NUTCH-33?page=history ]

Jerome Charron updated NUTCH-33:
--------------------------------

    Attachment: NUTCH-33.patch
                mime-types.tar.gz

The attachments contain:

1. mime-types.tar.gz

This file contains:
* the Java source code for a MIME content type / extension / magic bytes mapper.
* the Java source code for the unit tests, plus some test files (gif, png, html, ...) 
for the magic detection tests
* the conf file (mime-types.xml) that contains all these mappings (it includes 
those that were in the "old" mime.types file)
* a DTD that describes the schema of mime-types.xml

2. NUTCH-33.patch

This file contains:
* Modifications on conf/nutch-default.xml (removal of mime.magic.file, 
modification of mime.types.file to use the new mime-types.xml file)
* Patch for plugin protocol-ftp: It now uses the code provided in 
mime-types.tar.gz instead of activation code.
* Patch for plugin protocol-file: It now uses the code provided in 
mime-types.tar.gz instead of activation code.
* Patch for plugin index-more: It now uses the code provided in 
mime-types.tar.gz but still uses activation code too.
* Patch for the plugins' build.xml to enable the build of protocol-ftp and 
protocol-file (there is no longer a license issue with activation).

Status:
I successfully built this whole contribution from the nutch trunk.
I successfully ran the unit tests.
I successfully ran the functional tests.

Todo:
* Since activation is used in the index-more plugin only for extracting the primary 
type and subtype from a full content type string, I plan to provide this feature 
in the mime-types code.
* For other TODOs, see the TODO comments in the code.
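
To illustrate the first item, here is a minimal sketch of splitting a full 
content type string into its primary type and subtype. The class and method 
names are hypothetical and are not taken from the attached mime-types code; 
the sketch assumes that parameters such as "charset" can simply be dropped.

```java
import java.util.Locale;

public class ContentTypeParts {

    // Hypothetical helper, not part of the attached mime-types code.
    // Returns { primaryType, subType } for a string like "text/html; charset=UTF-8".
    static String[] split(String contentType) {
        if (contentType == null) {
            throw new IllegalArgumentException("content type is null");
        }
        // Assumption: parameters after ';' are not needed, so drop them.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        int slash = base.indexOf('/');
        // Reject strings with no '/', an empty primary type, or an empty subtype.
        if (slash <= 0 || slash == base.length() - 1) {
            throw new IllegalArgumentException(
                    "not a type/subtype string: " + contentType);
        }
        return new String[] { base.substring(0, slash), base.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("text/html; charset=UTF-8");
        System.out.println("primary=" + parts[0] + ", sub=" + parts[1]);
    }
}
```

With something like this in the mime-types code, index-more would no longer 
need activation for that one task.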


> MIME content type detector (using magic char sequences)
> -------------------------------------------------------
>
>          Key: NUTCH-33
>          URL: http://issues.apache.org/jira/browse/NUTCH-33
>      Project: Nutch
>         Type: New Feature
>     Reporter: Jerome Charron
>     Priority: Minor
>  Attachments: NUTCH-33.patch, mime-types.tar.gz
>
> An extension-based content-type detector is not sufficient in some cases.
> The solution is to add a content type detector based on magic char 
> sequences, as Apache httpd does, for instance.
> (Note: I created this issue only to keep a trace, but I'm currently working 
> on it)
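
For readers unfamiliar with the idea in the issue description, magic-based 
detection can be sketched roughly as below. This is only an illustration, not 
the code in mime-types.tar.gz: the class name, the method names, and the small 
hard-coded table are assumptions, and the real mappings live in mime-types.xml.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicSniffer {

    // Illustrative table only (assumption); the attached code reads its
    // mappings from mime-types.xml instead of hard-coding them.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] { (byte) 0x89, 'P', 'N', 'G' });
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
    }

    // Returns the first content type whose magic sequence matches the start
    // of the data, or null when nothing matches.
    static String detect(byte[] data) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(data, entry.getValue())) {
                return entry.getKey();
            }
        }
        return null;
    }

    // True when data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could fall back to the extension-based lookup whenever detect() 
returns null, so the two approaches complement each other.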

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
