Jérôme,
Thank you for the explanation.

        Here is an easy way to reproduce what I mean by content-type
        inconsistency:
        1. Crawl the following URL:
           http://jerome.charron.free.fr/nutch/fake.zip
           (fake.zip is a fake zip file; it is actually an HTML file)
        2. While crawling, you can see that the content-type returned by the
           server is application/zip.
        3. But you can also see that Nutch correctly guesses the content-type
           to be text/html (it uses the HtmlParser).
        4. At this step, all is OK.
        5. Then start your Tomcat and try the following search: zip
        6. You can see the fake.zip file in the results. Click on details; if
           the index-more plugin was activated, you can see that the stored
           content-type is application/zip and not text/html.
        
        What I suggest is simply to store the content-type that Nutch used
        to choose the parser, instead of the one returned by the server.
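
        To make the suggestion concrete, here is a minimal Java sketch of
        the idea. It is not Nutch code: the class ContentTypeSketch and the
        methods sniff and effectiveType are hypothetical names, and the
        detection covers only a ZIP signature and an HTML prefix. The point
        is that one resolved type is used both to pick the parser and to
        fill the indexed field, so the two cannot disagree.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Nutch.
class ContentTypeSketch {

    // Guess a media type from the first bytes; returns null if unknown.
    static String sniff(byte[] head) {
        // ZIP archives start with the magic bytes 'P' 'K' 0x03 0x04.
        if (head.length >= 4 && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4) {
            return "application/zip";
        }
        String text = new String(head, StandardCharsets.ISO_8859_1)
                .trim().toLowerCase(Locale.ROOT);
        if (text.startsWith("<!doctype html") || text.startsWith("<html")) {
            return "text/html";
        }
        return null;
    }

    // Prefer the sniffed type; fall back to the server-declared one.
    static String effectiveType(String declared, byte[] head) {
        String guessed = sniff(head);
        return guessed != null ? guessed : declared;
    }

    public static void main(String[] args) {
        byte[] fakeZip = "<html><body>not a zip</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        // Resolve once, then reuse for both parser selection and indexing.
        String type = effectiveType("application/zip", fakeZip);
        System.out.println("parse as: " + type);
        System.out.println("index as: " + type);
    }
}
```

        With the fake.zip case from the steps above, the server-declared
        application/zip would be overridden by the sniffed text/html in
        both places.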

I'm not sure that is the right thing to do.
If the site administrator did a poor job and the wrong media type is
advertised, it's the site's problem and Nutch shouldn't be fixing it, in my
opinion. Those sites would not work properly with browsers anyway, and Nutch
doesn't need to work properly with them, except that it should protect itself
from crashing. I tried to visit your fake.zip page with IE and Firefox, and
both faithfully trusted the media type advertised by the server and asked me
whether I wanted to open it with WinZip or save it; there was no option to
open it as HTML.
Why should Nutch treat it as HTML? Sorry, but I don't see any practical value
here.

-kuro


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
