[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Stefan Groschupf updated NUTCH-135:
-----------------------------------

    Attachment: contentProperties_patch.txt

As Doug suggested, here is a patch that uses a TreeMap with 
String.CASE_INSENSITIVE_ORDER to solve the problem of case-insensitive HTTP 
headers, and of case-insensitive content metadata in general. 
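
For illustration only, here is a minimal sketch of how a TreeMap built with 
String.CASE_INSENSITIVE_ORDER treats keys. It is not taken from the attached 
patch; the class name CaseInsensitiveHeaders and the helper newHeaderMap are 
hypothetical, and the sketch assumes a Java version with generics and the 
diamond operator.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of Nutch or of the attached patch.
public class CaseInsensitiveHeaders {

    // Returns a map whose String keys are compared without regard to case.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();

        // A server that sends the header name in lower case.
        headers.put("content-type", "text/html");

        // The lookup succeeds even though the case differs.
        System.out.println(headers.get("Content-Type")); // text/html

        // A second put with another spelling replaces the value; no new entry is added.
        headers.put("CONTENT-TYPE", "application/pdf");
        System.out.println(headers.size()); // 1
    }
}
```

Note that the map keeps the spelling of the key that was inserted first, so 
iterating over the keys shows whatever case the server originally sent.
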
In general I see two different ways to solve the problem. The first is to leave 
the API as it is and extend the Properties object, overriding its methods to use 
a TreeMap behind the scenes. This solution would also require copying data back 
and forth between the Properties object and the TreeMap several times, since the 
Nutch code passes a Properties object to the Content constructor. The other 
choice is to change the API of the Content object so that it clearly documents 
that a different object, one that behaves differently from the Properties 
object, is used. The drawback of this solution is that it requires many small 
changes across the Nutch code base. 
However, I decided on the clean way, the second option, since I don't like code 
that does things behind the scenes that developers would not expect. So I 
introduced a tiny ContentProperties object and changed the Content constructor 
to use the ContentProperties object instead of the java.util.Properties object. 
The new ContentProperties has an API similar to that of the Properties class but 
uses case-insensitive keys. I changed all classes that use the Content object to 
use the new ContentProperties up to the point of object instantiation, and I 
also extended the Content test case to test that case-insensitive keys are now 
supported. 
Feel free to suggest constructive improvements, but please also let us get 
this done as soon as possible, since from my point of view this is a critical 
issue. All test cases pass on my box, but please double-check before committing.
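
As a rough sketch of the second approach described above (a small wrapper with 
a Properties-like API and case-insensitive keys), something along these lines 
could work. This is not the code from contentProperties_patch.txt; the class 
name ContentPropertiesSketch and every method on it are assumptions made for 
illustration only.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; the real ContentProperties in the patch may differ.
public class ContentPropertiesSketch {

    // Keys are ordered and matched ignoring case.
    private final Map<String, String> map =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Mirrors Properties.setProperty, but the key matches in any case.
    public void setProperty(String key, String value) {
        map.put(key, value);
    }

    // Mirrors Properties.getProperty(String); returns null if absent.
    public String getProperty(String key) {
        return map.get(key);
    }

    // Mirrors Properties.getProperty(String, String).
    public String getProperty(String key, String defaultValue) {
        String value = map.get(key);
        return value != null ? value : defaultValue;
    }

    // Exposes the stored key names in their original spelling.
    public Set<String> propertyNames() {
        return map.keySet();
    }

    public static void main(String[] args) {
        ContentPropertiesSketch metadata = new ContentPropertiesSketch();
        metadata.setProperty("content-length", "1024");
        System.out.println(metadata.getProperty("Content-Length")); // 1024
        System.out.println(metadata.getProperty("Content-Type", "unknown")); // unknown
    }
}
```

Because the wrapper does not extend java.util.Properties, callers can see from 
the type alone that key lookup behaves differently, which is the point of the 
API change argued for above.
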

> http header meta data are case insensitive in the real world (e.g. 
> Content-Type or content-type)
> ------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-135
>          URL: http://issues.apache.org/jira/browse/NUTCH-135
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7.1, 0.7
>     Reporter: Stefan Groschupf
>     Priority: Critical
>      Fix For: 0.8-dev, 0.7.2-dev
>  Attachments: contentProperties_patch.txt
>
> As described in issue NUTCH-133, some web servers return HTTP header metadata 
> whose key names do not use the standard capitalization.
> This has many negative side effects. For example, querying the content type 
> from the metadata returns null even when the web server does return a content 
> type, because the key is not in the standard form, e.g. it is lower case. This 
> also affects the PDF parser, which queries the content length, etc.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
