[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]
Stefan Groschupf updated NUTCH-135:
-----------------------------------
Attachment: contentProperties_patch.txt
As Doug suggested, here is a patch that uses a TreeMap with
String.CASE_INSENSITIVE_ORDER to solve the problem of case-insensitive HTTP
headers and, more generally, case-insensitive content metadata.
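
A minimal sketch of the idea, not taken from the attached patch: a
java.util.Properties lookup is case sensitive, while a TreeMap built with
String.CASE_INSENSITIVE_ORDER treats "content-type" and "Content-Type" as the
same key. The class name CaseInsensitiveHeaders and the helper newHeaderMap
are made up for illustration, and the generics syntax assumes a newer Java
than Nutch targeted at the time.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class CaseInsensitiveHeaders {

    // Hypothetical helper: a map whose keys are compared without regard to case.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        // Properties is backed by a Hashtable, so keys must match exactly.
        Properties props = new Properties();
        props.setProperty("content-type", "text/html");
        System.out.println(props.getProperty("Content-Type")); // prints "null"

        // The TreeMap comparator ignores case when locating the key.
        Map<String, String> headers = newHeaderMap();
        headers.put("content-type", "text/html");
        System.out.println(headers.get("Content-Type")); // prints "text/html"
    }
}
```

Note that putting a value under a key that differs only in case replaces the
existing value rather than adding a second entry.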
In general I see two ways to solve the problem. The first is to leave the API
as it is and extend the Properties object, overriding its methods to use a
TreeMap behind the scenes. This solution would also require copying data back
and forth between the Properties object and the TreeMap several times, since
the Nutch code uses a Properties object in the Content constructor. The other
choice is to change the API of the Content object so that it cleanly documents
that another object, one that behaves differently from the Properties object,
is used. The downside of this solution is that it requires many small
changes in the Nutch code base.
However, I decided on the clean way, the second one, since I don't like code
that does things behind the scenes that developers would not expect. So I
introduced a tiny ContentProperties object and changed the Content constructor
to use the ContentProperties object instead of the java.util.Properties object.
The new ContentProperties has an API similar to the Properties class but uses
case-insensitive keys. I changed all classes that use the Content object to use
the new ContentProperties at object instantiation, and I also extended the
Content test case to test that case-insensitive keys are now supported.
Feel free to suggest constructive improvements, but please also let us get
this done as soon as possible, since from my point of view this is a critical
issue. All test cases pass on my box, but please double-check before committing.
> http header meta data are case insensitive in the real world (e.g.
> Content-Type or content-type)
> ------------------------------------------------------------------------------------------------
>
> Key: NUTCH-135
> URL: http://issues.apache.org/jira/browse/NUTCH-135
> Project: Nutch
> Type: Bug
> Components: fetcher
> Versions: 0.7.1, 0.7
> Reporter: Stefan Groschupf
> Priority: Critical
> Fix For: 0.8-dev, 0.7.2-dev
> Attachments: contentProperties_patch.txt
>
> As described in issue NUTCH-133, some web servers return HTTP header metadata
> keys whose case does not conform to the standard.
> This has many negative side effects. For example, querying the content type
> from the metadata returns null even when the web server returns a content
> type, if the key does not use the standard case, e.g. is lower case. This also
> affects the PDF parser, which queries the content length, etc.