RE: [Nutch-dev] Fetch / Parse errors and a Bug

Xin-Yi Liu Thu, 30 Dec 2004 14:56:01 -0800

I believe that the String.CASE_INSENSITIVE_ORDER
comparator applies only to the TreeMap itself, where it
governs both ordering and lookups.  It has no effect once
the entries are copied into the Properties object, so
headers.get(key) would still be case sensitive.


Perhaps subclassing Properties to make all gets and
puts case insensitive is the best solution.
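
A minimal sketch of that idea follows. It assumes a hypothetical
class name, CaseInsensitiveProperties, which is not part of the
Nutch source, and it simply folds String keys to lower case on the
way in and out. putAll() and getProperty() are overridden as well,
because the stock implementations may not route through the
overridden put() and get().

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper (not in Nutch): Properties whose String keys ignore case. */
class CaseInsensitiveProperties extends Properties {
  private static final long serialVersionUID = 1L;

  // Fold String keys to lower case so "Content-Type" and "content-type" collide.
  private static Object fold(Object key) {
    return (key instanceof String) ? ((String) key).toLowerCase(Locale.ROOT) : key;
  }

  @Override
  public synchronized Object put(Object key, Object value) {
    return super.put(fold(key), value);
  }

  @Override
  public synchronized Object get(Object key) {
    return super.get(fold(key));
  }

  // Route every entry through put() so copied keys are folded too.
  @Override
  public synchronized void putAll(Map<? extends Object, ? extends Object> m) {
    for (Map.Entry<?, ?> e : m.entrySet()) {
      put(e.getKey(), e.getValue());
    }
  }

  // Simplification: this sketch ignores any default Properties list.
  @Override
  public String getProperty(String key) {
    Object value = get(key);
    return (value instanceof String) ? (String) value : null;
  }
}

class HeaderLookupDemo {
  public static void main(String[] args) {
    TreeMap<String, String> parsed = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    parsed.put("content-type", "application/pdf");

    Properties plain = new Properties();
    plain.putAll(parsed);
    Properties folded = new CaseInsensitiveProperties();
    folded.putAll(parsed);

    System.out.println(plain.get("Content-Type"));   // prints null
    System.out.println(folded.get("Content-Type"));  // prints application/pdf
  }
}
```

One trade-off of this approach is that the stored keys are lower
case, so anything that later enumerates the headers would see
"content-type" rather than the spelling the server sent. The sketch
also leaves remove() and containsKey() untouched.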

--- Sven Wende <[EMAIL PROTECTED]> wrote:

> I think it's only case insensitive in the TreeMap that
> parseHeaders() produces!
> 
> But in the constructor this map is copied into a
> Properties object.
> 
> ********************************************
>       // parse headers
>       headers.putAll(parseHeaders(in, line));
> ********************************************
> 
> Look at the following snippet, which does the same
> thing as
> HttpResponse.class does:
> 
> ********************************************
>     public static void main(String[] args) {
>         TreeMap headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
>         headers.put("content-type", "text");
> 
>         Properties headers2 = new Properties();
>         headers2.putAll(headers);
> 
>         System.out.println(headers.get("Content-Type"));  // = "text"
>         System.out.println(headers2.get("Content-Type")); // = null
>     }
> ********************************************
> 
> You can use the following url for your tests:
>       
>       http://www.verdi.de/0x0ac80f2b_0x0069a759
> 
> It is a PDF file and the server sends "Content-type:
> application/pdf" !
> 
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > On Behalf Of Chirag Chaman
> > Sent: Wednesday, 29 December 2004 16:22
> > To: [EMAIL PROTECTED]
> > Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> > 
> > That is strange, because I would expect it to be case
> > insensitive, but then again I have not tested it; I am just
> > looking at the code.
> > 
> > You see how the TreeMap is initialized with
> > String.CASE_INSENSITIVE_ORDER:
> > 
> > private Map parseHeaders(PushbackInputStream in, StringBuffer line)
> >     throws IOException, HttpException {
> >     TreeMap headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
> >     return parseHeaders(in, line, headers);
> > }
> > 
> > So I would imagine that a lookup for Content-Type is case
> > insensitive as well.
> > 
> > 
> > Can you send me the link to a page that has this
> problem -- 
> > I'll run some tests to see what's causing this.
> > 
> > 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > On Behalf Of Sven Wende
> > Sent: Wednesday, December 29, 2004 9:16 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: [Nutch-dev] Fetch / Parse errors and a Bug
> > 
> > Chirag:
> > 
> > > I looked at where you mention that the content type is being
> > > looked up and is Case Sensitive -- that is not correct. The HTTP
> > > protocol is adding the Content-type to the TreeMap which is
> > > initialized with the String.CASE_INSENSITIVE_ORDER comparator.
> > > Thus it internally will do a case-insensitive match.
> > 
> > Which code do you refer to?
> > 
> > I described a problem in the protocol-http plugin. Just take
> > a look at the following code snippet from the CVS. As you can
> > see, the headers are read in and stored in a simple Hashtable.
> > The problem with case sensitive headers for content-type occurs,
> > for example, in the toContent() method.
> > 
> > **************************************************************
> > package net.nutch.protocol.http;
> > 
> > /** An HTTP response. */
> > 
> > public class HttpResponse {
> >   private Properties headers = new Properties();
> > 
> >   /** Returns the value of a named header. */
> >   public String getHeader(String name) {
> >     return (String)headers.get(name);
> >   }
> > 
> >   public Content toContent() {
> >     String contentType = getHeader("Content-Type");
> >     if (contentType == null)
> >       contentType = "";
> >     return new Content(orig, base, content, contentType, headers);
> >   }
> > 
> >   private void processHeaderLine(StringBuffer line, TreeMap headers)
> >     throws IOException, HttpException {
> >     int colonIndex = line.indexOf(":");       // key is up to colon
> >     if (colonIndex == -1) {
> >       int i;
> >       for (i= 0; i < line.length(); i++)
> >         if (!Character.isWhitespace(line.charAt(i)))
> >           break;
> >       if (i == line.length())
> >         return;
> >       throw new HttpException("No colon in header:" + line);
> >     }
> >     String key = line.substring(0, colonIndex);
> > 
> >     int valueStart = colonIndex+1;            // skip whitespace
> >     while (valueStart < line.length()) {
> >       int c = line.charAt(valueStart);
> >       if (c != ' ' && c != '\t')
> >         break;
> >       valueStart++;
> >     }
> >     String value = line.substring(valueStart);
> > 
> >     headers.put(key, value);
> >   }
> > }
> > **************************************************************
> > 
> > > I think the problem is that no "content-type" was ever on the
> > > page -- this leaves both the content type and the
> > > extension/suffix to be blank and that causes a problem. Also, if
> > > a character-set is also not specified then the fetcher fails as
> > > well (as it cannot write to disk).
> > 
> > I tested it and there was a "content-type" header. If its
> > name was "Content-Type", everything was ok, but if its name
> > was "content-type", Nutch internally loses the information
> > about the content-type through the code above.
> > 
> 
=== message truncated ===



                