Hudson commented on NUTCH-2563:

SUCCESS: Integrated in Jenkins build Nutch-trunk #3534 (See 
NUTCH-2563 HTTP header spellchecking issues ("Client-Transfer-Encoding" 
* (edit) src/java/org/apache/nutch/metadata/HttpHeaders.java
* (edit) 
* (edit) src/java/org/apache/nutch/metadata/SpellCheckedMetadata.java

> HTTP header spellchecking issues
> --------------------------------
>                 Key: NUTCH-2563
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2563
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
> When reading HTTP headers, for each header, the 
> SpellCheckedMetadata class computes a Levenshtein distance between it and 
> every known header in the HttpHeaders interface. Not only is that slow, 
> non-standard, and inconsistent with browsers' behavior, but it also causes bugs 
> and prevents us from accessing the real headers sent by the HTTP 
> server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a 
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
> it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) 
> tries to read the HTTP body as chunked, although it is not.
> I personally think that HTTP header spell checking is a bad 
> idea, and that this logic should be completely removed. But if it were to be 
> kept, the threshold (SpellCheckedMetadata.TRESHOLD_DIVIDER) should be higher 
> (we internally set it to 5 as a temporary fix for this issue).
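
The behavior described in the report can be illustrated with a small, self-contained sketch. This is not the Nutch source: the class name HeaderSpellCheckSketch, the short KNOWN list, and the rule "accept a known header when the edit distance is strictly below normalized length / divider" are assumptions made for illustration. Under those assumptions, a divider of 3 rewrites Client-Transfer-Encoding, while a divider of 5 leaves it alone.

```java
// Hypothetical sketch of Levenshtein-based header "spell checking".
// The class name, KNOWN list, and threshold rule are assumptions for illustration only.
final class HeaderSpellCheckSketch {

    // A small stand-in for the set of known header names.
    static final String[] KNOWN = {
        "Content-Encoding", "Content-Length", "Content-Type", "Transfer-Encoding"
    };

    // Lowercase the name and keep letters only, so "Content-Type" becomes "contenttype".
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Return a known header name if one is "close enough", else the name unchanged.
    // Assumed rule: threshold = normalized length / divider (integer division).
    static String correct(String name, int divider) {
        String searched = normalize(name);
        for (String known : KNOWN) {
            if (normalize(known).equals(searched)) {
                return known; // exact match after normalization
            }
        }
        int threshold = searched.length() / divider;
        for (String known : KNOWN) {
            if (levenshtein(searched, normalize(known)) < threshold) {
                return known; // fuzzy match: this is where real headers get rewritten
            }
        }
        return name;
    }

    public static void main(String[] args) {
        // Normalized length 22, distance to "transferencoding" is 6.
        System.out.println(correct("Client-Transfer-Encoding", 3)); // 6 < 22/3 = 7
        System.out.println(correct("Client-Transfer-Encoding", 5)); // 6 >= 22/5 = 4
    }
}
```

In this sketch the first call prints Transfer-Encoding and the second prints Client-Transfer-Encoding, which is consistent with the reporter's observation that raising the divider to 5 works around the problem. A larger divider only shrinks the window for false corrections; it does not eliminate them.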

This message was sent by Atlassian JIRA
