Gerard Bouchar created NUTCH-2563:
-------------------------------------
Summary: HTTP header spellchecking issues
Key: NUTCH-2563
URL: https://issues.apache.org/jira/browse/NUTCH-2563
Project: Nutch
Issue Type: Sub-task
Reporter: Gerard Bouchar
When reading HTTP headers, the SpellCheckedMetadata class computes, for each
header, a Levenshtein distance between its name and every known header in the
HttpHeaders interface. Not only is that slow, non-standard, and inconsistent
with browser behavior, it also causes bugs and prevents us from accessing the
real headers sent by the HTTP server.
* Example: [http://www.taz.de/!443358/]. The server sends a
*Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects
it to *Transfer-Encoding: chunked*. HttpResponse (in protocol-http) then tries
to read the HTTP body as chunked, although it is not.
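
To make the failure mode concrete, here is a minimal, self-contained sketch of distance-based header correction. It is not the SpellCheckedMetadata code: the class name HeaderSpellCheckSketch, the short KNOWN list, the normalization rule, and the maxDistance parameter are all assumptions made for illustration.

```java
import java.util.List;

class HeaderSpellCheckSketch {
    // Hypothetical subset of known header names; not the real HttpHeaders list.
    static final List<String> KNOWN = List.of(
        "Content-Type", "Content-Length", "Content-Encoding", "Transfer-Encoding");

    // Assumed normalization: lowercase, letters only.
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetter(c)) sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known header within maxDistance edits,
    // or the name unchanged when nothing is close enough.
    static String correct(String name, int maxDistance) {
        String searched = normalize(name);
        String best = name;
        int bestDistance = maxDistance + 1;
        for (String known : KNOWN) {
            int d = distance(searched, normalize(known));
            if (d < bestDistance) { bestDistance = d; best = known; }
        }
        return best;
    }

    public static void main(String[] args) {
        // A genuine typo, one edit away from Content-Type.
        System.out.println(correct("Conten-Type", 2));
        // A distinct header, six edits away: a permissive limit rewrites it.
        System.out.println(correct("Client-Transfer-Encoding", 6));
        // A strict limit leaves the header as the server sent it.
        System.out.println(correct("Client-Transfer-Encoding", 2));
    }
}
```

The normalized names differ only by the six-letter prefix "client", so any rule that tolerates six edits treats the two headers as the same one, which is the kind of rewrite described in the example above.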
I personally think that HTTP header spell checking is a bad idea and that this
logic should be removed completely. If it is kept, the threshold divider
(SpellCheckedMetadata.TRESHOLD_DIVIDER) should be higher (we internally set it
to 5 as a temporary fix for this issue).
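
The effect of raising the divider can be sketched with a little arithmetic. The rule below (accept a correction when the distance is strictly less than the normalized name length divided by the divider) is an assumption about how the constant is applied, not a quote of the SpellCheckedMetadata code; the divider value 3 is an illustrative lower setting, and ThresholdSketch is a hypothetical class name.

```java
class ThresholdSketch {
    // Assumed rule: threshold = normalized name length / divider (integer division).
    static int threshold(int normalizedLength, int divider) {
        return normalizedLength / divider;
    }

    // Assumed acceptance test: correct only when distance < threshold.
    static boolean wouldCorrect(int distance, int normalizedLength, int divider) {
        return distance < threshold(normalizedLength, divider);
    }

    public static void main(String[] args) {
        // "Client-Transfer-Encoding" with letters only, lowercased: 22 characters.
        int length = "clienttransferencoding".length();
        // Deleting the prefix "client" yields "transferencoding": 6 edits.
        int distance = "client".length();
        for (int divider : new int[] {3, 5}) {
            System.out.println("divider=" + divider
                + " threshold=" + threshold(length, divider)
                + " corrected=" + wouldCorrect(distance, length, divider));
        }
    }
}
```

Under these assumptions a divider of 3 gives a threshold of 7, so a distance of 6 is accepted, while a divider of 5 gives a threshold of 4 and the header is left alone.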
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)