[ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2549:
-----------------------------------
    Fix Version/s: 1.15

> protocol-http does not behave the same as browsers
> --------------------------------------------------
>
>                 Key: NUTCH-2549
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2549
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>         Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing
> the HTTP protocol):
> * It fails if a URL's path does not start with '/'.
> ** Example: [http://news.fx678.com?171] (browsers correctly rewrite the URL
> as [http://news.fx678.com/?171], while Nutch tries to send an invalid HTTP
> request starting with *GET ?171 HTTP/1.0*).
> * It advertises its requests as HTTP/1.0 but sends an _Accept-Encoding_
> request header, which is defined only in HTTP/1.1. This confuses some web
> servers.
> ** Example:
> [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
> * If a server sends a redirection (3XX status code with a Location header),
> protocol-http tries to parse the HTTP response body anyway. Thus, if an
> error occurs while decoding the body, the redirection is not followed and
> the information is lost. Browsers follow the redirection and close the
> socket as soon as they can.
> ** Example: [http://www.webarcelona.net/es/blog?page=2]
> * Some servers invalidly send an HTTP body directly, without a status line
> or headers. Browsers handle that; protocol-http does not.
> ** Example: [https://app.unitymedia.de/]
> * Some servers invalidly add a colon after the HTTP status code in the
> status line (for instance, they send _HTTP/1.1 404: Not found_ instead of
> _HTTP/1.1 404 Not found_). Browsers can handle that.
> * Some servers invalidly send headers that span multiple lines. In that
> case, browsers simply ignore the subsequent lines, but protocol-http throws
> an error, thus preventing us from fetching the contents of the page.
> * There is no limit on the size of the HTTP headers it reads. A bogus
> server could send an infinite stream of different HTTP headers and cause
> the fetcher to run out of memory, or send the same HTTP header repeatedly
> and cause the fetcher to time out.
> * The same goes for the HTTP status line: no check is made on its size.
> * While reading chunked content, if the content size becomes larger than
> http.getMaxContent(), instead of just stopping, it tries to read a new
> chunk before having read the previous one completely, resulting in a
> 'bad chunk length' error.
>
> Additionally (and this concerns protocol-httpclient as well), when reading
> HTTP headers, the SpellCheckedMetadata class computes, for each header, a
> Levenshtein distance between it and every known header in the HttpHeaders
> interface. Not only is that slow, non-standard, and inconsistent with
> browsers' behavior, but it also causes bugs and prevents us from accessing
> the real headers sent by the HTTP server.
> * Example: [http://www.taz.de/!443358/]. The server sends a
> *Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata
> corrects it to *Transfer-Encoding: chunked*. Then HttpResponse (in
> protocol-http) tries to read the HTTP body as chunked, although it is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
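
For illustration only, here is a minimal sketch of how two of the points in the issue description (the stray colon after the status code, and the unbounded status line) might be handled. The class LenientStatusLine, its parse method, and the 8192-character cap are hypothetical names and values chosen for this example; they are not taken from protocol-http or from the attached patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): lenient HTTP status-line parsing. */
final class LenientStatusLine {
    /** Assumed cap on the status-line length; the issue names no specific value. */
    static final int MAX_STATUS_LINE_LENGTH = 8192;

    // HTTP version, whitespace, three-digit code, optional stray colon,
    // then an optional reason phrase separated by whitespace.
    private static final Pattern STATUS_LINE = Pattern.compile(
        "^HTTP/(\\d+)\\.(\\d+)[ \\t]+(\\d{3}):?(?:[ \\t]+(.*))?$");

    final int code;
    final String reason;

    private LenientStatusLine(int code, String reason) {
        this.code = code;
        this.reason = reason;
    }

    /**
     * Parses a status line, tolerating "HTTP/1.1 404: Not found".
     * Throws IllegalArgumentException if the line is too long or malformed.
     */
    static LenientStatusLine parse(String line) {
        if (line.length() > MAX_STATUS_LINE_LENGTH) {
            throw new IllegalArgumentException("status line too long");
        }
        Matcher m = STATUS_LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not an HTTP status line");
        }
        // Group 4 is null when the server sent no reason phrase at all.
        String reason = m.group(4) == null ? "" : m.group(4).trim();
        return new LenientStatusLine(Integer.parseInt(m.group(3)), reason);
    }
}
```

With this sketch, both "HTTP/1.1 404: Not found" and "HTTP/1.1 404 Not found" yield code 404 and the reason "Not found". Note that the length check here runs on a string that has already been read; a real fetcher would presumably need to enforce the cap while reading bytes from the socket, so that an endless status line is never buffered in full.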