[ 
https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Bouchar updated NUTCH-2549:
----------------------------------
    Description: 
We identified the following issues in protocol-http (a plugin implementing the 
HTTP protocol):
 * It fails if a URL's path does not start with '/'
 ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers 
correctly rewrite the URL as [http://news.fx678.com/?171], while Nutch tries to 
send an invalid HTTP request starting with *GET ?171 HTTP/1.0*).
 * It advertises its requests as being HTTP/1.0, but sends the 
Accept-Encoding request header, which is defined only in 
HTTP/1.1. This confuses some web servers.
 ** Example: 
[http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
 * If a server sends a redirection (3XX status code, with a Location 
header), protocol-http tries to parse the HTTP response body anyway. Thus, if 
an error occurs while decoding the body, the redirection is not followed and 
the information is lost. Browsers follow the redirection and close the socket 
as soon as they can.
 ** Example: [http://www.webarcelona.net/es/blog?page=2]
 * Some servers invalidly send an HTTP body directly, without a status line or 
headers. Browsers handle that; protocol-http doesn't.
 ** Example: [https://app.unitymedia.de/]
 * Some servers invalidly add colons after the HTTP status code in the status 
line (they can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not 
found_ for instance). Browsers can handle that.
 * Some servers invalidly send headers that span multiple lines. In that 
case, browsers simply ignore the subsequent lines, but protocol-http throws an 
error, thus preventing us from fetching the contents of the page.
 * There is no limit on the size of the HTTP headers it reads. A bogus server 
could send an infinite stream of different HTTP headers and cause the fetcher 
to run out of memory, or send the same HTTP header repeatedly and cause the 
fetcher to time out.
 * The same goes for the HTTP status line: no check is made concerning its size.
 * While reading chunked content, if the content size becomes larger than 
http.getMaxContent(), instead of just stopping, it tries 
to read a new chunk before having read the previous one completely, resulting 
in a 'bad chunk length' error.
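
A few of the points above (missing leading '/', stray colon after the status 
code, responses with no status line, unbounded line length) can be illustrated 
with a minimal Java sketch. This is not the Nutch code or a proposed patch: the 
class LenientHttpSketch, its method names, and the 8192-byte cap are all 
assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Nutch. */
class LenientHttpSketch {

  /** Assumed cap on the length of a status or header line. */
  static final int MAX_LINE_BYTES = 8192;

  // "HTTP/x.y", a three-digit code, an optional stray colon, then an
  // optional reason phrase.
  private static final Pattern STATUS_LINE =
      Pattern.compile("^HTTP/\\d+\\.\\d+\\s+(\\d{3}):?(?:\\s.*)?$");

  /** Builds the request target, forcing the path to start with '/'. */
  static String requestTarget(URI uri) {
    String path = uri.getRawPath();
    if (path == null) {
      path = "";
    }
    if (!path.startsWith("/")) {
      path = "/" + path;
    }
    String query = uri.getRawQuery();
    return query == null ? path : path + "?" + query;
  }

  /**
   * Returns the status code, or -1 if the line does not look like a status
   * line, so the caller can decide to treat the data as a bare body.
   */
  static int parseStatusCode(String line) {
    Matcher m = STATUS_LINE.matcher(line.trim());
    return m.matches() ? Integer.parseInt(m.group(1)) : -1;
  }

  /** Reads one LF- or CRLF-terminated line, failing once it exceeds the cap. */
  static String readBoundedLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n') {
      if (sb.length() >= MAX_LINE_BYTES) {
        throw new IOException("HTTP line longer than " + MAX_LINE_BYTES + " bytes");
      }
      sb.append((char) c);
    }
    int len = sb.length();
    if (len > 0 && sb.charAt(len - 1) == '\r') {
      sb.setLength(len - 1); // drop the CR of a CRLF terminator
    }
    return sb.toString();
  }
}
```

With this sketch, requestTarget(URI.create("http://news.fx678.com?171")) yields 
"/?171", and parseStatusCode accepts both "HTTP/1.1 404: Not found" and 
"HTTP/1.1 404 Not found". A similar cap on the number of header lines would be 
needed to bound the total header size.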

Additionally (and this concerns protocol-httpclient as well), 
when reading HTTP headers, for each header, the SpellCheckedMetadata class 
computes a Levenshtein distance between it and every known header in the 
HttpHeaders interface. Not only is that slow, non-standard, and inconsistent with 
browsers' behavior, but it also causes bugs and prevents us from accessing the 
real headers sent by the HTTP server.
 * Example: [http://www.taz.de/!443358/]. The server sends a 
*Client-Transfer-Encoding: chunked* header, but SpellCheckedMetadata corrects 
it to *Transfer-Encoding: chunked*. Then, HttpResponse (in protocol-http) tries 
to read the HTTP body as chunked, although it is not.
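
To illustrate the behavior we would expect instead, here is a minimal Java 
sketch of a header store that matches names case-insensitively but exactly, 
with no fuzzy correction. The class ExactHeadersSketch and its methods are 
hypothetical and are not the Nutch Metadata API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical header store for illustration only; not part of Nutch. */
class ExactHeadersSketch {

  // Keys compare case-insensitively, as HTTP header names do, but a name
  // is never rewritten to a "similar" known header.
  private final Map<String, List<String>> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  void add(String name, String value) {
    headers.computeIfAbsent(name.trim(), k -> new ArrayList<>()).add(value.trim());
  }

  /** Returns the first value stored for the header, or null if absent. */
  String get(String name) {
    List<String> values = headers.get(name);
    return values == null ? null : values.get(0);
  }

  /** True only if a real Transfer-Encoding header mentions "chunked". */
  boolean isChunked() {
    String te = get("Transfer-Encoding");
    return te != null && te.toLowerCase(Locale.ROOT).contains("chunked");
  }
}
```

With such a store, a response carrying only *Client-Transfer-Encoding: chunked* 
is not treated as chunked, and the original header stays readable under its 
real name.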

 


> protocol-http does not behave the same as browsers
> --------------------------------------------------
>
>                 Key: NUTCH-2549
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2549
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
