Hello,

This patch prevents wget from stopping when the -nc option is in use and the file on disk (from a previous download) has a name that does not end in htm or html.
I found this bug while downloading a website with this URL: http://www.abc.com/dirdir/cgi-bin/Search.php?lng=EN&search=Query_Search_List

It was saved on disk in dirdir/cgi-bin/ as a file named: Search.php?lng=EN&search=Query_Search_List

When I used the -nc option, wget did not detect that this file could contain links. Because I believe wget should treat most files as text/html in case they contain links to follow, I fixed the problem by no longer checking for an htm or html suffix in the file name.

Sincerely,
--
Juan Miguel Taboada Godoy
Centrologic (Computational Logistic Center)
[email protected] - http://www.centrologic.com
(PGP key: 0xBF597018)
--- src/http.c.orig 2013-03-17 23:39:06.757306072 +0100
+++ src/http.c 2013-03-17 23:36:57.001463866 +0100
@@ -1486,9 +1486,14 @@
/* If the file is there, we suppose it's retrieved OK. */
*dt |= RETROKF;
- /* #### Bogusness alert. */
- /* If its suffix is "html" or "htm" or similar, assume text/html. */
- if (has_html_suffix_p (filename))
+ /* The URL often does not end in "htm" or "html", so we assume
+    the file may be text/html; this makes sure we check it for
+    links to other pages. This happens when the downloaded page
+    is named like foo.php?abc=def&ghi=jk: from the name alone we
+    cannot tell whether the resulting file is text/html or
+    something else. It is even more unpredictable when the
+    website uses friendly URLs such as /foo/abc/def/ghi/jk.
+    So we assume every file is text/html. */
*dt |= TEXTHTML;
}
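
To illustrate why a suffix test misses the file described in this message, here is a minimal sketch of such a check. It is not wget's has_html_suffix_p; the function name ends_with_html_suffix is hypothetical, and the logic (compare the text after the last '.' against "htm" and "html", using POSIX strcasecmp) is only an assumption about how a check of this kind typically works.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical helper, NOT wget's has_html_suffix_p.
   Returns 1 when the text after the last '.' in NAME is "htm" or
   "html" (case-insensitive), and 0 otherwise.  */
int
ends_with_html_suffix (const char *name)
{
  const char *dot;

  assert (name != NULL);

  /* Find the last '.' in the name; no dot means no suffix at all.  */
  dot = strrchr (name, '.');
  if (dot == NULL)
    return 0;

  return strcasecmp (dot + 1, "html") == 0
         || strcasecmp (dot + 1, "htm") == 0;
}
```

With this sketch, "index.html" is accepted, but "Search.php?lng=EN&search=Query_Search_List" is rejected because the text after its last dot is "php?lng=EN&search=Query_Search_List". A friendly-URL name such as "jk" is rejected as well, since it has no dot at all.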
