Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page
On 24/02/2011 22:22, Petr Gajdůšek wrote:
> 21:22:45 Info: engine: warning: entry cleaned up, but no trace on heap:

This one is only a warning, and is harmless. The other one (412/416 error) is not, and should not happen.

> HTTP request (at time 218.329301 second)
> GET /~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png
> HTTP answer (at 218.356642 seconds)
> HTTP/1.1 416 Requested Range Not Satisfiable

Did this result in a 412/416 error?

I tried to reproduce the issue by doing the following:
- partially mirrored the site and put it on a localhost for testing, using the default options of httrack
- partially mirrored the localhost site (default options, plus --debug-headers and --debug-log), and killed httrack in the middle of the download
- restarted the mirror

I could not get any 412/416 errors (there were many 416 responses in the hts-ioinfo.txt log, but they were correctly handled by the engine).

I also tried to restart the mirror, kill the engine, and restart again, and I could not get any error either.

Are you using specific options while mirroring? Are there any reliable steps to reproduce the issue?

(I'm still trying to get the bug reproduced on my side)

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page
tag 614966 +patch
thanks

Okay, I managed to reproduce the bug by killing a first update and then updating the mirror.

The issue was really painful to track (very random), and is related to the delayed type check (don't make any link test but wait for files download to start instead).

The bug will not occur with -%N0 (delayed type checking disabled).

Basically, httrack scans all HTML files sequentially, using a heap of links. Each new link is recorded on the heap, and httrack processes links until no more are found. HTML pages produce more links when scanned; static data (images, ...) is just skipped. The process stops when every link encountered has either already been added or is forbidden.

To speed up the process, a background downloader ensures that links can be added regularly. Once a download finishes, its entry is removed from the background heap, and the link heap is notified that the file was processed in the background (so that httrack can just skip this entry). This means that the background downloader must, obviously, find the file reference on the link heap.

The delayed type check option is a feature that allows the download of a file to start before the file is added to the link heap. The HTTP headers are then available before the link name is generated, so the file gets a correct extension on disk (i.e. www.example.com/foo.cgi will be named foo.gif if it is an image). Local filesystem browsing requires files to have a correct extension, because files have no MIME type metadata attached otherwise.

This is obviously buggy, because there is a small race condition window in which the background downloader finishes downloading the file before the link is added. In this case, httrack fails to find the link reference on the link heap, and displays the cryptic:

Info: engine: warning: entry cleaned up, but no trace on heap: (...)
This causes many problems, including corrupted files when HTTP retries are sent with preconditions, and many other headaches.

Suggested patch that should fix this longstanding (and very painful) issue:

diff -rudb httrack-3.43.12.orig/src/htsback.c httrack-3.43.12/src/htsback.c
--- httrack-3.43.12.orig/src/htsback.c 2010-12-21 11:30:12.0 +0100
+++ httrack-3.43.12/src/htsback.c 2011-02-27 21:18:11.53158 +0100
@@ -2150,10 +2150,12 @@
 static int slot_can_be_finalized(httrackp* opt, const lien_back* back) {
   return
-    (back->r.is_write // not in memory (on disk, ready)
+    back->r.is_write // not in memory (on disk, ready)
     && !is_hypertext_mime(opt,back->r.contenttype, back->url_fil) // not HTML/hypertext
     && !may_be_hypertext_mime(opt,back->r.contenttype, back->url_fil) // may NOT be parseable mime type
-    );
+    /* Has not been added before the heap saw the link, or now exists on heap */
+    && ( !back->early_add || hash_read(opt->hash,back->url_sav,"",0,opt->urlhack) >= 0 )
+    ;
 }
 
 void back_clean(httrackp* opt,cache_back* cache,struct_back* sback) {
@@ -3243,7 +3245,7 @@
 
       /* Solve false 416 problems */
-      if (back[i].r.statuscode==416) { // 'Requested Range Not Satisfiable'
+      if (back[i].r.statuscode==HTTP_REQUESTED_RANGE_NOT_SATISFIABLE) { // 'Requested Range Not Satisfiable'
         // Example:
         // Range: bytes=2830-
         // ->
diff -rudb httrack-3.43.12.orig/src/htscore.h httrack-3.43.12/src/htscore.h
--- httrack-3.43.12.orig/src/htscore.h 2010-12-21 11:30:13.0 +0100
+++ httrack-3.43.12/src/htscore.h 2011-02-27 21:07:51.514117000 +0100
@@ -207,6 +207,7 @@
   char info[256]; // éventuel status pour le ftp
   int stop_ftp; // flag stop pour ftp
   int finalized; // finalized (optim memory)
+  int early_add; // was added before link heap saw it
 #if DEBUG_CHECKINT
   char magic2;
 #endif
diff -rudb httrack-3.43.12.orig/src/htshash.c httrack-3.43.12/src/htshash.c
--- httrack-3.43.12.orig/src/htshash.c 2010-12-21 11:30:13.0 +0100
+++ httrack-3.43.12/src/htshash.c 2011-02-27 20:20:09.714432000 +0100
@@ -63,7 +63,7 @@
 // type: numero enregistrement - 0 est case insensitive (sav) 1 (adr+fil) 2 (former_adr+former_fil)
 // recherche dans la table selon nom1,nom2 et le no d'enregistrement
 // retour: position ou -1 si non trouvé
-int hash_read(hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
+int hash_read(const hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
   char BIGSTK normfil_[HTS_URLMAXSIZE*2];
   char catbuff[CATBUFF_SIZE];
   char* normfil;
diff -rudb httrack-3.43.12.orig/src/htshash.h httrack-3.43.12/src/htshash.h
--- httrack-3.43.12.orig/src/htshash.h 2010-12-21 11:30:13.0 +0100
+++ httrack-3.43.12/src/htshash.h 2011-02-27 20:20:47.289581000 +0100
@@ -50,7 +50,7 @@
 #endif
 
 //
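For readers who do not know the httrack internals, the race and the guard added by the patch can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `link_heap`, `back_slot`, `heap_find`, `heap_add` and `sketch_can_finalize` are hypothetical, simplified stand-ins and not httrack's real types or functions. Only the idea is taken from the patch: a slot whose download started before the link heap saw the link (`early_add`) must not be finalized until the link exists on the heap.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the link heap and a background
   download slot. These are NOT httrack's real data structures. */
#define MAX_LINKS 16

typedef struct {
  const char *saved[MAX_LINKS]; /* local save names already on the heap */
  size_t count;
} link_heap;

typedef struct {
  const char *url_sav; /* local save name of the download */
  bool on_disk;        /* transfer finished and written to disk */
  bool early_add;      /* download started before the heap saw the link */
} back_slot;

/* Returns the index of the name on the heap, or -1 if it is absent. */
static int heap_find(const link_heap *heap, const char *name) {
  for (size_t i = 0; i < heap->count; i++) {
    if (strcmp(heap->saved[i], name) == 0) return (int)i;
  }
  return -1;
}

/* Records a link on the heap; returns false if the heap is full. */
static bool heap_add(link_heap *heap, const char *name) {
  if (heap->count >= MAX_LINKS) return false;
  heap->saved[heap->count++] = name;
  return true;
}

/* A finished slot may be released only if it was not started early,
   or if its link has meanwhile been recorded on the heap. Without the
   second condition, an early slot could be cleaned up with no trace
   on the heap, which is the race described above. */
static bool sketch_can_finalize(const link_heap *heap, const back_slot *slot) {
  return slot->on_disk
      && (!slot->early_add || heap_find(heap, slot->url_sav) >= 0);
}
```

In this model, a slot with `early_add` set stays in the background list until `heap_add` has recorded its save name, so the cleanup step can always find the matching heap entry.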
Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page
Package: httrack
Version: 3.43.12-1
Severity: important

Hello,

I use httrack to mirror an intranet web site. The content of some files (the affected files differ between mirror runs) is replaced with an HTTP 416 error message in HTML.

For each affected file there are two entries in the httrack log:

1) Info: engine: warning: entry cleaned up, but no trace on heap: URL (location of mirrored file)
2) Near the log tail: Warning: Unexpected 412/416 error (Requested Range Not Satisfiable) for URL

Workarounds I've found:

1) Delete the affected files from the mirror and run an update. Repeat until there are no erroneous files left.
2) Increase the maximum number of connections per second with: --connection-per-second=(>=10) --disable-security-limits
3) Use the -%N0 switch (this disables the delayed type check).

The last workaround works every time.

There is a thread in the httrack forum with more info: http://forum.httrack.com/readmsg/20479/

If this bug is not easy to fix, it would be nice if affected files were not stored in the mirror (as an update does not retry downloading them) and were treated as errors by httrack, and/or if -%N0 were made the default.

Cheers,
Petr

--- System information. ---
Architecture: i386
Kernel: Linux 2.6.37-1-686

Debian Release: wheezy/sid
  500 unstable www.debian-multimedia.org
  500 unstable unofficial.debian-maintainers.org
  500 unstable ftp.cz.debian.org

--- Package information. ---
Depends       (Version) | Installed
========================-+-=================
libc6     (>= 2.3.6-6~) | 2.11.2-11
libhttrack2             | 3.43.12-1
zlib1g    (>= 1:1.1.4)  | 1:1.2.3.4.dfsg-3

Package's Recommends field is empty.

Suggests      (Version) | Installed
========================-+-=================
webhttrack              | 3.43.12-1
httrack-doc             | 3.43.12-1

-- 
Regards,
Petr Gajdůšek
Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page
On 24/02/2011 15:17, Petr Gajdůšek wrote:
> I use httrack to mirror an intranet web. Content of some files (they change
> between each mirror) is replaced with HTTP 416 error message in HTML.

Could you please enable the debug header feature (--debug-headers) and report both the request and the reply for a buggy page?
Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page
On 24.2.2011 18:06, Xavier Roche wrote:
> On 24/02/2011 15:17, Petr Gajdůšek wrote:
>> I use httrack to mirror an intranet web. Content of some files (they change
>> between each mirror) is replaced with HTTP 416 error message in HTML.
>
> Could you enable the debug header feature (--debug-headers) and report
> both request and reply of a buggy page please ?

Hi,

Sorry, I tried this but cannot find any information produced by --debug-headers. There are no additional entries in hts-log.txt, nor on the console with the -v parameter.

Here are the entries from wireshark for one of the failed files:

HTTP request (at 13.835486 seconds)

GET /~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png HTTP/1.1
Referer: http://localhost/~petr/obce/testing/www.skrdla66.cz/
Cookie: $Version=1; lang=cz; $Path=/; accesible=off; $Path=/; category=0; $Path=/
Connection: Keep-Alive
Host: localhost
User-Agent: Mozilla/5.0 (X11; U; Linux i686; cs-CZ; rv:1.9.1.16) Gecko/20110107 Iceweasel/3.5.16 (like Firefox/3.5.16)
Accept: image/png, image/jpeg, image/pjpeg, image/x-xbitmap, image/svg+xml, image/gif;q=0.9, */*;q=0.1
Accept-Language: en, *
Accept-Charset: iso-8859-1, iso-8859-*;q=0.9, utf-8;q=0.66, *;q=0.33
Accept-Encoding: gzip, identity;q=0.9

HTTP answer (at 13.837991 seconds)

HTTP/1.1 200 OK
Date: Thu, 24 Feb 2011 20:22:45 GMT
Server: Apache/2.2.17 (Debian)
Last-Modified: Sun, 06 Feb 2011 00:48:11 GMT
ETag: "8079b-30f-49b92783348c0"
Accept-Ranges: bytes
Content-Length: 783
Keep-Alive: timeout=15, max=79
Connection: Keep-Alive
Content-Type: image/png

... PNG data follows. I checked the file and it is correct.
Now this entry appears in hts-log.txt:

21:22:45 Info: engine: warning: entry cleaned up, but no trace on heap: localhost/~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png (skrdla66_cd1/localhost/_petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png)

All other URLs (some with the same failure) are then processed, and just before httrack exits the failed files are retried:

HTTP request (at time 218.329301 second)

GET /~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png HTTP/1.1
If-Unmodified-Since: Sun, 06 Feb 2011 00:48:11 GMT
Range: bytes=783-
Referer: http://localhost/~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Styles/multiBox.css
Cookie: $Version=1; lang=cz; $Path=/; accesible=off; $Path=/; category=0; $Path=/
Connection: Keep-Alive
Host: localhost
User-Agent: Mozilla/5.0 (X11; U; Linux i686; cs-CZ; rv:1.9.1.16) Gecko/20110107 Iceweasel/3.5.16 (like Firefox/3.5.16)
Accept: image/png, image/jpeg, image/pjpeg, image/x-xbitmap, image/svg+xml, image/gif;q=0.9, */*;q=0.1
Accept-Language: en, *
Accept-Charset: iso-8859-1, iso-8859-*;q=0.9, utf-8;q=0.66, *;q=0.33
Accept-Encoding: gzip, identity;q=0.9

HTTP answer (at 218.356642 seconds)

HTTP/1.1 416 Requested Range Not Satisfiable
Date: Thu, 24 Feb 2011 20:26:09 GMT
Server: Apache/2.2.17 (Debian)
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=15, max=51
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

httrack requests the content starting just past the end of the file (the file is 783 bytes long and the request asks for bytes=783-), and the already stored file content is then replaced with the HTML error page.

Cheers,
Petr
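The trace above can be read as a resume request whose start offset equals the size of the resource, which a server cannot satisfy. The following is a minimal sketch of how a client could classify such a retry so that an error page never overwrites a complete local file. It is not httrack's actual logic: `resume_action`, `range_start_satisfiable` and `classify_resume` are hypothetical names, and the policy shown is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical outcome of a resume attempt; not part of httrack. */
typedef enum {
  KEEP_LOCAL_FILE,   /* local copy is already complete; ignore the body */
  APPEND_BODY,       /* 206 Partial Content: append body at the offset */
  REPLACE_WITH_BODY, /* 200 OK: server ignored Range, body is the whole file */
  DISCARD_BODY       /* any other status: the body is not file data */
} resume_action;

/* An open-ended range "bytes=first-" is satisfiable only if the first
   byte position lies inside the resource. */
static bool range_start_satisfiable(long long first_byte, long long resource_len) {
  return first_byte >= 0 && first_byte < resource_len;
}

/* offset:     number of bytes already stored locally (the Range start).
   known_size: resource size learned from an earlier response, or -1 if
               unknown (assumption: the client remembers Content-Length). */
static resume_action classify_resume(int status, long long offset, long long known_size) {
  switch (status) {
    case 206:
      return APPEND_BODY;
    case 200:
      return REPLACE_WITH_BODY;
    case 416:
      /* Asking for bytes past the end of a file we already hold in full
         is not a real failure: keep the local copy untouched. */
      if (known_size >= 0 && !range_start_satisfiable(offset, known_size)) {
        return KEEP_LOCAL_FILE;
      }
      return DISCARD_BODY;
    default:
      /* 412 and other errors: the body is an error page, never file data. */
      return DISCARD_BODY;
  }
}
```

With the values from the trace, `classify_resume(416, 783, 783)` yields `KEEP_LOCAL_FILE`. In no branch is the body of a 416 response written into the mirrored file.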