Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page

2011-02-27 Thread Xavier Roche
On 24/02/2011 22:22, Petr Gajdůšek wrote:
> 21:22:45 Info: engine: warning: entry cleaned up, but no trace
> on heap:

This is only a warning, and a harmless one. The other one (412/416 error) is
not harmless, and should not happen.

> HTTP request (at time 218.329301 second)
> GET
> /~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png
>
> HTTP answer (at 218.356642 seconds)
> HTTP/1.1 416 Requested Range Not Satisfiable

Did this result in a 412/416 error?

I tried to reproduce the issue by doing the following:
- partially mirrored the site and put it on a localhost for testing,
using default options of httrack
- partially mirrored the localhost site (default options, plus
--debug-headers and --debug-log), and killed httrack in the middle of
the download
- restarted the mirror

I could not trigger the 412/416 error (there are many 416 responses in the
hts-ioinfo.txt log, but they are correctly handled by the engine).

I also tried to restart the mirror, kill the engine, and restart again,
and I could not get any error either.

Are you using specific options while mirroring? Is there any reliable way to
reproduce the issue?

(I'm still trying to get the bug reproduced on my side)



--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page

2011-02-27 Thread Xavier Roche
tag 614966 +patch
thanks

Okay, I managed to reproduce the bug, by killing a first update, and
then updating the mirror.

The issue was really painful to track down (very random), and is related to
the delayed type check (don't make any link test, but wait for the file
download to start instead).

The bug will not occur with -%N0 (delayed type checking disabled).

Basically, httrack scans all HTML files sequentially, using a heap of
links. Each new link is recorded on the heap, and httrack processes
links until no new link is found. HTML pages produce more links when
scanned; static data (images, ...) is just skipped. The process stops
when every encountered link has either already been added or is
forbidden.
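
As an illustration only, the scan described above can be sketched as a loop over a growing array. This is a toy model, not httrack's code: link_heap, heap_add, crawl and the two-page SITE table are all hypothetical names made up for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_LINKS 16

/* Hypothetical link heap: a growing array that is scanned sequentially. */
typedef struct {
  const char *url[MAX_LINKS];
  int count;
} link_heap;

/* Record a link unless it is already on the heap (or the heap is full).
   Returns 1 if the link was added, 0 otherwise. */
static int heap_add(link_heap *heap, const char *url) {
  for (int i = 0; i < heap->count; i++)
    if (strcmp(heap->url[i], url) == 0)
      return 0;
  if (heap->count == MAX_LINKS)
    return 0;
  heap->url[heap->count++] = url;
  return 1;
}

/* Toy site: the links found in each HTML page. Static data such as
   /logo.png has no entry, so scanning it yields nothing. */
typedef struct {
  const char *url;
  const char *links[3]; /* NULL-terminated */
} page;

static const page SITE[] = {
  {"/index.html", {"/a.html", "/logo.png", NULL}},
  {"/a.html", {"/index.html", "/logo.png", NULL}},
};

/* Scan the heap from the start; it grows while we walk it, and the loop
   ends once no page contributes a link that is not already recorded.
   Returns the number of HTML pages that were parsed. */
static int crawl(link_heap *heap, const char *start) {
  int scanned = 0;
  heap->count = 0;
  heap_add(heap, start);
  for (int i = 0; i < heap->count; i++) {
    for (size_t p = 0; p < sizeof SITE / sizeof SITE[0]; p++) {
      if (strcmp(SITE[p].url, heap->url[i]) != 0)
        continue;
      for (int k = 0; k < 3 && SITE[p].links[k] != NULL; k++)
        heap_add(heap, SITE[p].links[k]);
      scanned++;
    }
  }
  return scanned;
}
```

Starting from /index.html, the loop parses the two HTML pages, records /logo.png once, and stops because nothing new is left to add.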

To speed up the process, a background downloader ensures that links can be
fetched ahead of time. Once a download is finished, the entry is removed
from the background heap, and the link heap is notified that the file was
processed in the background (so that httrack can just skip this entry).
This means that the background downloader must be able to find the file
reference on the link heap, obviously.

The delayed type check option is a feature that allows the download of a
file to start before it is added to the link heap. This way the HTTP
headers are available before the link name is generated, so the file gets
a correct extension on disk (i.e. www.example.com/foo.cgi will be named
foo.gif if it is an image). Local filesystem browsing requires files to
have a correct extension, because files on disk do not have any MIME
type metadata attached otherwise.
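
A minimal sketch of the renaming idea, assuming a three-entry MIME table: ext_for_mime and local_name are hypothetical helpers written for this example, not functions from httrack.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, deliberately tiny MIME table (an assumption for the
   example; a real mirroring tool knows many more types). */
static const char *ext_for_mime(const char *mime) {
  static const struct { const char *mime, *ext; } table[] = {
    {"image/gif", "gif"}, {"image/png", "png"}, {"text/html", "html"},
  };
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (strcmp(table[i].mime, mime) == 0)
      return table[i].ext;
  return NULL; /* unknown type: keep the original name */
}

/* Build the local name for `remote` once the Content-Type is known.
   Returns 0 on success, -1 if the result does not fit in `out`. */
static int local_name(const char *remote, const char *mime,
                      char *out, size_t outlen) {
  const char *ext = ext_for_mime(mime);
  const char *slash = strrchr(remote, '/');
  const char *dot = strrchr(remote, '.');
  size_t stem = strlen(remote);
  int n;

  /* Drop the old extension only if the last dot is in the file name,
     not in a directory or host component. */
  if (ext != NULL && dot != NULL && (slash == NULL || dot > slash))
    stem = (size_t)(dot - remote);

  if (ext != NULL)
    n = snprintf(out, outlen, "%.*s.%s", (int)stem, remote, ext);
  else
    n = snprintf(out, outlen, "%s", remote);
  return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

The point of the sketch is the ordering constraint: the name can only be computed after the response headers have arrived, which is why the download has to start before the link gets its final name.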

This is obviously buggy, because there is a small race condition window
in which the background downloader can finish downloading the file before
the link is added.

In this case, httrack will fail to find the link reference on the link
heap, and will display the cryptic:

Info:   engine: warning: entry cleaned up, but no trace on heap:  (...)

This causes a lot of trouble, including corrupted files when HTTP retries
are sent with preconditions, and many other headaches.
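
To make the race window concrete, here is a simplified model of the guard: a finished slot that was started early must not be cleaned up until its save name can be found on the link heap. The types and helpers (link_heap, back_slot, heap_find, can_finalize) are hypothetical stand-ins for illustration and do not reproduce httrack's real structures.

```c
#include <assert.h>
#include <string.h>

#define MAX_LINKS 8

/* Hypothetical link heap holding local save names. */
typedef struct {
  const char *saved[MAX_LINKS];
  int count;
} link_heap;

/* Hypothetical background download slot. */
typedef struct {
  const char *url_sav; /* local save name */
  int done;            /* transfer finished */
  int early_add;       /* started before the link heap saw the link */
} back_slot;

/* Return the heap position of `name`, or -1 if it is not recorded. */
static int heap_find(const link_heap *heap, const char *name) {
  for (int i = 0; i < heap->count; i++)
    if (strcmp(heap->saved[i], name) == 0)
      return i;
  return -1;
}

/* A finished slot may be finalized only if it was not started early,
   or if the link has been recorded on the heap in the meantime. */
static int can_finalize(const link_heap *heap, const back_slot *slot) {
  return slot->done
      && (!slot->early_add || heap_find(heap, slot->url_sav) >= 0);
}
```

In this model an early slot simply stays in the background heap until the link shows up, instead of being cleaned up with no trace on the link heap.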


Suggested patch that should fix this longstanding (and very painful) issue:

diff -rudb httrack-3.43.12.orig/src/htsback.c httrack-3.43.12/src/htsback.c
--- httrack-3.43.12.orig/src/htsback.c  2010-12-21 11:30:12.0 +0100
+++ httrack-3.43.12/src/htsback.c   2011-02-27 21:18:11.53158 +0100
@@ -2150,10 +2150,12 @@

 static int slot_can_be_finalized(httrackp* opt, const lien_back* back) {
   return
-(back->r.is_write // not in memory (on disk, ready)
+back->r.is_write // not in memory (on disk, ready)
  && !is_hypertext_mime(opt,back->r.contenttype, back->url_fil) // not HTML/hypertext
  && !may_be_hypertext_mime(opt,back->r.contenttype, back->url_fil) // may NOT be parseable mime type
-);
+/* Has not been added before the heap saw the link, or now exists on heap */
+ && ( !back->early_add || hash_read(opt->hash,back->url_sav,"",0,opt->urlhack) >= 0 )
+;
 }

 void back_clean(httrackp* opt,cache_back* cache,struct_back* sback) {
@@ -3243,7 +3245,7 @@
 /*
 Solve false 416 problems
 */
-if (back[i].r.statuscode==416) {  // 'Requested Range Not Satisfiable'
+if (back[i].r.statuscode==HTTP_REQUESTED_RANGE_NOT_SATISFIABLE) {  // 'Requested Range Not Satisfiable'
   // Example:
   // Range: bytes=2830-
   // -
diff -rudb httrack-3.43.12.orig/src/htscore.h httrack-3.43.12/src/htscore.h
--- httrack-3.43.12.orig/src/htscore.h  2010-12-21 11:30:13.0 +0100
+++ httrack-3.43.12/src/htscore.h   2011-02-27 21:07:51.514117000 +0100
@@ -207,6 +207,7 @@
   char info[256]; // éventuel status pour le ftp
   int stop_ftp;   // flag stop pour ftp
   int finalized;  // finalized (optim memory)
+  int early_add;  // was added before link heap saw it
 #if DEBUG_CHECKINT
   char magic2;
 #endif
diff -rudb httrack-3.43.12.orig/src/htshash.c httrack-3.43.12/src/htshash.c
--- httrack-3.43.12.orig/src/htshash.c  2010-12-21 11:30:13.0 +0100
+++ httrack-3.43.12/src/htshash.c   2011-02-27 20:20:09.714432000 +0100
@@ -63,7 +63,7 @@
 // type: numero enregistrement - 0 est case insensitive (sav) 1 (adr+fil) 2 (former_adr+former_fil)
 // recherche dans la table selon nom1,nom2 et le no d'enregistrement
 // retour: position ou -1 si non trouvé
-int hash_read(hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
+int hash_read(const hash_struct* hash,char* nom1,char* nom2,int type,int normalized) {
   char BIGSTK normfil_[HTS_URLMAXSIZE*2];
   char catbuff[CATBUFF_SIZE];
   char* normfil;
diff -rudb httrack-3.43.12.orig/src/htshash.h httrack-3.43.12/src/htshash.h
--- httrack-3.43.12.orig/src/htshash.h  2010-12-21 11:30:13.0 +0100
+++ httrack-3.43.12/src/htshash.h   2011-02-27 20:20:47.289581000 +0100
@@ -50,7 +50,7 @@
 #endif

 // 

Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page

2011-02-24 Thread Petr Gajdůšek

Package: httrack
Version: 3.43.12-1
Severity: important

Hello,

I use httrack to mirror an intranet website. The content of some files (a
different set on each run) is replaced with the HTML page of an HTTP 416
error.


For each affected file there are two entries in the httrack log:
1) Info: engine: warning: entry cleaned up, but no trace on heap: URL 
(location of mirrored file)

2) Near the log tail:
Warning: Unexpected 412/416 error (Requested Range Not Satisfiable) for 
URL


Workarounds I've found:
1) Delete the affected files from the mirror and run an update. Repeat until
there are no erroneous files left.

2) Increase the maximum number of connections per second with:
--connection-per-second=(>=10) --disable-security-limits

3) Use the -%N0 switch (this disables the delayed type check).

The last workaround works every time.

There is a thread in httrack forum with more info:
http://forum.httrack.com/readmsg/20479/

If this bug is not easy to fix, it would be nice if the affected files were
not stored in the mirror (since an update does not retry downloading them)
and were treated as errors by httrack, and/or if -%N0 were made the default.


Cheers, Petr

--- System information. ---
Architecture: i386
Kernel:   Linux 2.6.37-1-686

Debian Release: wheezy/sid
  500 unstable        www.debian-multimedia.org
  500 unstable        unofficial.debian-maintainers.org
  500 unstable        ftp.cz.debian.org

--- Package information. ---
Depends                 (Version) | Installed
=================================-+-=================
libc6               (>= 2.3.6-6~) | 2.11.2-11
libhttrack2                       | 3.43.12-1
zlib1g               (>= 1:1.1.4) | 1:1.2.3.4.dfsg-3

Package's Recommends field is empty.

Suggests    (Version) | Installed
=====================-+-===========
webhttrack            | 3.43.12-1
httrack-doc           | 3.43.12-1





--
Best regards,
Petr Gajdůšek






Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page

2011-02-24 Thread Xavier Roche
On 24/02/2011 15:17, Petr Gajdůšek wrote:
> I use httrack to mirror an intranet web. Content of some files (they
> change between each mirror) is replaced with HTTP 416 error message in
> HTML.

Could you enable the debug header feature (--debug-headers) and report
both the request and the reply for a buggy page, please?







Bug#614966: [httrack] Content of some random files is replaced with HTTP 416 Error HTML page

2011-02-24 Thread Petr Gajdůšek

On 24.2.2011 18:06, Xavier Roche wrote:
> On 24/02/2011 15:17, Petr Gajdůšek wrote:
>> I use httrack to mirror an intranet web. Content of some files (they
>> change between each mirror) is replaced with HTTP 416 error message in
>> HTML.
>
> Could you enable the debug header feature (--debug-headers) and report
> both request and reply of a buggy page please ?



Hi,

Sorry, I tried this but cannot find any information produced by
--debug-headers. There are no additional entries in hts-log.txt, nor on the
console with the -v parameter.


Here are the entries from Wireshark for one of the failed files:

HTTP request (at 13.835486 seconds)
GET 
/~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png 
HTTP/1.1

Referer: http://localhost/~petr/obce/testing/www.skrdla66.cz/
Cookie: $Version=1; lang=cz; $Path=/; accesible=off; $Path=/; 
category=0; $Path=/

Connection: Keep-Alive
Host: localhost
User-Agent: Mozilla/5.0 (X11; U; Linux i686; cs-CZ; rv:1.9.1.16) 
Gecko/20110107 Iceweasel/3.5.16 (like Firefox/3.5.16)
Accept: image/png, image/jpeg, image/pjpeg, image/x-xbitmap, 
image/svg+xml, image/gif;q=0.9, */*;q=0.1

Accept-Language: en, *
Accept-Charset: iso-8859-1, iso-8859-*;q=0.9, utf-8;q=0.66, *;q=0.33
Accept-Encoding: gzip, identity;q=0.9

HTTP answer (at 13.837991 seconds)
HTTP/1.1 200 OK
Date: Thu, 24 Feb 2011 20:22:45 GMT
Server: Apache/2.2.17 (Debian)
Last-Modified: Sun, 06 Feb 2011 00:48:11 GMT
ETag: 8079b-30f-49b92783348c0
Accept-Ranges: bytes
Content-Length: 783
Keep-Alive: timeout=15, max=79
Connection: Keep-Alive
Content-Type: image/png

... PNG data follows; I checked the file and it is correct.
At this point the following entry appears in hts-log.txt:
21:22:45 Info: engine: warning: entry cleaned up, but no trace on heap:
localhost/~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png
(skrdla66_cd1/localhost/_petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png)


All the other URLs are then processed (some of them with the same failure),
and just before httrack exits, the failed files are retried:


HTTP request (at time 218.329301 second)
GET 
/~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Images/mb_Components/leftDisabled.png 
HTTP/1.1

If-Unmodified-Since: Sun, 06 Feb 2011 00:48:11 GMT
Range: bytes=783-
Referer: 
http://localhost/~petr/obce/testing/www.skrdla66.cz/engine/javascripts_new/multiBox/Styles/multiBox.css
Cookie: $Version=1; lang=cz; $Path=/; accesible=off; $Path=/; 
category=0; $Path=/

Connection: Keep-Alive
Host: localhost
User-Agent: Mozilla/5.0 (X11; U; Linux i686; cs-CZ; rv:1.9.1.16) 
Gecko/20110107 Iceweasel/3.5.16 (like Firefox/3.5.16)
Accept: image/png, image/jpeg, image/pjpeg, image/x-xbitmap, 
image/svg+xml, image/gif;q=0.9, */*;q=0.1

Accept-Language: en, *
Accept-Charset: iso-8859-1, iso-8859-*;q=0.9, utf-8;q=0.66, *;q=0.33
Accept-Encoding: gzip, identity;q=0.9

HTTP answer (at 218.356642 seconds)
HTTP/1.1 416 Requested Range Not Satisfiable
Date: Thu, 24 Feb 2011 20:26:09 GMT
Server: Apache/2.2.17 (Debian)
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=15, max=51
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

httrack is requesting the content starting just past the end of the file
(the file is 783 bytes long and the request asks for bytes=783-), and the
already stored file content is then replaced with the HTML error page.
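
A small sketch of why this request cannot succeed: under the HTTP range rules, an open-ended "bytes=first-" range is satisfiable only if first is smaller than the current length of the resource. The helper open_range_status is hypothetical and only models that one rule; it ignores the If-Unmodified-Since precondition, which holds here because the file is unchanged.

```c
#include <assert.h>

typedef enum {
  RANGE_PARTIAL_CONTENT = 206,
  RANGE_NOT_SATISFIABLE = 416
} range_status;

/* Status expected for "Range: bytes=<first>-" on a resource that is
   `length` bytes long (hypothetical helper, byte ranges only). */
static range_status open_range_status(long long first, long long length) {
  assert(first >= 0 && length >= 0);
  return first < length ? RANGE_PARTIAL_CONTENT : RANGE_NOT_SATISFIABLE;
}
```

For the capture above, open_range_status(783, 783) gives 416: the local copy is already complete, so there is nothing left to resume.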


Cheers, Petr


