Re: Only follow paths with /res/ in them

2008-11-19 Thread Micah Cowan

Brian wrote:
 I would like to follow all the urls on a site that contain /res/ in the
 path. I've tried using -I and -A, with values such as res, *res*,
 */res/*, etc. Here is an example that downloads pretty much the entire
 site, rather than what I appear (to me) to have specified:
 
 wget -O- -q http://img.site.org/b/imgboard.html | wget -q -r -l1 -O- -I
 '*res*' -A '*res*' --force-html -B http://img.site.org/b/ -i-
 
 The urls I would like to follow and output to the command line are of
 the form:
 
 http://img.site.org/b/res/97867797.html

-A isn't useful here: it's applied only against the filename portion
of the URL.

-I is what you want; the trouble is that the * wildcard doesn't match
slashes (there are plans to introduce a ** wildcard, probably in 1.13).
So unfortunately you've got to do -I 'res,*/res,*/*/res' etc., as needed.
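
Applied to your example (an untested sketch; adjust the -I list to the
depths you actually need):

  wget -O- -q http://img.site.org/b/imgboard.html | \
    wget -q -r -l1 -O- -I '/b/res' --force-html -B http://img.site.org/b/ -i-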

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Only follow paths with /res/ in them

2008-11-19 Thread Micah Cowan

Oh! Please don't use this list (wget@sunsite.dk) any more; I'm trying to
get the dotsrc folks to make it go away/forward to bug-wget (I need to
ping 'em on this again). The official list for Wget is now [EMAIL PROTECTED]

Micah Cowan wrote:
 Brian wrote:
 I would like to follow all the urls on a site that contain /res/ in the
 path. I've tried using -I and -A, with values such as res, *res*,
 */res/*, etc. Here is an example that downloads pretty much the entire
 site, rather than what I appear (to me) to have specified:
 
 wget -O- -q http://img.site.org/b/imgboard.html | wget -q -r -l1 -O- -I
 '*res*' -A '*res*' --force-html -B http://img.site.org/b/ -i-
 
 The urls I would like to follow and output to the command line are of
 the form:
 
 http://img.site.org/b/res/97867797.html
 
 -A isn't useful here: it's applied only against the filename portion
 of the URL.
 
 -I is what you want; the trouble is that the * wildcard doesn't match
 slashes (there are plans to introduce a ** wildcard, probably in 1.13).
 So unfortunately you've got to do -I 'res,*/res,*/*/res' etc., as needed.
 

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: MAILING LIST IS MOVING: [EMAIL PROTECTED]

2008-11-01 Thread Micah Cowan

Maciej W. Rozycki wrote:
 On Fri, 31 Oct 2008, Micah Cowan wrote:
 
 I will ask the dotsrc.org folks to set up this mailing list as a
 forwarding alias to [EMAIL PROTECTED] (the reverse of recent history). At
 that time, no further mails will be sent to subscribers of this list.
 Please subscribe to [EMAIL PROTECTED] instead.

 At this time, I'm thinking of merging wget@sunsite.dk and
 [EMAIL PROTECTED]; there isn't really enough traffic to justify
 separate lists, IMO; and often discussions come up on submitted patches
 that are of interest to everyone.
 
  I am puzzled.  You mean you declare wget@sunsite.dk retired and 
 [EMAIL PROTECTED] is to be used from now on for the purpose of the former 
 list instead?  And [EMAIL PROTECTED] will most likely be retired 
 as well soon, with the replacement to be [EMAIL PROTECTED] as well?

Yup, that's what I mean.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


MAILING LIST IS MOVING: [EMAIL PROTECTED]

2008-10-31 Thread Micah Cowan

[EMAIL PROTECTED] is now back in business as a full-fledged mailing list,
and not just a forwarding alias to here. Please subscribe using the
interface at http://lists.gnu.org/mailman/listinfo/bug-wget/ at your
earliest convenience.

I had hoped to leave forwarding still enabled during the transition; I
subscribed wget@sunsite.dk but that did not seem to do the trick. So
mails at [EMAIL PROTECTED] will not show up here at the present time.

I will ask the dotsrc.org folks to set up this mailing list as a
forwarding alias to [EMAIL PROTECTED] (the reverse of recent history). At
that time, no further mails will be sent to subscribers of this list.
Please subscribe to [EMAIL PROTECTED] instead.

At this time, I'm thinking of merging wget@sunsite.dk and
[EMAIL PROTECTED]; there isn't really enough traffic to justify
separate lists, IMO; and often discussions come up on submitted patches
that are of interest to everyone.

Please avoid continued use of this list if possible. The gmane and
mail-archive.com sites will be asked to use the new list for archiving
purposes (and of course, bug-wget will also be archived via GNU's
pipermail setup).

Some of the reasons for this migration may be found at
http://article.gmane.org/gmane.comp.web.wget.general/8200/
In addition, people have recently been having difficulties with spam
blocking preventing their unsubscription(!), subscription, or even
contacting dotsrc.org staff about resolving subscription problems.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: MAILING LIST IS MOVING: [EMAIL PROTECTED]

2008-10-31 Thread Micah Cowan

Micah Cowan wrote:
 [EMAIL PROTECTED] is now back in business as a full-fledged mailing list,
 and not just a forwarding alias to here. Please subscribe using the
 interface at http://lists.gnu.org/mailman/listinfo/bug-wget/ at your
 earliest convenience.

Email interface: send an email to [EMAIL PROTECTED]

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: -m alias

2008-10-28 Thread Micah Cowan
Michelle Konzack wrote:
 ???  --  How can you post without being subscribed?  My posts all got
 definitively rejected when I tried to post to this list.

Strange. People are definitely posting to the list without having to be
subscribed.

However, folks have been known to be rejected as spam, even for
unsubscription requests. :\

I've been considering a move to gnu servers; but I'm not sure their spam
filters are better (though at least they wouldn't reject unsubscriptions
I think). But mostly, I'm not motivated enough to get off my lazy butt
yet. If we start having more serious problems, perhaps the motivation
will increase sufficiently...

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: wget re-download fully downloaded files

2008-10-27 Thread Micah Cowan

Maksim Ivanov wrote:
 I'm trying to download the same file from the same server; the command line
 I use:
 wget --debug -o log -c -t 0 --load-cookies=cookie_file
 http://rapidshare.com/files/153131390/Blind-Test.rar
 
 Below are attached 2 files: a log with 1.9.1 and a log with 1.10.2.
 Both logs were made when Blind-Test.rar was already on my HDD.
 Sorry for some mess in the logs, but Russian is used on my console.

Thanks very much for providing these, Maksim; they were very helpful.
(Sorry for getting back to you so late: it's been busy lately).

I've confirmed this behavioral difference (though I compared the current
development sources against 1.8.2, rather than 1.10.2 to 1.9.1). Your
logs involve a 302 redirection before arriving at the real file, but
that's just a red herring.

The difference is that when 1.9.1 encountered a server that responded
to a byte-range request with 200 (meaning it doesn't know how to send
partial contents), but with a Content-Length value matching the size of
the local file, it would close the connection and not proceed to
redownload. 1.10.2, on the other hand, would just re-download it.

Actually, I'll have to confirm this, but I think that current Wget will
re-download it, but not overwrite the current content, until it arrives
at some content corresponding to bytes beyond the current content.

I need to investigate further to see if this change was somehow
intentional (though I can't imagine what the reasoning would be); if I
don't find a good reason not to, I'll revert this behavior. Probably for
the 1.12 release, but I might possibly punt it to 1.13 on the grounds
that it's not a recent regression (however, it should really be a quick
fix, so most likely it'll be in for 1.12).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: --mirror and --cut-dirs=2 bug?

2008-10-27 Thread Micah Cowan

Brock Murch wrote:
 I try to keep a mirror of NASA ATTEPH ancillary data for MODIS processing. I
 know that means little, but I have a cron script that runs 2 times a day.
 Sometimes it works, and others, not so much. The sh script is listed at the
 end of this email below. As is the contents of the remote ftp server's root
 and portions of the log.
 
 I don't need all the data on the remote server, only some, thus I use
 --cut-dirs. To make matters stranger, the software (also from NASA) that uses
 these files looks for them in a single place on the client machine where the
 software runs, but needs data from 2 different directories on the remote ftp
 server. If the data is not on the client machine, the software kindly ftp's
 the files to the local directory. However, I don't allow write access to that
 directory as many people use the software and when it is d/l'ed it has the
 wrong perms for others to use it, thus I mirror the data I need from the ftp
 site locally. In the script below, there are 2 wget commands, but they are to
 slightly different directories (MODISA & MODIST).

I wouldn't recommend that. Using the same output directory for two
different source directories seems likely to lead to problems. You'd
most likely be better off by pulling to two locations, and then
combining them afterwards.
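
For instance, something along these lines (an untested sketch; the exact
paths here are hypothetical, not taken from your script):

  wget -m -np -nH --cut-dirs=2 -P atteph-modisa 'ftp://oceans.gsfc.nasa.gov/MODISA/ATTEPH/'
  wget -m -np -nH --cut-dirs=2 -P atteph-modist 'ftp://oceans.gsfc.nasa.gov/MODIST/ATTEPH/'
  rsync -a atteph-modisa/ atteph-modist/ /home1/software/modis/atteph/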

I don't know for sure that it _will_ cause problems (except if they
happen to have same-named files), as long as .listing files are being
properly removed (there were some recently-fixed bugs related to that, I
think? ...just appending new listings on top of existing files).

 It appears to me that the problem occurs if there is a ftp server error, and 
 wget starts a retry. wget goes to the server root, gets the .listing from 
 there for some reason (as opposed to the directory it should go to on the 
 server), and then goes to the dir it needs to mirror and can't find the files 
 (that are listed in the root dir) and creates dirs, and then I get "No such
 file" errors and recursive directories created. Any advice would be
 appreciated.

This snippet seems to be the source of the problem:

 Error in server response, closing control connection.
 Retrying.
 
 --14:53:53--  ftp://oceans.gsfc.nasa.gov/MODIST/ATTEPH/2002/110/
   (try: 2) => `/home1/software/modis/atteph/2002/110/.listing'
 Connecting to oceans.gsfc.nasa.gov|169.154.128.45|:21... connected.
 Logging in as anonymous ... Logged in!
 ==> SYST ... done.    ==> PWD ... done.
 ==> TYPE I ... done.  ==> CWD not required.
 ==> PASV ... done.    ==> LIST ... done.

That "CWD not required" bit is erroneous. I'm 90% sure we fixed this
issue recently (though I'm not 100% sure that it went to release: I
believe so).

I believe we made some related fixes more recently. You provided a great
amount of useful information, but one thing that seems to be missing (or
I missed it) is the Wget version number. Judging from the log, I'd say
it's 1.10.2 or older; the most recent version of Wget is 1.11.4; could
you please try to verify whether Wget continues to exhibit this problem
in the latest release version?

I'll also try to look into this as I have time (but it might be awhile
before I can give it some serious attention; it'd be very helpful if you
could do a little more legwork).

--
Thanks very much,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: --mirror and --cut-dirs=2 bug?

2008-10-27 Thread Micah Cowan

Micah Cowan wrote:
 I believe we made some related fixes more recently. You provided a great
 amount of useful information, but one thing that seems to be missing (or
 I missed it) is the Wget version number. Judging from the log, I'd say
 it's 1.10.2 or older; the most recent version of Wget is 1.11.4; could
 you please try to verify whether Wget continues to exhibit this problem
 in the latest release version?

This problem looks like the one that Mike Grant fixed in October of
2006: http://hg.addictivecode.org/wget/1.11/rev/161aa64e7e8f, so it
should definitely be fixed in 1.11.4. Please let me know if it isn't.

--
Regards,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


More on query matching [Re: Need Design Documents]

2008-10-27 Thread Micah Cowan

kalpana ravi wrote:
 Hi Everybody,

Hi kalpana,

You sent this message to me and [EMAIL PROTECTED]; you
wanted [EMAIL PROTECTED]

 My name is Kalpana Ravi. I am planning to contribute by adding one of the
 features listed in https://savannah.gnu.org/bugs/?22089. For that I need to
 know the design diagrams to understand better. Does anybody know where the
 UML diagrams are?

We don't have UML diagrams for wget: you'll just have to read the
sources (which, unfortunately, are messy). I have some rough-draft
diagrams of how I _want_ wget to look eventually, but I'm not done with
those, and anyway they wouldn't help you with wget now. Even if you had
the UML diagrams for the current state, you'd still need to understand
the sources; I really don't think they'd help you much.

More important than understanding the design is understanding what
needs to be done; we're still getting a grip on that. My current thought
is that there should be a --query-reject (and probably --query-accept,
though the former seems far more useful) that should be matched against
key/value pairs; thus, --query-reject 'foo=bar&action=edit' would reject
anything that has foo=bar and action=edit as key/value pairs in
the query string, even if they're not actually next to each other; an
example rejected URL might be
http://example.com/index.php?a=b&action=edit&token=blah&foo=bar&hergle.

Not all query strings are in the key=value format, so --query-reject
'abc1254' would be allowed, and match against the entire query string.
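
To make the proposal concrete (hypothetical syntax; no released Wget has
a --query-reject option):

  wget -r --query-reject 'foo=bar&action=edit' http://example.com/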

For an idea of how URL filename matching is currently done, you might check
out the acceptable() function in src/utils.c and the functions it calls, to
get an idea of how query matching might be implemented. However, I'll probably
tackle this bug myself pretty soon if no one else has managed it yet, as
I'm very interested in getting Wget 1.12 finished before long into the
new year (ideally, _before_ the new year, but that probably ain't gonna
happen).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: wget re-download fully downloaded files

2008-10-27 Thread Micah Cowan

Maksim Ivanov wrote:
 I'm trying to download the same file from the same server; the command line
 I use:
 wget --debug -o log -c -t 0 --load-cookies=cookie_file
 http://rapidshare.com/files/153131390/Blind-Test.rar
 
 Below are attached 2 files: a log with 1.9.1 and a log with 1.10.2.
 Both logs were made when Blind-Test.rar was already on my HDD.
 Sorry for some mess in the logs, but Russian is used on my console.

This is currently being tracked at https://savannah.gnu.org/bugs/?24662

A similar and related bug report is at
https://savannah.gnu.org/bugs/?24642 in which the logs show that
rapidshare.com also issues erroneous Content-Range information
when it responds with a 206 Partial Content, which exercised a different
regression* introduced in 1.11.x.

* It's not really a regression, since it's desirable behavior: we now
determine the size of the content from the content-range header, since
content-length is often missing or erroneous for partial content.
However, in this instance of server error, it resulted in less-desirable
behavior than the previous version of Wget. Anyway...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: re-mirror + no-clobber

2008-10-25 Thread Micah Cowan

Jonathan Elsas wrote:
...
 I've issued the command
 
 wget -nc -r -l inf -H -D www.example.com,www2.example.com
 http://www.example.com
 
 but, I get the message:
 
 
 file 'www.example.com/index.html' already there; not retrieving.
 
 
 and the process exits.   According to the man page, files with a .html
 suffix will be loaded off disk and parsed, but this does not appear to
 be happening.   Am I missing something?

Yes. It has to download the files before they can be loaded from the
disk and parsed. When it encounters a file at a given location, it
doesn't have any way to know that that file corresponds to the one it's
trying to download. Timestamping with -N may be more what you want,
rather than -nc?
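
For example, a minimal variant of your command using timestamping
instead of -nc (untested sketch):

  wget -N -r -l inf -H -D www.example.com,www2.example.com http://www.example.com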

I'm open to suggestions on clarifying the documentation.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: accept/reject rules based on querystring

2008-10-21 Thread Micah Cowan

Gustavo Ayala wrote:
 Any ideas about when this option (or an acceptable workaround) will be
 implemented?
 
 I need to include/exclude based on the querystring (with regular
 expressions, of course). The file name is not enough.

I consider it an important feature, and currently expect to implement it
for 1.12.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: A/R matching against query strings

2008-10-21 Thread Micah Cowan

I sent the following last month but didn't get any feedback. I'm trying
one more time. :)
-M

Micah Cowan wrote:
 On expanding current URI acc/rej matches to allow matching against query
 strings, I've been considering how we might enable/disable this
 functionality, with an eye toward backwards compatibility.
 
 It seems to me that one usable approach would be to require the ?
 query string to be an explicit part of the rule, if it's expected to be
 matched against query strings. So -A .htm,.gif,*Action=edit* would all
 result in matches against the filename portion only, but -A
 '\?*Action=edit*' would look for Action=edit within the query-string
 portion. (The '\?' is necessary because otherwise '?' is a wildcard
 character; [?] would also work.)
 
 The disadvantage of that technique is that it's harder to specify that a
 given string should be checked _anywhere_, regardless of whether it
 falls in the filename or query-string portion; but I can't think offhand
 of any realistic cases where that's actually useful. We could also
 supply a --match-queries option to turn on matching of wildcard rules
 for anywhere (non-wildcard suffix rules should still match only at the
 end of the filename portion).
 
 Another option is to use a separate -A-like option that does what -A
 does for filenames, but matches against query strings. I like this idea
 somewhat less.
 
 Thoughts?


Re: A/R matching against query strings

2008-10-21 Thread Micah Cowan

Tony Lewis wrote:
 Micah Cowan wrote:
 
 On expanding current URI acc/rej matches to allow matching against query
 strings, I've been considering how we might enable/disable this
 functionality, with an eye toward backwards compatibility.
 
 What about something like --match-type=TYPE (with accepted values of all,
 hash, path, search)?
 
 For the URL http://www.domain.com/path/to/name.html?a=true#content
 
  "all" would match against the entire string
  "hash" would match against "content"
  "path" would match against "path/to/name.html"
  "search" would match against "a=true"
 
 For backward compatibility the default should be --match-type=path.
 
 I thought about having host as an option, but that duplicates another
 option.

As does "path" (up to the final /).

Would "hash" really be useful, ever? It's never part of the request to
the server, so it's really more context to the URL than a real part of
the URL, as far as requests go. Perhaps that sort of thing could best
wait for when we allow custom URL-parsers/filters.

Also, I don't like the name "search" overly much, as that's a very
limited description of the much more general use of query strings.

But differentiating between three or more different match types tilts me
much more strongly toward some sort of shorthand, like the explicit need
for \?; with three types, perhaps we'd just use some special prefix for
patterns to indicate which sort of match we want (:q: for query strings,
:a: for all, or whatever), to save on prefixing each different type of
match with --match-type (or just using all for everything).

OTOH, regex support is easy enough to add to Wget, now that we're using
gnulib; we could just leave wildcards the way they are, and introduce
regexes that match everything. Then query strings are '\?.*foo=bar' (or,
for the really pedantic, '\?([^?]*)?foo=bar([^?]*)?$')

That last one, though, highlights how cumbersome it is to do proper
matching against typical HTML form-generated query strings (it's not
really even possible with wildcards). Perhaps a more appropriate
pattern-matcher specifically for query strings would be a good idea.
It's probably enough to do something like --query-reject='action=Edit', where
there's an implied '\?([^?]*)?' before, and '([^?]*)?$' after.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: wget re-download fully downloaded files

2008-10-12 Thread Micah Cowan

Maksim Ivanov wrote:
 Hello!
 
 Starting with version 1.10, wget has a very annoying bug: if you try to
 download an already fully downloaded file, wget begins to download it all
 over again, but 1.9.1 says "Nothing to do", as it should.

It all depends on what options you specify. That's as true for 1.9 as it
is for 1.10 (or the current release 1.11.4).

It can also depend on the server; not all of them support timestamping
or partial fetches.

Please post the minimal log that exhibits the problem you're experiencing.

--
Thanks,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Incorrect transformation of newline's symbols

2008-10-07 Thread Micah Cowan

Александр Вильнин wrote:
 Hello!
 
 I've noticed some posible mistake in ftp-basic.c.
 
 When I try to download a file from
 "ftp://www.delorie.com/pub/djgpp/current/" (in my case it was
 "ftp://www.delorie.com/pub/djgpp/current/FILES"), the server responds with
 error no. 550. But this file actually exists.
 I've used the
 (wget --verbose --debug --output-file=wget_djgpp_log
 --directory-prefix=djgpp "ftp://www.delorie.com/pub/djgpp/current/FILES")
 cygwin command to get this file.
 
 In the function ftp_request (ftp-basic.c), newline characters are
 substituted with ' ', but the ftp server doesn't understand such
 commands. The SIZE and RETR commands do not pass.
 I've inserted a debug log at the end of this message.

The problem isn't that newlines are substituted. Newlines and carriage
returns are simply not safe within FTP file names.

However, how did the newline get there in the first place? The real file
name itself doesn't have a newline in it. The logs clearly show that
Wget was passed a URL with a carriage return (not a newline) in it. This
strongly indicates that the shell you were using passed it that way to
Wget. Probably, the shell was given "\r\n" when you hit Enter to end
your command, and stripped away the "\n" but left the "\r", which it
passed to Wget.

The bug you are encountering is in your Cygwin+shell environment; you'll
have to look there. The only deficiency I'm seeing on Wget's part
from these logs is that it's calling \015 a "newline character", when
in fact the newline character is \012; it should say "line-ending
character" or some such.
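
As a quick way to check what your shell is actually passing along, you
can pipe the same typed text through od (an illustrative sketch only):

  printf '%s' ftp://www.delorie.com/pub/djgpp/current/FILES | od -c

If the shell is mangling the line ending, a stray \r will show up at the
end of the od output.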

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Support for file://

2008-09-22 Thread Micah Cowan

David wrote:
 
 Hi Micah,
 
 You're right - this was raised before and in fact it was a feature
 Mauro Tortonesi intended to be implemented for the 1.12 release, but it
 seems to have been forgotten somewhere along the line. I wrote to the
 list in 2006 describing what I consider a compelling reason to support
 file:// (file:///). Here is what I wrote then:
 
 At 03:45 PM 26/06/2006, David wrote:
 In replies to the post requesting support of the file:// scheme,
 requests were made for someone to provide a compelling reason to want to
 do this. Perhaps the following is such a reason.
 I have a CD with HTML content (it is a CD of abstracts from a scientific
 conference), however for space reasons not all the content was included
 on the CD - there remain links to figures and diagrams on a remote web
 site. I'd like to create an archive of the complete content locally by
 having wget retrieve everything and convert the links to point to the
 retrieved material. Thus the wget functionality when retrieving the
 local files should work the same as if the files were retrieved from a
 web server (i.e. the input local file needs to be processed, both local
 and remote content retrieved, and the copies made of the local and
 remote files all need to be adjusted to now refer to the local copy
 rather than the remote content). A simple shell script that runs cp or
 rsync on local files without any further processing would not achieve
 this aim.

Fair enough. This example at least makes sense to me. I suppose it can't
hurt to provide this, so long as we document clearly that it is not a
replacement for cp or rsync, and is never intended to be (won't handle
attributes and special file properties).

However, support for file:// will introduce security issues, so care is needed.

For instance, file:// should never be respected when it comes from the
web. Even on the local machine, it could be problematic to use it on
files writable by other users (as they can then craft links to download
privileged files with upgraded permissions). Perhaps files that are only
readable for root should always be skipped, or wget should require a
--force sort of option if the current mode can result in more
permissive settings on the downloaded file.

Perhaps it would be wise to make this a configurable option. It might
also be prudent to enable an option for file:// to be disallowed for root.

https://savannah.gnu.org/bugs/?24347

If any of you can think of additional security issues that will need
consideration, please add them in comments to the report.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Support for file://

2008-09-20 Thread Micah Cowan

Michelle Konzack wrote:
 Imagine you have a local mirror of your website and you want to know why
 the site @HOSTINGPROVIDER has some more files or such.
 
 You can spider the website @HOSTINGPROVIDER recursively into a local tmp1
 directory and then, with the same command line, you can do the same with
 the local mirror and download the files recursively into tmp2, and now
 you can make a recursive fs-diff and know which files are used... on
 both the local mirror and @HOSTINGPROVIDER.

I'm confused. If you can successfully download the files from
HOSTINGPROVIDER in the first place, then why would a difference exist?
And if you can't, then this wouldn't be an effective way to find out.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Big files

2008-09-16 Thread Micah Cowan

Cristián Serpell wrote:
 It is the latest Ubuntu distribution, which still comes with the old
 version.
 
 Thanks anyway, that was the problem.

I know that's untrue. Ubuntu comes with 1.10.2 at least, and has for
quite some time. If you're using that, then it's probably a different
bug than Doruk and Tony were thinking of (perhaps one of the cases of
content-length mishandling that were recently fixed in the 1.11.x series).

IIRC Intrepid Ibex (Ubuntu 8.10) will have 1.11.4.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Hiding passwords found in redirect URLs

2008-09-13 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Thomas Corthals wrote:
 Micah Cowan wrote:

 Note: Saint Xavier has already written a fix for this, so it's not
 actually a question of whether it's worth the bother, just whether it's
 actually desired behavior.
 
 Since it's desired in some situations but maybe not in others, the best
 solution would be to provide a switch for it that can be used in a
 user's .wgetrc and on the command line.

Well, yes, except I can't really imagine anyone ever _using_ such a
switch. Though I could envision people using the .wgetrc option. Still
seems like a lot of trouble to make a new option for such a little
thing. One could always use -nv in a pinch.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


A/R matching against query strings

2008-09-12 Thread Micah Cowan

On expanding current URI acc/rej matches to allow matching against query
strings, I've been considering how we might enable/disable this
functionality, with an eye toward backwards compatibility.

It seems to me that one usable approach would be to require the ?
query string to be an explicit part of the rule, if it's expected to be
matched against query strings. So -A .htm,.gif,*Action=edit* would all
result in matches against the filename portion only, but -A
'\?*Action=edit*' would look for Action=edit within the query-string
portion. (The '\?' is necessary because otherwise '?' is a wildcard
character; [?] would also work.)
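
Concretely, the contrast would be (hypothetical behavior; -A does not
currently match query strings at all):

  wget -r -A '.htm,.gif' http://example.com/          # filename-only, as today
  wget -r -A '\?*Action=edit*' http://example.com/    # would match the query string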

The disadvantage of that technique is that it's harder to specify that a
given string should be checked _anywhere_, regardless of whether it
falls in the filename or query-string portion; but I can't think offhand
of any realistic cases where that's actually useful. We could also
supply a --match-queries option to turn on matching of wildcard rules
for anywhere (non-wildcard suffix rules should still match only at the
end of the filename portion).

Another option is to use a separate -A-like option that does what -A
does for filenames, but matches against query strings. I like this idea
somewhat less.

Thoughts?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Hiding passwords found in redirect URLs

2008-09-12 Thread Micah Cowan

https://savannah.gnu.org/bugs/index.php?21089

The report originator is copied in the recipients list for this message.

The situation is as follows: the user types "wget
http://foo.com/file-i-want". Wget asks the HTTP server for the
appropriate file, and gets a 302 redirection to the URL
ftp://spag:[EMAIL PROTECTED]. Wget will then issue, to the log output, the line:

  Location: ftp://spag:[EMAIL PROTECTED]/mickie/file-you-want

with the password in plain view.

I'm uncertain that this is actually a problem. In this specific case,
it's a publicly-accessible URL redirecting to a password-protected file.
What's to hide, really?

Of course, the case gets more interesting when it's _not_ a
publicly-accessible URL. What about when the password is generated from
one the user supplied? That is, the original request was
http://spag:[EMAIL PROTECTED]/file-i-want, which resulted in a redirect
using the same username/password? Especially if it was an HTTPS request
rather than plain HTTP. A case could be made that it should be hidden in
that case.

On the other hand, in cases like the _original_ example given above,
I'd argue that hiding it could be the wrong thing: the user now has no
idea how to directly access the file, avoiding the redirect the next
time around.

Redirecting to a password-protected file on a different host or using a
different scheme seems broken to me in the first place, and I'm sorta
leaning towards not bothering about it. What are your thoughts, list?

Note: Saint Xavier has already written a fix for this, so it's not
actually a question of whether it's worth the bother, just whether it's
actually desired behavior.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Where is program_name?

2008-09-09 Thread Micah Cowan

Saint Xavier wrote:
 Hi,
 
 * Gisle Vanem ([EMAIL PROTECTED]) wrote:
 'program_name' is used in lib/error.c, but it is not allocated anywhere. 
 Should it be added to main.c and initialised to exec_name?
 
 $cd wget-mainline
 $find . -name '*.[ch]' -exec fgrep -H -n 'program_name' '{}' \;
 ./lib/error.c:63:# define program_name program_invocation_name
^^^
 ./lib/error.c:95:/* The calling program should define program_name and set it 
 to the
  ^^^

Looks to me like we're expected to supply it. Line 63 is only evaluated
when we're using glibc; otherwise, we need to provide it. The differing
name is probably so we can define it unconditionally.

It appears that lib/error.c isn't even _built_ on my system, perhaps
because glibc supplies what it would fill in. This makes testing a
little difficult. Anyway, see if this fixes your trouble:

diff -r 0c2e02c4f4f3 src/ChangeLog
--- a/src/ChangeLog Tue Sep 09 09:29:50 2008 -0700
+++ b/src/ChangeLog Tue Sep 09 09:40:00 2008 -0700
@@ -1,3 +1,7 @@
+2008-09-09  Micah Cowan  [EMAIL PROTECTED]
+
+   * main.c: Define program_name for lib/error.c.
+
 2008-09-02  Gisle Vanem  [EMAIL PROTECTED]

* mswindows.h: Must ensure stdio.h is included before
diff -r 0c2e02c4f4f3 src/main.c
--- a/src/main.c    Tue Sep 09 09:29:50 2008 -0700
+++ b/src/main.c    Tue Sep 09 09:40:00 2008 -0700
@@ -826,6 +826,8 @@
   exit (0);
 }

+char *program_name; /* Needed by lib/error.c. */
+
 int
 main (int argc, char **argv)
 {
@@ -833,6 +835,8 @@
   int i, ret, longindex;
   int nurl, status;
   bool append_to_log = false;
+
+  program_name = argv[0];

   i18n_initialize ();



--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
 On Tue, Sep 9, 2008 at 3:14 AM, Daniel Stenberg [EMAIL PROTECTED] wrote:
 On Mon, 8 Sep 2008, Donald Allen wrote:

 The page I get is what would be obtained if an un-logged-in user went to
 the specified url. Opening that same url in Firefox *does* correctly
 indicate that it is logged in as me and reflects my customizations.
 First, LiveHTTPHeaders is the Firefox plugin everyone who tries these stunts
 needs. Then you read the capture and replay it as closely as possible using
 your tool.

 As you will find out, sites like this use all sorts of funny tricks to
 figure you out and to make it hard to automate what you're trying to do.
 They tend to use javascripts for redirects and for fiddling with cookies
 just to make sure you have a javascript and cookie enabled browser. So you
 need to work hard(er) when trying this with non-browsers.

 It's certainly still possible, even without using the browser to get the
 first cookie file. But it may take some effort.
 
 I have not been able to retrieve a page with wget as if I were logged
 in using --load-cookies and Micah's suggestion about 'Accept-Encoding'
 (there was a typo in his message -- it's 'Accept-Encoding', not
 'Accept-Encodings'). I did install livehttpheaders and tried
 --no-cookies and --header cookie info from livehttpheaders and that
 did work.

That's how I did it as well (except I got the headers from tcpdump); I'm
using Firefox 3, so don't have access to FF's new sqlite-based cookies
file (apart from the patch at
http://wget.addictivecode.org/FrontPage?action=AttachFile&do=view&target=wget-firefox3-cookie.patch).

 Some of the cookie info sent by Firefox was a mystery,
 because it's not in the cookie file. Perhaps that's the crucial
 difference -- I'm speculating that wget isn't sending quite the same
 thing as Firefox when --load-cookies is used, because Firefox is
 adding stuff that isn't in the cookie file. Just a guess.

Probably there are session cookies involved, that are sent in the first
page, that you're not sending back with the form submit.
--keep-session-cookies and --save-cookies=foo.txt make a good
combination.
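
In other words, something like the following (the login URL and form
field names here are hypothetical placeholders):

  wget --keep-session-cookies --save-cookies=cookies.txt \
    --post-data 'login=USER&passwd=PASS' 'https://login.yahoo.com/' -O /dev/null
  wget --load-cookies=cookies.txt 'http://my.yahoo.com/' -O yahoo.htm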

 Is there a
 way to ask wget to print the headers it sends (ala livehttpheaders)?
 I've looked through the options on the man page and didn't see
 anything, though I might have missed it.

--debug
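
For example (a sketch; the request headers appear in --debug's log
output between "---request begin---" and "---request end---" markers):

  wget --debug -O /dev/null 'http://example.com/' 2>&1 | \
    sed -n '/---request begin---/,/---request end---/p'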

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
 The result of this test, just to be clear, was a page that indicated
 yahoo thought I was not logged in. Those extra items firefox is sending
 appear to be the difference, because I included them (from the
 livehttpheaders output) when I tried sending the cookies manually with
 --header, I got the same page back with wget that indicated that yahoo
 knew I was logged in and formatted the page with my preferences.

Perhaps you missed this in my last message:

 Probably there are session cookies involved, that are sent in the first
 page, that you're not sending back with the form submit.
 --keep-session-cookies and --save-cookies=foo.txt make a good
 combination.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
 I am doing the yahoo session login with firefox, not with wget, so I'm
 using the first and easier of your two suggested methods. I'm guessing
 you are thinking that I'm trying to login to the yahoo session with
 wget, and thus --keep-session-cookies and --save-cookies=foo.txt would
 make perfect sense to me, but that's not what I'm doing (yet -- if I'm
 right about what's happening here, I'm going to have to resort to this).
 But using firefox to initiate the session, it looks to me like wget
 never gets to see the session cookies because I don't think firefox
 writes them to its cookie file (which actually makes sense -- if they
 only need to live as long as the session, why write them out?).

Yes, and I understood this; the thing is, that if session cookies are
involved (i.e., cookies that are marked for immediate expiration and are
not meant to be saved to the cookies file), then I don't see how you
have much choice other than to use the harder method, or else to fake
the session cookies by manually inserting them to your cookies file or
whatnot (not sure how well that may be expected to work). Or, yeah, add
an explicit --header 'Cookie: ...'.
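
That last approach, which you've already had work with the
livehttpheaders data, looks like this in general (the cookie names and
values are placeholders):

  wget --no-cookies --header 'Cookie: NAME=VALUE; NAME2=VALUE2' 'http://my.yahoo.com/'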

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-09 Thread Micah Cowan

Donald Allen wrote:
 
 
 On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED] wrote:
 
 Donald Allen wrote:
 I am doing the yahoo session login with firefox, not with wget,
 so I'm
 using the first and easier of your two suggested methods. I'm
 guessing
 you are thinking that I'm trying to login to the yahoo session with
 wget, and thus --keep-session-cookies and
 --save-cookies=foo.txt would
 make perfect sense to me, but that's not what I'm doing (yet --
 if I'm
 right about what's happening here, I'm going to have to resort to
 this).
 But using firefox to initiate the session, it looks to me like wget
 never gets to see the session cookies because I don't think firefox
 writes them to its cookie file (which actually makes sense -- if they
 only need to live as long as the session, why write them out?).
 
 Yes, and I understood this; the thing is, that if session cookies are
 involved (i.e., cookies that are marked for immediate expiration and are
 not meant to be saved to the cookies file), then I don't see how you
 have much choice other than to use the harder method, or else to fake
 the session cookies by manually inserting them to your cookies file or
 whatnot (not sure how well that may be expected to work). Or, yeah, add
 an explicit --header 'Cookie: ...'.
 
 
 Ah, the misunderstanding was that the stuff you thought I missed was
 intended to push me in the direction of Plan B -- log in to yahoo with
 wget.

Yes; and that's entirely my fault, as I didn't explicitly say that.

 I understand now. I'll look at trying to make this work. Thanks
 for all the help, though I can't guarantee that you are done yet :-)
 But, hopefully, this exchange will benefit others.

I was actually surprised you kept going after I pointed out that it
required the Accept-Encoding header that results in gzipped content.
This behavior is a little surprising to me from Yahoo!. It's not
surprising in _general_, but for a site that really wants to be as
accessible as possible (I would think?), insisting on the latest
browsers seems ill-advised.

Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape,
visit a site, and get a server-generated page that's empty other than
the phrase "You're not using Internet Explorer". :p

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Hello, All and bug #21793

2008-09-08 Thread Micah Cowan

David Coon wrote:
 Hello everyone,
 
 I thought I'd introduce myself to you all, as I intend to start helping
 out with wget.  This will be my first time contributing to any kind of
 free or open source software, so I may have some basic questions down
 the line about best practices and such, though I'll try to keep that to
 a minimum.
 
 Anyway, I've been researching Unicode and UTF-8 recently, so I'm gonna
 try to tackle bug #21793 (https://savannah.gnu.org/bugs/?21793).

Hi David, and welcome!

If you haven't already, please see
http://wget.addictivecode.org/HelpingWithWget

I'd encourage you to get a Savannah account, so I can assign that bug to
you. Also, I tend to hang out quite a bit on IRC (#wget @
irc.freenode.net), so you might want to sign on there.

Since you mentioned an interest in Unicode and UTF-8, you might want to
check out Saint Xavier's recent work on IRI and iDNS support in Wget,
which is available at http://hg.addictivecode.org/wget/sxav/.

Among other things, sxav's additions make Wget more aware of the user's
locale, so it might be useful for providing a feature to automatically
transcode filenames to the user's locale, rather than just supporting
UTF-8 only (which should still probably remain an explicit option). If
that sounds like the direction you'd like to take it, you should
probably base your work on sxav's repository, rather than mainline.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget and Yahoo login?

2008-09-08 Thread Micah Cowan

Donald Allen wrote:
 There was a recent discussion concerning using wget to obtain pages
 from yahoo logged into yahoo as a particular user. Micah replied to
 Rick Nakroshis with instructions describing two methods for doing
 this. This information has also been added by Micah to the wiki.
 
 I just tried the simpler of the two methods -- logging into yahoo with
 my browser (Firefox 2.0.0.16) and then downloading a page with
 
 wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies my home
 directory/.mozilla/firefox/id2dmo7r.default/cookies.txt
 'http://yahoo url'
 
 The page I get is what would be obtained if an un-logged-in user went
 to the specified url. Opening that same url in Firefox *does*
 correctly indicate that it is logged in as me and reflects my
 customizations.

Are you signing into the main Yahoo! site?

When I try to do so, whether I use the cookies or not, I get a message
about updating my browser to something more modern, or the like. The
difference appears to be a combination of _both_ User-Agent (as you've
done), _and_ --header 'Accept-Encodings: gzip,deflate'. This plus
appropriate cookies gets me a decent logged-in page, but of course it's
gzip-compressed.

Since Wget doesn't currently support gzip-decoding and the like, that
makes the use of Wget in this situation cumbersome. Support for
something like this probably won't be seen until 1.13 or 1.14, I'm afraid.
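
In the meantime, a rough workaround is to save the compressed body and
decompress it by hand (a sketch; cookies and headers as discussed above):

  wget --load-cookies=cookies.txt -U 'Mozilla/5.0' \
    --header 'Accept-Encoding: gzip,deflate' -O yahoo.htm.gz 'http://my.yahoo.com/'
  gunzip yahoo.htm.gz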

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: [wget-notify] add a new option

2008-09-02 Thread Micah Cowan

houda hocine wrote:
  Hi,

Hi houda.

This message was sent to wget-notify, which is not the proper
forum. Wget-notify is reserved for bug-change and (previously) commit
notifications, and is not intended for discussion (though I obviously
haven't blocked discussions; the original intent was to be able to
discuss commits, but I'm not sure I need to allow discussions any more,
so they may be disallowed soon).

The appropriate list would be wget@sunsite.dk, to which this discussion
has been redirected.

 we created a new format for archiving (.warc), and we want to ensure
 that wget can generate this format directly from the input url.
 Can you help me with some ideas for achieving this new option?
 The format is (warc -wget url).
 I am in the process of trying to understand the source code to add this
 new option.  Which .c file allows me to do this?

Doing this is not likely to be a trivial undertaking: the current
file-output interface isn't really abstracted enough to allow this, so
basically you'll need to modify most of the existing .c files. We are
hoping at some future point to allow for a more generic output format,
for direct output to (for instance) tarballs and .mhtml archives. At
that point, it'd probably be fairly easy to write extensions to do what
you want.

In the meantime, though, it'll be a pain in the butt. I can't really
offer much help; the best way to understand the source is to read and
explore it. However, on the general topic of adding new options to Wget,
Tony Lewis has written the excellent guide at
http://wget.addictivecode.org/OptionsHowto. Hope that helps!

Please note that I won't likely be entertaining patches to Wget to make
it output to non-mainstream archive formats, and even once generic
output mechanisms are supported, the mainstream archive formats will
most likely be supported as extension plugins or similar, and not as
built-in support within Wget.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Checking out Wget

2008-09-02 Thread Micah Cowan

vinothkumar raman wrote:
 Hi all,
 
 I need to check out the complete source onto my local hard disk. I am using
 WinCVS; when I searched for the module, it said that there is no module
 information out there. Could anyone help me out? I am a complete novice in
 this regard.

WinCVS won't work, because there _is_ in fact no CVS module for Wget.
Wget uses Mercurial as the source repository (and was using Subversion
prior to that). For more information about the Wget source repository
and its use, see http://wget.addictivecode.org/RepositoryAccess
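
With the command-line client, getting the sources comes down to a single
clone; something like this (the mainline URL here is inferred from the
branch URLs mentioned elsewhere on this list; see RepositoryAccess for
the authoritative instructions):

  hg clone http://hg.addictivecode.org/wget/mainline wget-mainline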

That page focuses on using the hg command-line tool; you may prefer to
use TortoiseHg instead, http://tortoisehg.sourceforge.net/. The page
does offer additional information about the repository and what is
required to build from those sources.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: [BUG:#20329] If-Modified-Since support

2008-09-02 Thread Micah Cowan

vinothkumar raman wrote:
 We need to give out the timestamp of the local file in the request
 header; for that we need to pass on the local file's timestamp from
 http_loop() to get_http(). The only way to pass this on without
 altering the signature of the function is to add a field to struct url
 in url.h.
 
 Could we go for it?

That is acceptable.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIvb5B7M8hyUobTrERAv2YAJ0ajYx+pynFLtV2YmEw7fA+vwf8ugCfSaU1
AFkIYSyyyS4egbyXjzBLXBo=
=fIT5
-END PGP SIGNATURE-


Re: [bug #20329] Make HTTP timestamping use If-Modified-Since

2008-09-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Yes, that's what it means.

I'm not yet committed to doing this. I'd like to see first how many
mainstream servers will respect If-Modified-Since when given as part of
an HTTP/1.0 request (in comparison to how they respond when it's part of
an HTTP/1.1 request). If common servers ignore it in HTTP/1.0, but not
in HTTP/1.1, that'd be an excellent case for holding off until we're
doing HTTP/1.1 requests.

Also, I don't think removing the previous HEAD request code is
entirely accurate: we probably would want to detect when a server is
feeding us non-new content in response to If-Modified-Since, and adjust
to use the current HEAD method instead as a fallback.
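
A quick way to eyeball a given server's behavior by hand (a sketch; HOST
is a placeholder and the date is arbitrary):

  $ printf 'GET / HTTP/1.0\r\nHost: HOST\r\n%s\r\n\r\n' \
      'If-Modified-Since: Sat, 29 Dec 2007 00:00:00 GMT' \
      | nc HOST 80 | head -n1

Repeat with HTTP/1.1 (plus a Connection: close header); a 304 status
line from one request form but not the other would make the case.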

- -Micah

vinothkumar raman wrote:
 This means we should remove the previous HEAD request code, use
 If-Modified-Since by default, have it handle all the requests, and
 store pages when the server does not return a 304 response.
 
 Is that so?
 
 
 On Fri, Aug 29, 2008 at 11:06 PM, Micah Cowan [EMAIL PROTECTED] wrote:
 Follow-up Comment #4, bug #20329 (project wget):

 verbatim-mode's not all that readable.

 The gist is, we should go ahead and use If-Modified-Since, perhaps even now
 before there's true HTTP/1.1 support (provided it works in a reasonable
 percentage of cases); and just ensure that any Last-Modified header is sane.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIvb7t7M8hyUobTrERAsvQAJ4k7fKrsFtfC4MQtuvE3Ouwz6LseACePqt2
8JiRBKtEhmcK3schVVO347A=
=yCJV
-END PGP SIGNATURE-


Re: Support for file://

2008-09-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Petri Koistinen wrote:
 Hi,
 
 I would be nice if wget would also support file://.

Feel free to file an issue for this (I'll mark it Needs Discussion and
set at low priority). I'd thought there was already an issue for this,
but can't find it (either open or closed). I know this has come up
before, at least.

I think I'd need some convincing on this, as well as a clear definition
of what the scope for such a feature ought to be. Unlike curl, which
groks urls, Wget W(eb)-gets, and file:// can't really be argued to
be part of the web.

That in and of itself isn't really a reason not to support it, but my
real misgivings have to do with the existence of various excellent tools
that already do local-file transfers, and likely do it _much_ better
than Wget could hope to. Rsync springs readily to mind.

Even the system cp command is likely to handle things much better than
Wget. In particular, special OS-specific, extended file attributes,
extended permissions and the like, are among the things that existing
system tools probably handle quite well, and that Wget is unlikely to. I
don't really want Wget to be in the business of duplicating the system
cp command, but I might conceivably not mind file:// support if it
means simple _content_ transfer, and not actual file duplication.
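
For instance, faithful local file duplication is already a solved
problem (a couple of illustrative commands, not Wget functionality):

  $ rsync -a /some/dir/ copy/   # preserves times, permissions, etc.
  $ cp -a /some/dir copy        # archive-mode copy

That's exactly the territory I'd rather Wget stayed out of.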

Also in need of addressing is what recursion should mean for file://.
Between ftp:// and http://, recursion currently means different
things. In FTP, it means traverse the file hierarchy recursively,
whereas in HTTP it means traverse links recursively. I'm guessing
file:// should work like FTP (i.e., recurse when the path is a
directory, ignore HTML-ness), but anyway this is something that'd need
answering.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIvcLq7M8hyUobTrERAl6YAJ9xeTINVkuvl8HkElYlQt7dAsUfHACfXRT3
lNR++Q0XMkcY4c6dZu0+gi4=
=mKqj
-END PGP SIGNATURE-


Re: How to debug wget ?

2008-09-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jinhui Li wrote:
 I am browsing the source code. And want to debug it to figure out how it
 works.
 
 So, somebody please tell me how to debug ( with GDB ) or where can I
 find information that I need.

IMO, GDB is a great tool for diagnosing a particular problem one
encounters with a program; it's not all that terribly useful for
actually understanding the code itself, though. I find it much quicker
to read through the code using a powerful viewer or editor, and making
use of tools such as cscope and ctags. The best editors, such as Vim and
Emacs, are integrated these tools, and so a simple control-click or key
combination can bring up the definition of the function being called or
the variable being referenced, or (in the case of cscope) the list of
places where a particular function is being called, etc.
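
For example, from the top of the Wget source tree (a sketch; both tools
have many more options):

  $ ctags -R src/   # builds a tags file for Vim's ^] or Emacs's M-.
  $ cscope -Rbq     # builds the cscope cross-reference database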

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIvcPD7M8hyUobTrERAsCEAJ9oQDJWzD/OPAvzvgJorlByd4YqyACfdLM1
GmQUVu/xnQ7HOr493hiWG28=
=0XwB
-END PGP SIGNATURE-


Corrections to earlier discussion

2008-08-29 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi sxav,

So, 3.2 is the wrong section to pull from: what we already _have_ are
IRIs; we're converting them to URIs. So, section 3.1 applies, not 3.2.

The two-step process described by section 3.1 does not allow
already-percent-encoded values to be transformed: only international
characters will be percent-encoded.

In particular, this means that you will not need to distinguish whether
a percent-encoded sequence represents a valid UTF-8 character: all
percent-encoded sequences should be passed through to the resulting URI as
they appeared originally.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIuGsh7M8hyUobTrERAtSYAJwKZDeb7pCQWq0+XAJNcCZ4Ay0qmACfX3ia
ERSpkhiiQsLJ8SdqUSktZLQ=
=rF5p
-END PGP SIGNATURE-


Re: Corrections to earlier discussion

2008-08-29 Thread Micah Cowan
Micah Cowan wrote:
 Hi sxav,

Er, yeah, that had been meant to go to [EMAIL PROTECTED], not
[EMAIL PROTECTED] Whoopsy! :)

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: Wget function

2008-08-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

karlito wrote:
 
 
 Hello,
  
 First of all, I would like to thank you for your great tool.
  
 I have a request.
  
 I use this command to save a url with absolute links, and it works very
 well:
  
 wget -k http://www.google.fr/
  
 but I want to save this file under a name other than index.html, for
 example google-is-good.html.
  
 I have tried this:
  
 wget -k --output-document=google-is-good.html http://www.google.fr/
  
 It works, except that I lose the absolute links, and that's terrible.

Yeah. Conversions won't work with --output-document, which behaves
rather like a shell redirection.

 I don't know how to fix this problem; which combination do I have to
 use to make wget -k work with another name?

You could always rename it afterwards.
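
For example:

  $ wget -k http://www.google.fr/
  $ mv index.html google-is-good.html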

In your specific case, the current development sources (which will
become Wget 1.12) have a --default-page=google-is-good.html option for
specifying the default page name, thanks to Joao Ferreira. It's not yet
available in any release.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIsv3N7M8hyUobTrERAskoAJ4lHZK+VEBWYuFzOtbd57wEEvYm0wCdEVSK
el6v3e0TkKpQtOG2b5ZiHcI=
=/+sB
-END PGP SIGNATURE-


Re: WGET :: [Correction de texte]

2008-08-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tom wrote:
 Téléchargement récursif:
   -r,  --recursive  spécifer un téléchargement récursif.
   -l,  --level=NOMBRE   _*profondeeur*_ maximale de récursion (inf
 ou 0 pour infini).
 
  Just one "e" to remove from "profondeeur", and it will be fixed!

This issue appears to have been fixed with the latest French
translation. It will be released with Wget 1.12.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIswBE7M8hyUobTrERAufeAKCIl4ghMvo2JolNfsSAYCTd92v9OwCfS89O
iT3urRXKctZuucXnOn9tGLc=
=v5SC
-END PGP SIGNATURE-


Re: [wish] quiet operation yet displaying the progress

2008-08-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Maciej Pilichowski wrote:
 Hello,
 
   I usually call wget from my script in entirely quiet mode; however, it
 would be useful if wget could still show the progress -- currently
 wget either shows a lot of information (and progress) or does not
 show anything. In short, something like this:
 --progress=bar -q -nv
 is understood as
 -q -nv
 
   Please treat such arguments (the former example) as stating "show
 only progress and nothing else".

-q -nv is a nonsensical combination; the two options say contradictory
things. One says to emit only a little output; the other says to emit no
output at all.

A progress bar for -nv has already been requested, and is tracked at

https://savannah.gnu.org/bugs/index.php?22448

I don't mind putting this into 1.12 if someone wants to write the patch;
otherwise, I probably won't get to it for some time. I've got some
doubts as to whether -nv --progress=bar is the right way to achieve
this: is that the behavior we want if the user specified progress=bar
in their wgetrc file and then gave the -nv command-line option? Then
again, who puts progress=bar in their wgetrc?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIswSO7M8hyUobTrERAnF7AKCFvdBemlyNzH8aq+QcsdOCFOfAKwCdHBft
WADc3rYLGJXpYfgDr/sKS4Q=
=gxKn
-END PGP SIGNATURE-


Re: Wget function

2008-08-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Please keep the list in the replies.

karlito wrote:
 hi, thank you for the reply. Can my problem be fixed in the next version?
  
 Because it's for a batch job:
  
 I have more than 1000 urls to process, so that is why I need to find a
 solution.
  
 Also, when you say rename,
  
 what is the function to rename with wget?

I mean, just use the mv or rename command on your operating system.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIswfR7M8hyUobTrERAubkAJ0VL2UPnNQtD27waPVwFkeUwbUp9wCfXerh
dZBr4e7ZBKcEE5Kzrjv1mi8=
=GoKL
-END PGP SIGNATURE-


Re: wget and wiki crawling

2008-08-22 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

asm c wrote:
 I've recently been using wget, and got it working for the most part, but
 there's one issue that's really been bugging me. One of the parameters I
 use is '-R *action=*,*oldid=*' (side note on the platform: ZSH on
 NetBSD on the SDF public access unix system, although I've also used it
 on windows with the same result). The purpose of this parameter is so
 that, when wget crawls a mid-sized wiki I'd like to have a local copy
 of, it doesn't bother with all the history pages, edit pages, and so
 forth. Not downloading these would save me an enormous amount of time.
 Unfortunately, the parameter is ignored until after the php page is
 downloaded. So, because it waits until it's downloaded to delete it,
 using the param doesn't really help at all.
 
 Does anyone know how I can stop wget from even downloading matching pages?

Well, you don't mention it, but I'll assume that those patterns occur in
the query string portion of the URL: that is, they follow a question
mark (?) that appears at some point.

Unfortunately, the -R and -A options only apply to the filename
portion of the URL: that is, whatever falls between the last slash (/)
and the first question mark. Confusingly, it is also then
applied _after_ files are downloaded, to determine whether they should
be deleted after the fact: so Wget probably downloads those files you
really wish it wouldn't, and then deletes them afterwards anyway.

Worse, there's no way around this, currently. This is part of a suite of
problems that are currently slated to be addressed soon. The most
pertinent to your problem, though, is the need for a way to match
against query strings. I'm very much hoping to get around to this before
the next major Wget release, version 1.12. It's being tracked here:

https://savannah.gnu.org/bugs/index.php?22089

If you add yourself to the Cc list, you'll be able to follow along on
its progress.
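
In the meantime, a crude client-side pre-filter (outside of Wget itself,
in the spirit of the two-wget pipelines that show up on this list) can
at least avoid fetching those pages. A sketch, assuming the wiki's links
appear as double-quoted href attributes; $page and $base are placeholders:

  $ wget -q -O- "$page" \
      | grep -Eo 'href="[^"]*"' | sed 's/^href="//;s/"$//' \
      | grep -Ev 'action=|oldid=' \
      | wget -q -x -B "$base" -i-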

- --
Cheers!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIr55d7M8hyUobTrERAu4KAJsHmDTZ46ioEGOTprdE/aTGrj853QCfet84
+c+npJnPwC/86/rLpn5rB8s=
=abdv
-END PGP SIGNATURE-


Re: Wget and Yahoo login?

2008-08-21 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
 Micah Cowan wrote:
 
 The easiest way to do what you want may be to log in using your browser,
 and then tell Wget to use the cookies from your browser, using
 
 Given the frequency of the "login and then download a file" use case, it
 should probably be documented on the wiki. (Perhaps it already is. :-)

Yeah, at
http://wget.addictivecode.org/FrequentlyAskedQuestions#password-protected

I think you missed the final sentence of my how-to:

 (I'm going to put this up on the Wgiki Faq now, at
 http://wget.addictivecode.org/FrequentlyAskedQuestions)

:)

(Back to you:)
 Also, it would probably be helpful to have a shell script to automate this.

I filed the following issue some time ago:
https://savannah.gnu.org/bugs/index.php?22561

The report is low on details; but I was envisioning something that would
spew out forms and their fields, accept values for fields in one form,
and invoke the appropriate Wget command to do the submission.

I don't know if it could be _completely_ automated, since it's not 100%
possible for the script to know which form fields are the ones it should
be filling out.

OTOH, there are some damn good heuristics that could be done: I imagine
that the right form (in the event of more than one) can usually be
guessed by seeing which one has a password-type input (assuming
there's also only one of those). If that form has only one text-type
input, then we've found the username field as well. Name-based
heuristics (with pass, user, uname, login, etc) could also help.
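
As a taste of how little it takes to get started, here's a crude field
lister (sketch only, with $url being the page holding the form; a real
tool should reuse a proper HTML parser, as noted below):

  $ wget -q -O- "$url" | tr '<' '\n' | grep -Ei '^(form|input)[ >]'

which dumps each form and input tag on a line of its own for inspection.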

If someone wants to do this, that'd be terrific. Could probably reuse
the existing HTML parser code from Wget. Otherwise, it'd probably be a
while before I could get to it, since I've got higher priorities that
have been languishing.

Such a tool might also be an appropriate place to add FF3 sqlite
cookies support.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIrb0s7M8hyUobTrERAlVXAJ9YnAM7JiQrxrB/KclA1FXDnoVswgCdGO7t
Vaa98nhNRuEY4aLMx2BFXm0=
=ScoA
-END PGP SIGNATURE-


Upcoming Wget releases, issue reorganizations

2008-08-21 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

In Savannah, the name of the field value for Planned Release that was
previously 1.13 has just been renamed 1.14, and a new 1.13 target
has been added. I'll be moving some items currently targeted at 1.12 to
1.13, and some items that have just been moved to 1.14 will get moved to
the new 1.13 target.

If you have bookmarks to the 1.13 set of bugs in Savannah, that link now
goes to 1.14.

I've been very happy with the progress and improvements that have been
made to Wget over the last several months. My own productivity, though,
especially in the last couple of months, was somewhat less than I'd
hoped it would be. In particular, taking on co-maintainer
responsibilities with GNU Screen, and a brief hiatus to write GNU Teseq
(a program to aid in debugging Screen), ate up quite a bit of time. I
believe I'm close to stabilizing the balance between my work on Screen
and my work on Wget, but I'm behind where I wanted to be.

In the meantime, we've already got several really terrifically useful
features in the current tree, whose release I'd prefer not to hold back
longer than necessary. I may choose to punt some of the improvements I'd
been planning on Content-Disposition funkiness and such, and code
cleanup, and a bunch of small but not crucial fixes, and really anything
else that looks like it might prevent us from releasing near the turn of
the year.

Steven Schweda's copyright assignment is in for his nice batch of
changes for better VMS build-support and myriad FTP-related fixes; I
need to sift through a lot of that to see what we can pull in as-is and
what I want to adjust somewhat. I'm hoping to get as much of that in for
1.12 as possible - particularly the FTP adjustments, but may need to
punt some of it, even important bugfix pieces, until after the 1.12
release. If that's the case, though, I will ensure that
http://hg.addictivecode.org/wget/schweda/vms/ is kept up-to-date with
mainline, so that it will be essentially functional as a
1.12-plus-Schweda's-changes.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIrfU07M8hyUobTrERAhG7AJ9bv2Q0vetKEcDhfPz2CEQEt+2b3gCeP207
0pu6CNB0sWrsbZqDaWZ7ddA=
=0ObC
-END PGP SIGNATURE-


Congratulations, GSoC students!

2008-08-18 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Well, today was the final pencils down day for the Google Summer of
Code program.

It's been a great (and quick!) couple of months, and I'm excited by the
results. Saint Xavier and Julien Buty have done great work on IRI/IDN
support and better HTTP Authentication support. The international stuff
will probably be merged into Wget quite soon; the HTTP Authentication
project will be continuing for probably the next couple of months, and
Julien Buty has enthusiastically volunteered to continue working on it
beyond the GSoC program.

I really, really appreciate the work that you've done, and hope that
you've gained some valuable experience as well (or, at least, a couple
of good lines for your CV :) ).

Great job, guys! If either of you ever need a recommendation or a
reference, don't hesitate to ask.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIqdWg7M8hyUobTrERAt3uAJ92Kh7oSLzVffj5Aaay2xNeOQZbdgCfShKo
tIaIz+hlnwP/+2pWQS1e0h8=
=BV8L
-END PGP SIGNATURE-


Re: AW: AW: AW: Problem mirroring a site using ftp over proxy

2008-08-12 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Juon, Stefan wrote:
 Well, here is the index.html (I'm not sure whether it is also accessible
 in the mailing list, as I sent it as an attachment?)

Sorry, I somehow failed to notice this post. :\

The index.html file that the proxy generated is invalid. Apparently it
wants to tack on ^M (carriage return, \r) after every filename, as a
literal part of the link. It looks like Wget doesn't even acknowledge
links like that; but even if it did, it'd send a request to the proxy like:

  GET /CommonUpdater/avvdat-.zip%0D

rather than

  GET /CommonUpdater/avvdat-.zip

so it would still most likely fail to get a real file (though it _might_
work, if the proxy and/or the FTP server are a little sloppy).

One likely explanation for this, seems to me, is that the proxy gets
back the LIST response like:

  foo CR LF
  bar CR LF

and removes the LFs while leaving in the CR, and spitting them out as
part of the link. That's really poor behavior, considering that FTP
servers _ought_ to send CR LF (and not bare LF), as it's supposed to use
telnet conventions.
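
(If you need to salvage such a listing by hand, stripping the stray CRs
before reusing it is trivial -- e.g.:

  $ tr -d '\r' < index.html > index.fixed.html

but really, the proxy ought to be fixed.)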

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIohiL7M8hyUobTrERApkmAJ9Ia9yvahBPtp0aJDZehKciEMc3vQCgjXSC
T9DYFPDUxtBEx6HvOnwBzos=
=MAXZ
-END PGP SIGNATURE-


Re: WGET :: [Correction de texte]

2008-08-11 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Saint Xavier wrote:
 * Tom ([EMAIL PROTECTED]) wrote:
 Hello!
 
 hello,
 
 I'd like to let you know about a key that apparently stayed pressed a
 quarter of a second too long!
 ...
 Téléchargement récursif:
   -r,  --recursive  spécifer un téléchargement récursif.
   -l,  --level=NOMBRE   *profondeeur* maximale de récursion (inf ou 0
 Just one "e" to remove from "profondeeur", and it will be fixed!
 
 Indeed, thanks!
 
 Micah, instead of profondeeur it should be profondeur.
 Where do you forward that info, French GNU translation team ?
 (./po/fr.po around line 1472)

Yup. The mailing address for the French translation team is at
[EMAIL PROTECTED] The team page is
http://translationproject.org/team/fr.html; other translation teams are
listed at http://translationproject.org/team/index.html

Looks like it's still present in the latest fr.po file at
http://translationproject.org/latest/wget/fr.po

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIoIl77M8hyUobTrERApRkAJsGUybOJEDvYidFXc9OWLJ7gIX66QCeL8we
UsjynplN9Um1gmmWUcyZMbU=
=lqbw
-END PGP SIGNATURE-


Re: Wget and Yahoo login?

2008-08-10 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Rick Nakroshis wrote:
 Micah,
 
 If you will excuse a quick question about Wget, I'm trying to find out
 if I can use it to download a page from Yahoo that requires me to be
 logged in using my Yahoo profile name and password.  It's a display of a
 CSV file, and the only wrinkle is trying to get past the Yahoo login.
 
 Try as I may, I just can't seem to find anything about Wget and Yahoo. 
 Any suggestions or pointers?

Hi Rick,

In the future, it's better if you post questions to the mailing list at
wget@sunsite.dk; I don't always have time to respond.

The easiest way to do what you want may be to log in using your browser,
and then tell Wget to use the cookies from your browser, using
--load-cookies=path-to-browser's-cookies. Of course, this only works
if your browser saves its cookies in the standard text format (Firefox
prior to version 3 will do this), or can export to that format (note
that someone contributed a patch to allow Wget to work with Firefox 3
cookies; it's linked from http://wget.addictivecode.org/, it's
unofficial, so I can't vouch for its quality).
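
For example (a sketch; the profile path is an assumption -- Firefox 2
keeps a Netscape-format cookies.txt under your profile directory):

  $ wget --load-cookies ~/.mozilla/firefox/*.default/cookies.txt \
      'http://HOST/protected-page'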

Otherwise, you can perform the login using Wget, saving the cookies to a
file of your choice, using --post-data=..., --save-cookies=cookies.txt,
and probably --keep-session-cookies. This will require that you know
what data to place in --post-data, which generally requires that you dig
around in the HTML to find the right form field names, and where to post
them.

For instance, if you find a form like the following within the page
containing the log-in form:

<form action="/doLogin.php" method="POST">
  <input type="text" name="s-login">
  <input type="password" name="s-pass">
</form>

then you need to do something like:

  $ wget --post-data='s-login=USERNAME&s-pass=PASSWORD' \
--save-cookies=my-cookies.txt --keep-session-cookies \
http://HOSTNAME/doLogin.php

(Note that you _don't_ necessarily send the information to the page that
had the login form: you send it to the spot mentioned in the action
attribute of the password form.)

Once this is done, you _should_ be able to perform further operations
with Wget as if you're logged in, by using

  $ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
--keep-session-cookies ...

(I'm going to put this up on the Wgiki Faq now, at
http://wget.addictivecode.org/FrequentlyAskedQuestions)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIn09A7M8hyUobTrERAu04AJ9EgRoBBhvNCDwOt87f91p+HpWktACdFgMM
KEfliBtfrPBbh/XdvusEPiw=
=qlGZ
-END PGP SIGNATURE-


Re: AW: AW: Problem mirroring a site using ftp over proxy

2008-08-08 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Juon, Stefan wrote:

 The point is that wget sends an http request rather than a pure ftp
 command (GET ftp://ftpde.nai.com/CommonUpdater/ HTTP/1.0), which
 causes the proxy to send back an index.html. Do you agree?

Well of course it does: it's using an HTTP proxy. How do you send FTP
commands over HTTP?

The problem isn't that the result is an HTML file; the problem is that
the proxy sends an HTML file that Wget apparently can't parse. Perhaps
the proxy's not really sending an HTML file at all, which would be
unusual (but I'm not sure there are standards governing how FTP gets
proxied across HTTP), in which case Wget would need to be modified to
check whether the proxied results are a listing file. But until you show
us what index.html file Wget is getting, I don't see how we can help.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFInJpC7M8hyUobTrERAhGtAJ9/cY3nJk8xf1oWb+KCH8mQ54nXNACgg/is
xD3eHrajIfnUDaRhnFI+X+s=
=g1QP
-END PGP SIGNATURE-


Re: [PATCH] 1.11.4: Add missing $(datarootdir)

2008-08-08 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Maciej W. Rozycki wrote:
 Hello,
 
  Here is a change that adds $(datarootdir) throughout that has been missed 
 despite the prominent warning output by ./configure. :-(
 
 2008-08-09  Maciej W. Rozycki  [EMAIL PROTECTED]
 
   * Makefile.in (datarootdir): Add definition.
   * doc/Makefile.in (datarootdir): Likewise.
   * src/Makefile.in (datarootdir): Likewise.
   * tests/Makefile.in (datarootdir): Likewise.
 
  Please apply.
 
   Maciej

Hi Maciej,

We're not anticipating any further 1.11.x releases for Wget. Active
development for most of the last year has focused on 1.12, which is
based on Automake (so we get datarootdir for free).

But if any significant bugs are found in 1.11.4 that warrant a new
1.11.x release, we'll add this patch in.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFInM4p7M8hyUobTrERAm6zAJ4sLyVEIkq/VVQ2XKylIKPDrNewSwCfUsIH
rFK6XiRKYgVo/yZiU8Nf2iI=
=gLU4
-END PGP SIGNATURE-


Connection management and pipelined Wget

2008-08-07 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Micah Cowan wrote:
 * A getter command is mentioned more than once in the above. Note that
 this is not mutually exclusive with the concept of letting a single
 process govern connection persistence, which would handle the real work;
the getter would probably be a tool for communicating with the main driver.

...

 - Using existing tools to implement protocols Wget doesn't understand
 (want scp support? Just register it as an scp:// scheme handler), and
 instantly add support to Wget for the latest, greatest protocols without
 hacking Wget or waiting until we get around to implementing it.

Of course, one drawback is that it then becomes difficult to sanely
handle a feature for multiple simultaneous connections, or even
persistent connections, when outside programs come into play. Using a
getter we have control over, that can communicate with a
connection-managing program, would allow this to work, but that won't
work with outside programs that aren't in the know, such as the scp
command, or other getter programs. You can fork multiple scps for
multiple connections, but what will keep the number of simultaneous
connections to a reasonable limit?

Plus, even the idea of our own getter program communicating via a Unix
socket or some such to a connections manager program, irks me: it
obliterates the independence that makes pipelines useful. I guess, to be
useful, a pipelined Wget would need to have wholly independent tools;
but the loss of persistent connections would be too great a loss to
bear, I think (not that Wget handles them particularly well now:
HTTP/1.1 should significantly improve it, though).

Still, there were already plans to allow arbitrary content handler
commands, and URL filters; we can certainly continue to move in that
direction. We could still split off the HTML and CSS parsers as
completely autonomous (and interchangeable with alternatives) programs.
But it seems to me that content-_fetching_ (protocol support) will need
to continue to be fully integrated in Wget's core. Decisions on whether
URLs are followed or not could also be outsourced.

Previously, I said that we might lose Windows support by making Wget
more pipeline-y; but that's not necessarily true. It's just harder to
implement in Windows, but can be done. Hell, if need be, we could have
Wget write input to a file, then have the parser read it and spit out
another file. That's obviously lame, but OTOH it's how Wget already
parses HTML currently (except that no additional programs are used). I
suspect, though, that such a program would see a Unix-oriented release
some time before the Windows port would appear; unless there were
ongoing collaboration on a Windows port simultaneous to the Unix-ish
development.

If in fact everything except for connections could be handled as an
external command, then there might be little advantage to be gained by
library-izing Wget, and it might make more sense to leaving Wget as a
program, and letting connection handlers be plugins (which are expected
to use Wget's connection management system, rather than direct connections).

Such a project should still probably get a new name (I was going to say
"be a fork", but it'd probably be a rearchitecture anyway, with little
in common with current Wget); Wget proper should continue to be a project
that appeals to folks that need a tool that's sufficiently lightweight
to install as a core system component, without a lot of fluff (or at
least, not too much more fluff than it already has).

BTW, I added a couple new name concepts to
http://wget.addictivecode.org/Wget2Names: xget (x being the letter
after w), and niwt (which I like best so far: Nifty Integrated Web Tools).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIm2TG7M8hyUobTrERArRCAJwLkozlzfxEDJcJWBQDiHun6KoMfACeMI61
m7NvCrQ7XAIHTuW7Y9+6wCg=
=yeUz
-END PGP SIGNATURE-


Re: Connection management and pipelined Wget

2008-08-07 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Daniel Stenberg wrote:
 On Thu, 7 Aug 2008, Micah Cowan wrote:
 
 niwt (which I like best so far: Nifty Integrated Web Tools).
 
 But the grand question is: how would that be pronounced? Like newt? :-)

That was my thinking :)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIm2cl7M8hyUobTrERAt33AJ4xEts7QxviDOjRx7L83fr6QkFwrwCbBXy5
MgYGOL0OJRsg5+IpPEI0djY=
=dzkE
-END PGP SIGNATURE-


Re: AW: Problem mirroring a site using ftp over proxy

2008-08-07 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Well, considering that FTP proxied over HTTP is working fine for me,
it's probably more a matter of the index.html file that's generated by
the proxy (since one can't do a true LIST over a proxy). Perhaps you
could supply the index.html files that are being generated (be sure to
clean out any sensitive info first).

It might also be informative to know what server program is doing the
proxying.

- -Micah

Juon, Stefan wrote:
 ...the problem also exists with version 1.11.4. So what might cause wget
 not to download the files after it has performed a LIST?
 
 Thanks, Stefan
 
 Juon, Stefan wrote:
 Hi there
 I'm trying to mirror a ftp site over a proxy (Sun Java Webproxy 4.0.4)
 
 using this wget-command:
 
 export ftp_proxy=http://proxy.company.com:8080
 wget --follow-ftp --passive-ftp --proxy=on --mirror 
 --output-file=./logfile.wget ftp://ftpde.nai.com/CommonUpdater
 
 What version of Wget are you running? If it's not the latest, please try
 the current 1.11.4 release.
 
 Please also try the --debug option, to see if Wget gives you more
 information.
 

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIm2fF7M8hyUobTrERAv/BAJ9biwIIUFaIWZ9Ds7IZxiGAKriA7wCeJtn1
lYdaP8hzodianPg1Bp6b6gk=
=+HQo
-END PGP SIGNATURE-


Re: WGET Date-Time

2008-08-07 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Andreas Weller wrote:
 Hi!
 I use wget to download files from a ftp server in a bash script.
 For example:
 touch last.time
 wget -nc ftp://[]/*.txt .
 find -newer last.time
 
 This fails if the files on the FTP server are older than my last.time.
 So I want wget to set the file date/time to the local creation time, not
 the server's...
 
 How to do this?

You can't, currently. This behavior is intended to support Wget's
timestamping (-N) functionality.

However, I'd accept a patch for an option that disables this.
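
In the meantime, here's a workaround sketch: a file's ctime can't be
back-dated, so find's -cnewer test (where supported) sees newly created
files even when Wget sets an old mtime (HOST is a placeholder):

  $ touch last.time
  $ wget -nc 'ftp://HOST/path/*.txt'
  $ find . -name '*.txt' -cnewer last.time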

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIm2si7M8hyUobTrERAi9AAJ0f8TUv7TJR6tFsgc4k174rqH6OlgCghCzz
xpemaFdQhODIm0SGp7rJSRA=
=vDKD
-END PGP SIGNATURE-


Re: Problem mirroring a site using ftp over proxy

2008-08-06 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Juon, Stefan wrote:
 Hi there
 I'm trying to mirror a ftp site over a proxy (Sun Java Webproxy 4.0.4)
 using this wget-command:
  
 export ftp_proxy=http://proxy.company.com:8080
 wget --follow-ftp --passive-ftp --proxy=on --mirror
 --output-file=./logfile.wget ftp://ftpde.nai.com/CommonUpdater

What version of Wget are you running? If it's not the latest, please try
the current 1.11.4 release.

Please also try the --debug option, to see if Wget gives you more
information.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFImVZ77M8hyUobTrERAgS7AJ4lWgDuBJonnms+gkriGTZ7LlA4TwCfeNqo
jOtcPq60sVWXb9CA1n6FSnI=
=Z/D4
-END PGP SIGNATURE-


Re: Wget scriptability

2008-08-03 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Dražen Kačar wrote:
 Micah Cowan wrote:
 
 Okay, so there's been a lot of thought in the past, regarding better
 extensibility features for Wget. Things like hooks for adding support
 for traversal of new Content-Types besides text/html, or adding some
 form of JavaScript support, or support for MetaLink. Also, support for
 being able to filter results pre- and post-processing by Wget: for
 example, being able to do some filtering on the HTML to change how Wget
 sees it before parsing for links, but without affecting the actual
 downloaded version; or filtering the links themselves to alter what Wget
 fetches.
 
 However, another thing that's been vaguely itching at me lately, is the
 fact that Wget's design is not particularly unix-y. Instead of doing one
 thing, and doing it well, it does a lot of things, some well, some not.
 
 It does what various people needed. It wasn't an excercise in writing a
 unixy utility. It was a program that solved real problems for real
 people.

 But the thing everyone loves about Unix and GNU (and certainly the thing
 that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
 paradigm,
 
 I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command utilities
(that is, to pipeline). The power and flexibility of pipelines is
extremely well-established in the Unix world; I feel no need whatsoever
to waste breath arguing for it, particularly when you haven't provided
the reasons you hate it.

For my part, I'm not exaggerating that it's single-handedly responsible
for why I'm a Unix/GNU user at all, and why I continue to highly enjoy
developing on it.

  find -name '*.html' -exec sed -i \
    's#http://oldhost/#http://newhost/#g' '{}' \;

  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  pic | tbl | eqn | eff-ing | troff -ms

Each one of these demonstrates the enormously powerful technique of
using distinct tools with distinct feature domains, together to form a
cohesive solution for the need. The best part is (with the possible
exception of the troff pipeline), each of these components are
immediately available for use in some other pipeline that does some
other completely different function.

Note, though, that I don't intend that using Piped-Wget would actually
mean the user types in a special pipeline each time he wants to do
something with it. The primary driver would read in some config file
that would tell wget how it should do the piping. You just tweak the
config file when you want to add new functionality.

  - The tools themselves, as much as possible, should be written in an
 easily-hackable scripting language. Python makes a good candidate. Where
 we want efficiency, we can implement modules in C to do the work.
 
 At the time Wget was conceived, that was Tcl's mantra. It failed
 miserably. :-)

Are you claiming that Tcl's failure was due to the ability to integrate
it with C, rather than its abysmal inadequacy as a programming language
(changing it from an ability to integrate with C, to an absolute
requirement to do so in order to get anything accomplished)?

 How about concentrating on the problems listed in your first paragraph
 (which is why I quoted it)? Could you show us how would a buch of shell
 tools solve them? Or how would a librarized Wget solve them? Or how
 would any other paradigm or architecture or whatever solve them?

It should be trivially obvious: you plug them in, rather than wait for
the Wget developers to get around to implementing it.

The thing that both library-ized Wget and pipeline-ized Wget would offer
is the same: extreme flexibility. It puts the users in control of what
Wget does, rather than just perpetually hearing, sorry, Wget can't do
it: you could hack the source, though. :p

The difference between the two is that a pipelined Wget offers this
flexibility to a wider range of users, whereas a library Wget offers it
to C programmers.

Or how would you expect to do these things without a library-ized (at
least) Wget? Implementing them in the core app (at least by default) is
clearly wrong (scope bloat). Giving Wget a plugin architecture is good,
but then there's only as much flexibility as there are hooks.
Libraryizing Wget is equivalent to providing everything as hooks, and
puts the program using it in the driver's seat (and, naturally, there'd
be a wrapper implementation, like curl for libcurl). A suite of
interconnected utilities does the same, but is more accessible to
greater numbers of people. Generally at some expense to efficiency
(aren't all flexible architectures?); but Wget isn't CPU-bound, it's
network-bound.

As mentioned in my original post, this would be a separate project from
Wget. Wget would not be going away (though it seems likely to me that it
would quickly reach a primarily

Wget scriptability

2008-08-02 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Okay, so there's been a lot of thought in the past, regarding better
extensibility features for Wget. Things like hooks for adding support
for traversal of new Content-Types besides text/html, or adding some
form of JavaScript support, or support for MetaLink. Also, support for
being able to filter results pre- and post-processing by Wget: for
example, being able to do some filtering on the HTML to change how Wget
sees it before parsing for links, but without affecting the actual
downloaded version; or filtering the links themselves to alter what Wget
fetches.

The original concept before I came onboard, was plugin modules. After
some thought, I'd decided I didn't like this overly much, and have
mainly been leading toward the idea of a next-gen Wget-as-a-library
thing, probably wrapping libcurl (and with a command-client version,
like curl). This obviously wouldn't have been a Wget any more, so would
have been a separate project, with a different name.

However, another thing that's been vaguely itching at me lately, is the
fact that Wget's design is not particularly unix-y. Instead of doing one
thing, and doing it well, it does a lot of things, some well, some not.

So the last couple days I've been thinking, maybe wget-ng should be a
suite of interoperating shell utilities, rather than a library or a
single app. This could have some really huge advantages: users could
choose their own html-parser to use, they can plug in parsers for
whatever filetypes they desire, people who want to implement exotic
features can do that...

Of course, at this point we're talking about something that's
fundamentally different from Wget. Just as we were when we were
considering making a next-gen library version. It'd be a completely
separate project. And I'm still not going to start it right away (though
I think some preliminary requirements and design discussions would be a
good idea). Wget's not going to die, nor is everyone going to want to
switch to some new-fangled re-envisioning of it.

But the thing everyone loves about Unix and GNU (and certainly the thing
that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
paradigm, which is what enables you to mix-and-match different tools to
cover the different areas of functionality. Wget doesn't fit very well
into that scheme, and I think it could become even much more powerful
than it already is, by being broken into smaller, more discrete
projects. Or, to be more precise, by offering an alternative that does the
equivalent.

So far, the following principles have struck me as advisable for a
project such as this:

 - The tools themselves, as much as possible, should be written in an
easily-hackable scripting language. Python makes a good candidate. Where
we want efficiency, we can implement modules in C to do the work.

 - While efficiency won't be the highest priority (else we'd just stick
to the monolith), it's still important. Spawning off separate processes
to each fetch their own page, initiating a new connection each time,
would be a lousy idea. So, the architectural model should center around
a URL-getter driver, that manages connections and such, reusing
persistent ones as much as possible. Of course, there might be distinct
commands to handle separate types of URLs, (or alternative methods for
handling them, such as MetaLink), and perhaps not all of these would be
able to do persistence (a dead-simple way to add support for scp, etc,
might be to simply call the command-line program).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIlEcX7M8hyUobTrERAqvSAJ9rx99xhU7Zo/xwbKXDbWCWp4jSQwCfbbQM
zmY9j1zYuGq0eNkZnsqR+Jo=
=8wcf
-END PGP SIGNATURE-


Re: wget does not like this URL

2008-07-31 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Kevin O'Gorman wrote:
 Is there a reason i get this:
 [EMAIL PROTECTED] Pending $ wget -O foo 
'http://www.littlegolem.net/jsp/info/player_game_list_txt.jsp?plid=1107&gtid=hex'
 Cannot specify -r, -p or -N if -O is given.
 Usage: wget [OPTION]... [URL]...
 [EMAIL PROTECTED] Pending $
 
 While I do have -O, I don't have the ones it seems to think I've specified.
 
 Without the -O foo it works fine, but of course puts the results in
 a different place.
 I get the same error message if I use the long-form parameter.

You most likely have timestamping=on in your wgetrc. -N and -O were
disallowed for version 1.11, but were re-enabled for 1.11.3 (I think)
with a warning. The latest version of wget is 1.11.4.
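
You can also override the wgetrc setting for a single run with -e, which
executes a wgetrc command from the command line:

  $ wget -e timestamping=off -O foo \
      'http://www.littlegolem.net/jsp/info/player_game_list_txt.jsp?plid=1107&gtid=hex'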

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIkf9U7M8hyUobTrERAtkfAJ9g84lMEkzSeLn24cWQA805HZmE8wCfV2Ck
bB5RK4lRlcBbwOSiU4jPwxM=
=K9cv
-END PGP SIGNATURE-


Re: propose new feature: loading cookies from Firefox 3.0 cookies.sqlite

2008-07-28 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

宋浩 wrote:
 Hi, folks:
 I'm currently using Firefox 3.0 on my Ubuntu 8.04 system. The browser
 saves its cookie file in the same directory as its predecessor Firefox
 2.x, but in a SQLite database file called cookies.sqlite instead of a
 textual file. And I want to add support for this new cookie file format
 into wget. The coding is almost done.
 I'd like to know if anyone else is also working on this.

To be honest, I'd prefer to avoid a dependency on sqlite in Wget, even a
configurable one. I'd much prefer to see a solution based on a separate
program that converts from cookies.sqlite to a cookies.txt file.
Besides, that solution would work with more tools than Wget (do one
thing, and do it well¹).

¹ Not that Wget adheres particularly well to that philosophy...
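
For what it's worth, here's a minimal converter sketch using the sqlite3
command-line tool; the moz_cookies column names are my assumption of the
FF3 schema, so verify with .schema moz_cookies first:

  $ { echo '# Netscape HTTP Cookie File'
      sqlite3 -separator "$(printf '\t')" cookies.sqlite \
        "SELECT host,
                CASE WHEN host LIKE '.%' THEN 'TRUE' ELSE 'FALSE' END,
                path,
                CASE isSecure WHEN 1 THEN 'TRUE' ELSE 'FALSE' END,
                expiry, name, value
         FROM moz_cookies;"
    } > cookies.txt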

Lest you think I'm just being unfeeling to your needs, I should point
out that I'm also running on an Ubuntu 8.04, and have found the
sqlite-based cookies files a supreme annoyance. I'd just prefer a more
general, scriptable solution.

However, if you choose to complete this work (you said you're nearly
done), I won't mind if you place a link to your patch on the Wiki front
page (http://wget.addictivecode.org/FrontPage).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIjmZ37M8hyUobTrERApkkAJ9Ns0bt0i7lgCrehQV3Q4RNRYl0eACgiwqR
f3tC07+DhuGfI44tPFuaXDE=
=ncxt
-END PGP SIGNATURE-


Re: wget-1.11.4 bug

2008-07-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

kuang-cheng chao wrote:
 Dear Micah:
  
 Thanks for your work of wget.
  
 There is a question about two wgets running simultaneously.
 In the method resolve_bind_address, wget assumes that it is called once.
 However, this can cause two domain names to resolve to the same ip if two
 wgets run the same method concurrently.

Have you reproduced this, or is this in theory? If the latter, what has
led you to this conclusion? I don't see anything in the code that would
cause this behavior.

Also, please use the mailing list for discussions about Wget. I've added
it to the recipients list.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIiYKF7M8hyUobTrERAr7fAJ0TnkLdEVOMy6wJA3Z1kIYC7dQoMACfZ9hb
x5K6MTzhgVRCdKJwUGnbSRw=
=EcFC
-END PGP SIGNATURE-


Re: wget-1.11.4 bug

2008-07-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

k.c. chao wrote:
 Micah Cowan wrote:
  Have you reproduced this, or is this in theory? If the latter, what has
  led you to this conclusion? I don't see anything in the code that would
  cause this behavior.

 I reproduced this. But I can't be sure the real problem is in
 resolve_bind_address. In the attached message, both
 api.yougotphogo.com and farm1.static.flickr.com get the same
 ip (74.124.203.218).  The two wgets are called from two threads of a
 program.

Yeah, I get 68.142.213.135 for the flickr.com address, currently.

The thing is, though, those two threads should be running wgets under
separate processes (I'm not sure how they couldn't be, but if they
somehow weren't that would be using Wget other than how it was designed
to be used).

This problem sounds much more like an issue with the OS's API than an
issue with Wget, to me. But we'd still want to work around it if it were
feasible.

What operating system are you running? Vista?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIirT17M8hyUobTrERAjsuAJ0crMPYIQficu1csou8Tt0jDFKvpQCeNYk3
1FhXl3uUYj2IA53qI1oOJ8A=
=DvdG
-END PGP SIGNATURE-


Re: Patch to allow filtering on content-type header

2008-07-25 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Lars Kotthoff wrote:
 Hi list,
 
  I've written a patch which allows filtering on the content-type header to
 select what is downloaded. E.g.
 wget -r --content-type=text/* http://www.foobar.com
 will only download things with a content-type header of text/html, text/plain
 etc. There's also a content-type-exclude option to not download specific
 content-types.

Sounds great, Lars!

In fact, we already have an RFE on the bug-tracker for just such a thing
at https://savannah.gnu.org/bugs/?20378; if you'd like to attach it
there, that'd be great.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFIirbU7M8hyUobTrERAlUHAJ9pFEOOgspdiYXE54Wg0nD4+e3udgCWMPjM
+muSJuWzt8yJwIlTO3oJbQ==
=+jBB
-END PGP SIGNATURE-


Re: trouble with -p

2008-07-24 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Brian Keck wrote:
 (It also renames diggthis.js to diggthis.js.html, but I don't care about
 that).

That's an indication that the server is misconfigured, and is serving
diggthis.js as text/html, rather than text/javascript or text/x-javascript.
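
You can check what a server sends for any URL without saving anything;
-S prints the response headers (look for the Content-Type line). E.g.,
with a placeholder URL:

  $ wget -S --spider 'http://SITE/path/diggthis.js'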

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIiD4k7M8hyUobTrERAoJEAJ4q0N4lxfkDoQNtx62QMkGHXxmAlwCeIEdd
NKprZGCw4lfMx/jybi/qriM=
=Egpr
-END PGP SIGNATURE-


Re: Wget

2008-07-22 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hor Meng Yoong wrote:
 Hi:
 
   I understand that you are a very busy person. Sorry to disturb you.

Hi; please use the mailing list for support requests. I've copied the
list in my response.

   I am using wget to mirror (using ftp://) a user home directory from a
 unix machine. Wget defaults to the user's home directory. However, I also
 need to get the /etc folder. So, I tried to use ../../../etc. It works,
 but the resulting ftp'd files end up in %2E%2E/ %2E%2E/ %2E%2E
 
 Is there any means to overcome this, or to rename the directory?

Try the -nd option (you may also need -nH). You might prefer to fetch
/etc in a separate invocation from the other things; perhaps with the -P
option to specify a directory name.
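
For the /etc part, something like this sketch may be cleaner than ../
tricks -- a %2F near the start of an FTP URL's path marks it as absolute
(USER, PASS and HOST are placeholders; the local layout may still want
-nd or --cut-dirs tuning):

  $ wget -r -nH -P local-etc 'ftp://USER:PASS@HOST/%2Fetc/'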

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIhi5O7M8hyUobTrERAl+YAJ9xaX5NivhEfzJLHKD5T3qs0nZuOACgg0eC
IqFZMlz8obK+loKyQ6vXCWo=
=gNqH
-END PGP SIGNATURE-


Re: trouble with -p

2008-07-19 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Brian Keck wrote:
 Hello,
 
 If you do
 
 wget http://www.ifixit.com/Guide/First-Look/iPhone3G
 
 then you get an HTML file called iPhone3G.
 
 But if you do
 
 wget -p http://www.ifixit.com/Guide/First-Look/iPhone3G
 
 then you get a directory called iPhone3G.  
 
 This makes sense if you look at the links in the HTML file, like
 
 /Guide/First-Look/iPhone3G/images/3jYKHyIVrAHnG4Br-standard.jpg
 
 But of course I want both.  Is there a way of getting wget -p to do
 something clever, like renaming the HTML file?  I've looked through
 wget(1)  /usr/share/doc/wget  the comments in the 1.10.2 source
 without seeing anything relevant.

That strikes me as not quite right. If Wget sees
http://www.ifixit.com/Guide/First-Look/iPhone3G, and it's not redirected
to http://www.ifixit.com/Guide/First-Look/iPhone3G/, then Wget will use
a file name. What's more, if it later sees it with the slash, it will
fail to create a directory at all, since the file already exists with
that pathname.

I'm not sure what you mean by I want both. You can't possibly have a
regular file named iPhone3G, and another file named iPhone3G/images/...
it can't be both a file and a directory at once.

If you specify the link with a trailing slash, then Wget will realize
iPhone3G is a directory, and will store the file it finds there as
iPhone3G/index.html. You're out of luck, though, if some links refer to
it with, and some without, the trailing slash, with a server that
doesn't redirect to the slash version (like Apache does).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIgiPA7M8hyUobTrERAmq8AJ96TyBcrdI0YB06Z2tODRCMSI22AgCggESe
jgXOMQ+uNMupbgq0vJZByv0=
=jzGB
-END PGP SIGNATURE-


Re: trouble with -p

2008-07-19 Thread Micah Cowan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

James Cloos wrote:
 Micah == Micah Cowan [EMAIL PROTECTED] writes:
 
 Micah I'm not sure what you mean by I want both. 
 
 He means that, when the -p option is given, he wants to mangle either
 the created filename or the created directory name so that both do in
 fact get created on the filesystem and all related files get saved.
 
 Perhaps delaying the initial open(2) until after parsing the first
 document and then pretending that the initial URL had a trailing
 solidus might work?

Not possible with the current architecture. And that wouldn't solve the
problem if it happens not to appear that way in the links immediately
contained within.

https://savannah.gnu.org/bugs/index.php?23756 covers my solution for
handling this.

The easy workaround for now would be to supply the URL with the
solidus in the first place; though, as mentioned, I'm not sure that will
work if Wget then later encounters a version without the solidus.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: rapidshare download problem

2008-07-17 Thread Micah Cowan

Doruk Fisek wrote:
 Hi,
 
  I'm having trouble cookieless downloading from rapidshare with the
 latest version of wget.
 
  When I use a url like;
  
  http://username:[EMAIL PROTECTED]/files/30168760/Rapidshare_EN.txt
 
  wget 1.10.2 downloads it just fine but wget 1.11.4 brings an html page
 instead.

See if --auth-no-challenge fixes it for you.
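Something along these lines, untested (USER, PASS and HOST stand in for
the real credentials and host):

  wget --auth-no-challenge --user=USER --password=PASS \
       http://HOST/files/30168760/Rapidshare_EN.txt

--auth-no-challenge makes Wget send the Basic credentials up front, the
way 1.10.2 did, rather than waiting for a 401 challenge.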

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Building [Re: CSS support now in mainline]

2008-07-12 Thread Micah Cowan

Micah Cowan wrote:
 I'm pleased to report that the paperwork has been finalized for the
 assignment of copyright over Ted Mielczarek's CSS support to the FSF.
 That support has now been merged into the mainline repository, and the
 separate css repository has been removed.

Note that this introduces a new build requirement when building from the
repo: flex (or lex) is now required.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIeRYD7M8hyUobTrERAjAbAKCKhuqSVBoAqiPD82SN/RHIoI7IDwCfehUl
EAwy+zpLCUKE86EcjFIHVzE=
=9+c7
-END PGP SIGNATURE-


Re: WGET bug...

2008-07-11 Thread Micah Cowan

HARPREET SAWHNEY wrote:
 Hi,
 
 I am getting a strange bug when I use wget to download a binary file
 from a URL versus when I manually download.
 
 The attached ZIP file contains two files:
 
 05.upc --- manually downloaded
 dum.upc--- downloaded through wget
 
 wget adds a number of ascii characters to the head of the file and seems
 to delete a similar number from the tail.
 
 So the file sizes are the same but the addition and deletion renders
 the file useless.
 
 Could you please direct me on if I should be using some specific
 option to avoid this problem?

In the future, it's useful to mention which version of Wget you're using.

The problem you're having is that the server is adding the extra HTML at
the front of your session, and then giving you the file contents anyway.
It's a bug in the PHP code that serves the file.

You're getting this extra content because you are not logged in when
you're fetching it. You need to have Wget send a cookie with the
login-session information, and then the server will probably stop
sending the corrupting information at the head of the file. The site
does not appear to use HTTP's authentication mechanisms, so the
[EMAIL PROTECTED] bit in the URL doesn't do you any good. It uses
Forms-and-cookies authentication.

Hopefully, you're using a browser that stores its cookies in a text
format, or that is capable of exporting to a text format. In that case,
you can just ensure that you're logged in in your browser, and use the
--load-cookies=cookies.txt option to Wget to use the same session
information.
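That is, something like this, with URL standing in for the real download
link:

  wget --load-cookies=cookies.txt URL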

Otherwise, you'll need to use --save-cookies with Wget to simulate the
login form post, which is tricky and requires some understanding of HTML
Forms.

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: WGET bug...

2008-07-11 Thread Micah Cowan

HARPREET SAWHNEY wrote:
 Hi,
 
 Thanks for the prompt response.
 
 I am using
 
 GNU Wget 1.10.2
 
 I tried a few things on your suggestion but the problem remains.
 
 1. I exported the cookies file in Internet Explorer and specified
 that in the Wget command line. But same error occurs.
 
 2. I have an open session on the site with my username and password.
 
 3. I also tried running wget while I am downloading a file from the
 IE session on the site, but the same error.

Sounds like you'll need to get the appropriate cookie by using Wget to
login to the website. This requires site-specific information from the
user-login form page, though, so I can't help you without that.

If you know how to read some HTML, then you can find the HTML form used
for posting username/password stuff, and use

wget --keep-session-cookies --save-cookies=cookies.txt \
     --post-data='username=foo&password=bar' ACTION

Where ACTION is the value of the form's action field, USERNAME and
PASSWORD (and possibly further required values) are field names from the
HTML form, and FOO and BAR are the username/password.
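Once that works, the saved cookies can be reused for the actual download,
roughly:

  wget --load-cookies=cookies.txt http://HOST/path/to/file.upc

(again, HOST and the path are placeholders for the real URL).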

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


CSS support now in mainline

2008-07-11 Thread Micah Cowan

I'm pleased to report that the paperwork has been finalized for the
assignment of copyright over Ted Mielczarek's CSS support to the FSF.
That support has now been merged into the mainline repository, and the
separate css repository has been removed.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


[RESOLVED] Re: Release: GNU Wget 1.11.4

2008-07-01 Thread Micah Cowan

Robert Denton wrote:
 The other guy's trick of sending the unsubscribe request from a
 different email address worked!  Now that I am unsubscribed however,
 I cannot share that with the list.  Would you do the honors for me?
 Thanks!
 
 Robert

Thanks, Doug, for pointing that out.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Mailing list migration?

2008-07-01 Thread Micah Cowan

Madhusudan Hosaagrahara wrote:
 Hi Micah,
   My  suggestion would be to choose the option that minimizes the
 amount of time and effort required to maintain these lists.
   What do you think of using an external tool like
 https://savannah.gnu.org/maintenance/ListServer or offloading mail to
 3rd party apps like http://www.google.com/a/help/intl/en/index.html or
 http://smallbusiness.officelive.com/GetOnline/Domain
   Last, I'm curious if any attempts have been made to get http://wget.org
 ~Madhu.

The Savannah one I believe is an interface to the existing
[EMAIL PROTECTED] (though the latter predates the former). I didn't
realize that shell access was a possibility. It's not root, but it's
nice to have.

I'd probably prefer to use Gnu's over Google.

As to wget.org, it looks like it's registered to someone in China; I
don't think I'm going to spend much effort trying to get it.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Release: GNU Wget 1.11.4

2008-06-30 Thread Micah Cowan

Announcing GNU Wget 1.11.4, a bugfix release.

The source code is available at
  - http://ftp.gnu.org/gnu/wget/
  - ftp://ftp.gnu.org/gnu/wget/

Documentation is at
  - http://www.gnu.org/software/wget/manual/

More information about Wget is on the official GNU web page at
http://www.gnu.org/software/wget/, and on the Wget Wgiki,
http://wget.addictivecode.org/

Here are the relevant NEWS entries:

* Changes in Wget 1.11.4

** Fixed an issue (apparently a regression) where -O would refuse to
download when -nc was given, even though the file didn't exist.

** Fixed a situation where Wget could abort with --continue if the
remote server gives a content-length of zero when the file exists
locally with content.

** Fixed a crash on some systems, due to Wget casting a pointer-to-long
to a pointer-to-time_t.

** Translation updates for Catalan.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/




Re: Release: GNU Wget 1.11.4

2008-06-30 Thread Micah Cowan

Robert Denton wrote:
 Hi, I have sent a few emails to:
 
 [EMAIL PROTECTED]
 
 but they keep bouncing (blocked by SpamAssassin).  Is there any other
 way to get off this list?  Thanks!

I'm afraid there's nothing we can do here. :\ Please contact
[EMAIL PROTECTED] to fix this.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Mailing list migration?

2008-06-30 Thread Micah Cowan

I'm thinking it may be appropriate at this time to broach the subject of
a mailing list migration.

In the year that I've been the maintainer for GNU Wget, the mailing list
at dotsrc.org has gone down twice, for several days at a time, and I'm
concerned about whether we might expect further such difficulties in the
future.

When we do have issues, there's a tendency for responses to be a bit
slow. This is understandable, as dotsrc is a small, volunteer-run
organization serving the needs of many projects. But it would be nice to
 have more direct control over the service: for instance, to unsubscribe
people when they have trouble doing so themselves (and, perhaps, to
ensure that the spam blocker never affects unsubscribe attempts from
subscribed addresses).

Though it hasn't proven to be a problem yet, I think it would be helpful
to have unsubscribe or moderation ability, in the event that some
threads or posters get a little out of hand.

The downsides, of course, will be the temporary pain of moving to a new
address, the potential to lose some subscribers with the move, and
moving the current archives over to use the new mailing list.

The ideal upsides would be, more reliable service, and more direct
control over the subscription list and spam controls.

The two possibilities I can think of, are:

 - Set up a new mailing list at addictivecode.org (my VPS, where the
Wiki and source repos are at). The infrastructure is there already
(being used for [EMAIL PROTECTED]; there was also a
wget-committers list for folks with commit access, which is no longer
used). This has the advantage that the Wget maintainer will have root
access (so long as it continues to be me ;). The disadvantages are that
I may not have the time to spend that a dedicated sysadmin might, and
I'm not sure what kind of uptime I can guarantee, as services tend to
drop (OOM-killed) when Apache gets hit hard. There are ways around this,
but I haven't had time to spend on looking seriously at it. So far,
though, my uptimes have been a bit better than dotsrc's, at least.

 - Use [EMAIL PROTECTED] as the primary mailing list once again, and ask
the dotsrc folks to forward wget@sunsite.dk there. This has the
advantage that I will have control over the subscription list and
various other admin-level things (I hope?), and the GNU admins can
probably do a better job (maybe?) than either I or the dotsrc folks can,
at keeping services running smoothly.

What do y'all think?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: No downloading

2008-06-29 Thread Micah Cowan

 On Sun, Jun 29, 2008 at 1:42 PM, Mishari Almishari [EMAIL PROTECTED] wrote:
 Hi,
 I want to download the website www.2006election.net

 For that, I used the command
 wget -d -nd -p -E -H -k -K -S -R png,gif,jpg,bmp,ico  --ignore-length
 --user-agent=Mozilla -e robots=off -P www.2006election.net -o
 www.2006election.net.out http://www.2006election.net

 But the downloaded page index.html has no content (except body/head tags),
 even though I can see the content when I use Internet Explorer.

mm w wrote:
 the default index is not named index, or there is a HTTP test
 server/side regarding HTTP_USER_AGENT

The first one could not possibly cause problems, since he's not
requesting any URLs with index.html in them.

The HTTP_USER_AGENT thing is the problem. Mishari tried to specifically
handle this with the --user-agent line, but it apparently wasn't
convincing enough. I got it to work with:

  --user-agent='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET
CLR 1.1.4322)'
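In other words, the original command with just the user-agent swapped out;
something like:

  wget -d -nd -p -E -H -k -K -S -R png,gif,jpg,bmp,ico --ignore-length \
       --user-agent='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)' \
       -e robots=off -P www.2006election.net http://www.2006election.net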

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: No downloading

2008-06-29 Thread Micah Cowan

Petr Pisar wrote:
 On 2008-06-29, Mishari Almishari [EMAIL PROTECTED] wrote:
 Hi,
 I want to download the website www.2006election.net
 

 But the downloaded page index.html has no content (except body/head tags),
 even though I can see the content when I use Internet Explorer.

 This is not a bug, that's a feature. All the content you see in IE is
 generated by JavaScript. See the source code of the web page in IE.

No, the command he gives literally yields a completely empty web page:

<html>
<body>
</body>
</html>

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Handling Ajax (was Re: No downloading)

2008-06-29 Thread Micah Cowan

Paul King wrote:
 I just want to de-lurk for a minute. I have been using wget on a regular
 basis for various websites.
 
 If JavaScript is responsible for writing the content, then you have a web
 page that probably uses AJAX, and would be dynamically updateable. Since
 Ajax use is on the rise, I wonder if anyone here can say how wget deals
 with sites using Ajax?

Not so well, generally speaking. Wget isn't going to do any
JavaScript-interpreting on its own, so it really depends. If the
JavaScript was written in certain ways, it's possible it will just
magically work when you fire it up in your browser. It's not unlikely
that it fails miserably. :\

Ultimately, I think it depends on the site.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Micah Cowan

Tony Lewis wrote:
 Coombe, Allan David (DPS) wrote:
 
 However, the case of the files on disk is still mixed - so I assume
 that wget is not using the URL it originally requested (harvested
 from the HTML?) to create directories and files on disk.  So what
 is it using? A http header (if so, which one??).
 
 I think wget uses the case from the HTML page(s) for the file name;
 your proxy would need to change the URLs in the HTML pages to lower
 case too.

My understanding from David's post is that he claimed to have been doing
just that:

 I modified the response from the web site to lowercase the urls in
 the html (actually I lowercased the whole response) and the data that
 wget put on disk was fully lowercased - problem solved - or so I
 thought.

My suspicion is it's not quite working, though, as otherwise
where would Wget be getting the mixed-case URLs?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-22 Thread Micah Cowan

Coombe, Allan David (DPS) wrote:
 OK - now I am confused.
 
 I found a perl based http proxy (named http::proxy funnily enough)
 that has filters to change both the request and response headers and
 data.  I modified the response from the web site to lowercase the urls
 in the html (actually I lowercased the whole response) and the data that
 wget put on disk was fully lowercased - problem solved - or so I thought.
 
 However, the case of the files on disk is still mixed - so I assume that
 wget is not using the URL it originally requested (harvested from the
 HTML?) to create directories and files on disk.  So what is it using? An
 HTTP header (if so, which one??).

I think you're missing something on your end; I couldn't begin to tell
you what. Running with --debug will likely be informative.

Wget uses the URL that successfully results in a file download. If the
files on disk have mixed case, then it's because it was the result of a
mixed-case request from Wget (which, in turn, must have either resulted
from an explicit argument, or from HTML content).

The only exception to the above is when you explicitly enable
--content-disposition support, in which case Wget will use any filename
specified in a Content-Disposition header. Those are virtually never
issued, except for CGI-based downloads (and you have to explicitly
enable it).
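For what it's worth, that looks like this, with URL standing in for some
CGI-style download link:

  wget --content-disposition URL

in which case a filename from a Content-Disposition header, if present,
wins over the one derived from the URL.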

--
Good luck!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: help with accessing Google APIs

2008-06-20 Thread Micah Cowan

Ryan Schmidt wrote:
 On Jun 20, 2008, at 4:47 PM, [EMAIL PROTECTED] wrote:
 I get the following error:

 --17:42:58--  http://ajax.googleapis.com/ajax/services/search/web?v=1.0
= [EMAIL PROTECTED]'
 Resolving ajax.googleapis.com... 66.102.1.100, 66.102.1.101,
 66.102.1.102, ...
 Connecting to ajax.googleapis.com|66.102.1.100|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: 81 [text/javascript]

 0K   100%   6.79 MB/s

 17:42:58 (6.79 MB/s) - [EMAIL PROTECTED]' saved [81/81]

 'q' is not recognized as an internal or external command, operable
 program or batch file.
 
 
 Your shell appears to think the & in the URL has not been escaped. I'm
 not sure why it thinks that, since you've enclosed it in single quotes,
 which should be sufficient. And copying and pasting your command to my
 terminal (replacing curl with wget) works for me:

The fact that Wget transcodes '?' to '@' is a pretty good sign the user
is running Windows, so I'm going to assume that. In that case, AIUI,
single-quotes don't work the same as they would in a Unix shell: the
user needs to use double-quotes instead.
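For example, in cmd.exe (the query string here is made up, since the
original one was lost in the archive):

  wget "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=test"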

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Does --page-requisites load content from other hosts?

2008-06-19 Thread Micah Cowan

Stefan Nowak wrote:
 Does --page-requisites load content from other hosts as well, or must I
 explicitly issue a --span-hosts with it?
 
 The manpage unambiguously says about --span-hosts Enable spanning
 across hosts when doing recursive retrieving, but at the --span-hosts
 section it does not mention whether wget will load from other hosts or
 only the mother host.
 
 Please reply in CC to me, and also update the manpage with the information.

--page-requisites invokes a special kind of recursion (and the manpage
says this), so the manpage is pretty clear about what's required (i.e.,
yes, you need --span-hosts, just as you would for -r).

The manual makes this even more clear in the following text:

Actually, to
download a single page and all its requisites (even if they exist
on separate websites), and make sure the lot displays properly
locally, this author likes to use a few options in addition to -p:
 
wget -E -H -k -K -p http://site/document

(Note, btw, that the authoritative source for information about Wget is
the info manual, not the man page.)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Micah Cowan

Ryan Schmidt wrote:
 For example, if you want American English, set LANG to en_US.
 
 In the Bash shell, you can type export LANG=en_US
 
 In the Tcsh shell, you can type setenv LANG en_US
 
 To find out which shell you use, type echo $SHELL

FYI: It's not in any current release, but current mainline has support
for the special en@boldquot value for LANGUAGE (you may still need to set
LANG=en_US or something). This causes all quoted strings to be rendered
in boldface, using terminal escape sequences. I've found it pleasant to
use that setting for my own purposes.

The en@quot LANGUAGE setting is also supported (it converts to proper
left/right quotemarks, but no terminal sequences); but I've rigged
LANG=en_US to have the same effect (en@boldquot.po is copied to en_US.po).

Again, this is only in the mainline repo, and not in any release.
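If you're building from mainline and want to try it, something like this
should do (any URL will serve in place of example.com):

  LANGUAGE=en@boldquot LANG=en_US wget http://example.com/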

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Help with a core dump please

2008-06-16 Thread Micah Cowan

Valentin wrote:
 Hi,
 I'm trying to mirror a site with this command:
 wget -nd -r -k -p -H -c -T 10
 -t 2 http://www.freesfonline.de/Magazines1.html
 It works fine, until at some point it tries to get
 http://www.booksense.com/robots.txt and core dumps. When I try
 downloading just that file there's no problem. Is there some way to
 increase wget's verbosity or another way of debugging this? I have
 version 1.10.2-3ubuntu1.

FYI, Valentin caught me online for IRC. It looked like a problem we'd
fixed in 1.11, but actually, it's still present, and a new bug report
has been filed:

https://savannah.gnu.org/bugs/?23613

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: bug in wget

2008-06-14 Thread Micah Cowan

Sir Vision wrote:
 Hello,
 
 entering the following command results in an error:
 
 --- command start ---
 c:\Downloads\wget_v1.11.3bwget
 ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-mozilla1.8-l10n/
 -P c:\Downloads\
 --- command end ---
 
 wget can't convert the .listing file into an html file

As this seems to work fine on Unix, for me, I'll have to leave it to the
Windows porting guy (hi Chris!) to find out what might be going wrong.

...however, it would really help if you would supply the full output you
got from wget that leads you to believe Wget couldn't do this
conversion. In fact, it wouldn't hurt to supply the -d flag as well, for
maximum debugging messages.
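For example, on Windows, something like this should capture everything to
a log file (untested on my end):

  wget -d -P c:\Downloads "ftp://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-mozilla1.8-l10n/" > wget-debug.log 2>&1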

--
Cheers,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Micah Cowan

Tony Lewis wrote:
 Micah Cowan wrote:
 
 Unfortunately, nothing really comes to mind. If you'd like, you
 could file a feature request at 
 https://savannah.gnu.org/bugs/?func=additem&group=wget, for an
 option asking Wget to treat URLs case-insensitively.
 
 To have the effect that Allan seeks, I think the option would have to
 convert all URIs to lower case at an appropriate point in the
 process. I think you probably want to send the original case to the
 server (just in case it really does matter to the server). If you're
 going to treat different case URIs as matching then the lower-case
 version will have to be stored in the hash. The most important part
 (from the perspective that Allan voices) is that the versions written
 to disk use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
real cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook, which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the external command hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-11 Thread Micah Cowan

Hi Allan,

You'll generally get better results if you post to the mailing list
(wget@sunsite.dk). I've added it to the recipients list.

Coombe, Allan David (DPS) wrote:
 Hi Micah,
 
 First some context…
 We are using wget 1.11.3 to mirror a web site so we can do some offline
 processing on it.  The mirror is on a Solaris 10 x86 server.
 
 The problem we are getting appears to be because the URLs in the HTML
 pages that are harvested by wget for downloading have mixed case (the
 site we are mirroring is running on a Windows 2000 server using IIS) and
 the directory structure created on the mirror have 'duplicate'
 directories because of the mixed case.
 
 For example,  the URLs in HTML pages /Senate/committees/index.htm and
 /senate/committees/index.htm refer to the same file but wget creates 2
 different directory structures on the mirror site for these URLs.
 
 This appears to be a fairly basic thing, but we can't see any wget
 options that allow us to treat URLs case insensitively.
 
 We don't really want to post-process the site just to merge the files
 and directories with different case.

Unfortunately, nothing really comes to mind. If you'd like, you could
file a feature request at
https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
asking Wget to treat URLs case-insensitively. Finding local files
case-insensitively, on a case-sensitive filesystem, would be a PITA; but
adding and looking up URLs in the internal blacklist hash wouldn't be
too hard. I probably wouldn't get to that for a while, though.

Another useful option might be to change the name of index files, so
that, for instance, you could have URLs like http://foo/ result in
foo/index.htm or foo/default.html, rather than foo/index.html.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: FW: GNU Coding Standard compliance

2008-06-09 Thread Micah Cowan

Chris,

Wouldn't the Cygwin-specific version be preferable to existing Cygwin
users (which would include me, on occasion)? In particular, things like
the default --restrict=windows setting might be less than desirable. I
imagine most Cygwin users are looking for a Windows Wget that behaves
more like Unix Wget (since Cygwin is essentially Posix for Windows).

There's also the fact that Wget-1.10.2 is already a Cygwin package,
which could do with updating, and is probably important to the
Cygwin set as a whole.

Also, I'm not sure that Eric is subscribed to the ML, so he may not have
gotten your message (I've added him to the recipients).

-Micah

Christopher G. Lewis wrote:
 Eric - 
 
   Why are you trying to package Wget for cygwin when there is 
 a *native* win32 exe?  Seems like a whole *lot* of work for 
 something that really doesn't gain you anything.
 
   I'm quite interested in your response.
 
 Chris
 
 Christopher G. Lewis
 http://www.ChristopherLewis.com
  
 
 -Original Message-
 From: Eric Blake [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, June 04, 2008 7:52 AM
 To: [EMAIL PROTECTED]
 Subject: GNU Coding Standard compliance

 I'm trying to package wget-1.11.3 for cygwin.  But you have several GNU
 Coding Standard compliance problems that are making this task more
 difficult than it should be.  GCS requires that your testsuite be run by
 'make check', but yours is a no-op.  Instead, you provide 'make test',
 but that fails to compile if you use a VPATH build.  And even when using
 an in-tree build, it fails as follows:
 
 ./Test-proxied-https-auth.px && echo && echo
 /bin/sh: ./Test-proxied-https-auth.px: No such file or directory
 
 After commenting that line out, the following tests are also missing:
   ./Test-proxy-auth-basic.px
   ./Test-N-current-HTTP-CD.px
 
 Test-N-HTTP-Content-Disposition.px fails, since it didn't add the
 --content-disposition flag to the wget invocation.
 
 Several Test--spider-* tests fail, because an expected error code of 256
 is impossible (exit status is truncated to 8 bits).
 
 Also, your hand-rolled Makefile.in doesn't support --datarootdir.  I'm not
 sure whether you are interested in migrating to using Automake, which
 would solve a number of these issues; let me know if you would be
 interested in such a patch.
 


--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: FW: GNU Coding Standard compliance

2008-06-09 Thread Micah Cowan

Micah Cowan wrote:
 Also, I'm not sure that Eric is subscribed to the ML, so he may not have
 gotten your message (I've added him to the recipients).

(Obviously, this was rectified, and well before I wrote this. However,
it apparently took four days for either of these messages to be
delivered to sunsite.dk from gnu.org, so I hadn't gotten the fixed
version before I sent this.)

Hm, according to http://dotsrc.org/, their servers were down, so we may
be catching up for a bit here as delayed mails are retried over the next
couple days.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: getpass alternative [Re: getpass documentation]

2008-06-04 Thread Micah Cowan

Matthew Woehlke wrote:
 Micah Cowan wrote:
 Wget just added code to the dev repo that uses gnu_getpass to support
 password prompting. This was mainly because it was quick-and-easy, and
 because gnu_getpass doesn't suffer from many of the serious flaws
 plaguing alternative implementations.
 
 Hehe, earlier today I merged my old, lame-but-functional patch with
 1.11.3  (I've changed systems since last time). Does this mean that when
 fedora picks up 1.12 (after there *is* a 1.12 obviously :-) ) that I
 won't need to roll my own any more? ;-)

That's the plan.

I guess you're quoting from my post to the gnulib list? :)

For those not in the loop, the context of the thread above is discussion
of a more general password-getting solution, for folks that don't need
something that adheres to the (formerly) standard Unix getpass interface.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: GNU Coding Standard compliance

2008-06-04 Thread Micah Cowan

Eric Blake wrote:
 I'm trying to package wget-1.11.3 for cygwin.  But you have several GNU
 Coding Standard compliance problems that are making this task more
 difficult than it should be.  GCS requires that your testsuite be run by
 'make check', but yours is a no-op.   Instead, you provide 'make test',
 but that fails to compile if you use a VPATH build.  And even when using
 an in-tree build, it fails as follows:
 
 ./Test-proxied-https-auth.px && echo && echo
 /bin/sh: ./Test-proxied-https-auth.px: No such file or directory
 
 After commenting that line out, the following tests are also missing:
 
 ./Test-proxy-auth-basic.px
 ./Test-N-current-HTTP-CD.px
 
 Test-N-HTTP-Content-Disposition.px fails, since it didn't add the
 --content-disposition flag to the wget invocation.
 
 Several Test--spider-* tests fail, because an expected error code of 256
 is impossible (exit status is truncated to 8 bits).
 
 Also, your hand-rolled Makefile.in doesn't support --datarootdir.  I'm not
 sure whether you are interested in migrating to using Automake, which
 would solve a number of these issues; let me know if you would be
 interested in such a patch.

We actually have already migrated to Automake in the mainline revision,
which we forked some time ago. 1.11.x development has focused on
important bugfixes only.

The issues with the tests are known, and documented (see tests/README).
They are provided as-is; a work-in-progress, and not really expected
to be terribly useful. I'm actually working on improving this process
right now (and in fact, the current mainline is already much-improved in
this regard, thanks to some recent commits).

In the mainline repository, make check works as expected (modulo some
remaining issues with the tests, such as intermittent failures due to
the fact that all the tests use the same web-server port for testing,
and don't always wait quite long enough for reuse; I'll have that fixed
soon).

I would definitely recommend that make test be abandoned altogether;
alternatively, you could probably modify tests/Makefile.in to match
current mainline, which now runs a run-px script, rather than all
those hideous ./Test-foo.px && echo && echo lines in Makefile.in proper
(the tests from mainline should run fine on 1.11.3, I believe). It would
still need some work, as I mention, to really be reliable, but at least
there aren't glaring issues with broken and missing tests (and it runs
via the expected make target).

Good luck with the packaging.

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: getpass alternative [Re: getpass documentation]

2008-06-04 Thread Micah Cowan

Matthew Woehlke wrote:
 Micah Cowan wrote:
 Matthew Woehlke wrote:
 Micah Cowan wrote:
 Wget just added code to the dev repo that uses gnu_getpass to support
 password prompting. This was mainly because it was quick-and-easy, and
 because gnu_getpass doesn't suffer from many of the serious flaws
 plaguing alternative implementations.
 Hehe, earlier today I merged my old, lame-but-functional patch with
 1.11.3  (I've changed systems since last time). Does this mean that when
 fedora picks up 1.12 (after there *is* a 1.12 obviously :-) ) that I
 won't need to roll my own any more? ;-)

 That's the plan.
 
 Great news, thanks! Now... if Fedora will just pick it up... :-)

I don't see why they wouldn't; they're up-to-date with 1.11.3, as of
today, which seems like a pretty quick pick-up, to me.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: mail-archive.com archive ends at 2008-04-07

2008-06-01 Thread Micah Cowan

Micah Cowan wrote:
 Micah Cowan wrote:
 Ryan Schmidt wrote:
 The wget site [1] lists two sites [2] [3] hosting the mailing list
 archives. The gmane.org archive is current but the mail-archive.com site
 only has messages up through April 7, 2008. Any idea how to get that
 archive up to date again?
 Hm, you're right; I'd been relying mainly on the gmane one, so hadn't
 noticed. I suppose the staff should be contacted about that.
 
 This may be relevant:
 
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg01261.html
 
 Not sure it explains the 1½ months of missing mails, but it sounds like
 maybe I should wait a couple days before worrying about it.

So, it turns out the archive address for mail-archive.com was
unsubscribed from wget@sunsite.dk, due to message bounces. It should be
back on now. I'm not sure about the missing month or so of messages, though.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: About Automated Unit Test for Wget

2008-04-06 Thread Micah Cowan

Micah Cowan wrote:
 Yeah. But we're not doing streaming. And you still haven't given much
 explanation for _why_ it's as hard and time-consuming as you say. Making
 a claim and demonstrating it are different things, I think.

To be clear, I'm not trying to say, I don't believe you; I'm saying,
argue the case, please, don't just make assertions. Clearly, you're
concerned about something I'm unable to see: help me to see it! If I
ignore your warnings, and wind up running headlong into what you saw in
the first place, you can't claim you gave fair warning if you didn't
provide examples of what I might run into.

For my part, I see something which, at least for first cut, I could whip
up in a couple of hours (the server emulation and associated
state-tracking, of course, would be _quite_ a bit more work). What is it
that causes our two perspectives to differ so wildly?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: About Automated Unit Test for Wget

2008-04-05 Thread Micah Cowan

Yoshihiro Tanaka wrote:
 2008/4/5, Yoshihiro Tanaka [EMAIL PROTECTED]:

 Yes, since I want to write a proposal for unit testing, I can't skip this
  problem. But considering the GSoC program is only two months, I'd rather
  narrow down the target - to the gethttp function.

I have a sneaking suspicion that some chunks of functionality that you'd
want to farm out in gethttp, also have code-change repurcussions
elsewhere (probably http_loop usually). So it may be difficult to
restrict yourself to gethttp. :)

Probably better to identify the specific chunks of logic that can be
farmed out, find out how far-reaching separating those chunks might be,
and choose some specific ones to do.

You've already identified some areas; I'll comment on those when I have a
chance to look more closely at the code, for comparison with your remarks.

  In addition to the above, we have to think about abstraction of the
  network API and the file I/O API.

  But the network API (such as fd_read_body, fd_read_hunk) lives in
  retr.c, and the socket is opened in connect.c, so it looks like
  abstracting the network API would require major modification of the
  interfaces.
 
 Or did you mean to write a wget version of the socket interface?
 I.e., to write our own versions of socket, connect, write, read, close,
 bind, listen, accept, ...? Sorry, I'm confused.

Yes! That's what I meant. (Except, we don't need listen, accept; and we
only need bind to support --bind-address. We're a client, not a server. ;) )

It would be enough to write function pointers for, say, wg_socket,
wg_connect, wg_sock_write, wg_sock_read, etc, etc, and point them at
system socket, connect, etc for real Wget, but at wg_test_socket,
wg_test_connect, etc for our emulated servers.

This would mean we'd need to separate uses of read() and write() on
normal files (which should continue to use the real calls, until we
replace them with the file I/O abstractions), from uses of read(),
write(), etc on sockets, which would be using our emulated versions.

Ideally, we'd replace the use of file descriptor ints with a more opaque
mechanism; but that can be done later.

If you'd prefer, you might choose to write a proposal focusing on the
server emulation, which would easily take up a summer of itself (and
then some); particularly when you realize that we would need a file
format describing the virtual server's state (what domains and URLs
exist, what sort of headers it should respond with to certain requests,
etc). If you chose to take that on, you'd probably need to settle for a
subset of the final expected product.

Note that, down the road, we'll want to encapsulate the whole
sockets-layer abstraction into an object we'd pass around as an argument
(struct net_connector * ?), as we might want to use it to handle SOCKS
for some URLs, while using direct connections for others. But that
doesn't have to happen right now; once we've got the actual abstraction
done it should be pretty easy to move it to an object-based mechanism
(just use conn->connect(...) instead of wg_connect(...)). But, if you
want to go ahead and do that now, that'd be great too.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: About Automated Unit Test for Wget

2008-04-05 Thread Micah Cowan

Daniel Stenberg wrote:
 On Sat, 5 Apr 2008, Micah Cowan wrote:
 
 Or did you mean to write a wget version of the socket interface? I.e., to
 write our own versions of socket, connect, write, read, close, bind,
 listen, accept, ...? Sorry, I'm confused.

 Yes! That's what I meant. (Except, we don't need listen, accept; and
 we only need bind to support --bind-address. We're a client, not a
 server. ;) )
 
 Except, you do need listen, accept and bind in a server sense since even
 if wget is a client I believe it still supports the PORT command for ftp...

Damn FTP... :)

Yeah, of course. Sorry, my view of the web tends frequently to be very
HTTP-colored. :)

(Well, technically, that _is_ the WWW, but anyway...)

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: About Automated Unit Test for Wget

2008-04-05 Thread Micah Cowan

Daniel Stenberg wrote:
 In the curl project we took a simpler route: we have our own dumb test
 servers in the test suite to run tests against and we have single files
 that describe each test case: what the server should respond, what the
 protocol dump should look like, what output to expect, what return code,
 etc. Then we have a script that reads the test case description, fires
 up the correct server(s), verifies
 all the outputs (optionally using valgrind).
 
 This system allows us to write unit-tests if we'd like to, but mostly so
 far we've focused to test it system-wide. It is hard enough for us!

Yeah, I thought I'd seen something like that; I was thinking we might
even be able to appropriate some of that, if that looked doable. Except
that I preferred faking the server completely, so I could deal better
with cross-site issues, which AFAICT are significantly more important to
Wget than they are to Curl.

I was thinking, and should have said, that if we go this route, we'd
want to focus on high-level tests first. That also has the advantage
that if we accidentally change something during the refactoring process
(not unlikely), we will notice it, whereas focusing just on unit tests
would mean we'd have to change the code to be testable in units _before_
verification.

We already _do_ have some spawn-a-server tests code, but much of it
needs rewriting, and it still suffers when you bring in the idea of
multiple servers. The servers are driven by Perl code, rather than a
driver script or description file.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: About Automated Unit Test for Wget

2008-04-05 Thread Micah Cowan

Hrvoje Niksic wrote:
 Micah Cowan [EMAIL PROTECTED] writes:
 
 Or did you mean to write a wget version of the socket interface?  I.e., to
 write our own versions of socket, connect, write, read, close, bind,
 listen, accept, ...? Sorry, I'm confused.
 Yes! That's what I meant. (Except, we don't need listen, accept; and
 we only need bind to support --bind-address. We're a client, not a
 server. ;) )

 It would be enough to write function-pointers for (say), wg_socket,
 wg_connect, wg_sock_write, wg_sock_read, etc, etc, and point them at
 system socket, connect, etc for real Wget, but at wg_test_socket,
 wg_test_connect, etc for our emulated servers.
 
 This seems like a neat idea, but it should be carefully weighed
 against the drawbacks.  Adding an ad-hoc abstraction layer is harder
 than it sounds, and has more repercussions than is immediately
 obvious.  An underspecified, unfinished abstraction layer over sockets
 makes the code harder, not easier, to follow and reason about.  You no
 longer deal with BSD sockets, you deal with an abstraction over them.
 Is it okay to call getsockname on such a socket?  How about
 setsockopt?  What about the listen/bind mechanism (which we do need,
 as Daniel points out)?

I'm having some trouble seeing how most of those present problems.
Obviously, you wouldn't call _any_ system functions on these, so yeah,
no setsockopt() unless it's a wg_setsockopt() (a wg_setsockopt would
probably be a poor way to handle it anyway, as it'd be mainly true-TCP
specific).

I don't see what you see wrt making the code harder to follow and reason
about (true abstraction rarely does, AFAICT, though there are some
counter-examples, usually of things that are much, much more abstract
than we are used to thinking about). Did you have some specific concerns?

I _am_ thinking that it'd probably be best to forgo the idea of
one-to-one correspondence of Berkeley sockets, and pass around a struct
net_connector * (and struct net_listener *), so we're not forced to
deal with file descriptor silliness (where obviously we'd have wanted to
avoid the values 0 through 2, and I was even thinking it might
_possibly_ be worthwhile to allocate real file descriptors to get the
numbers, just to avoid clashes). Then we can focus on actual abstraction
(which we don't obtain by emulating Berkeley sockets), rather than just
emulation.

While Daniel was of course right that we'd need listen, accept, etc, we
_wouldn't_ need them to begin using this layer to test against http.c.
We wouldn't even need bind, if we didn't include --bind-address in our
first tests of the http code.

 This would mean we'd need to separate uses of read() and write() on
 normal files (which should continue to use the real calls, until we
 replace them with the file I/O abstractions), from uses of read(),
 write(), etc on sockets, which would be using our emulated versions.
 
 Unless you're willing to spend a lot of time in careful design of
 these abstractions, I think this is a mistake.

Why?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/

