subject:"Re\: Wget"

On Tue, Sep 9, 2008 at 3:14 AM, Daniel Stenberg [EMAIL PROTECTED] wrote:
 On Mon, 8 Sep 2008, Donald Allen wrote:

 The page I get is what would be obtained if an un-logged-in user went to
 the specified url. Opening that same url in Firefox *does* correctly
 indicate that it is logged in as me and reflects my customizations.

 First, LiveHTTPHeaders is the Firefox plugin everyone who tries these stunts
 need. Then you read the capure and replay them as closely as possible using
 your tool.

 As you will find out, sites like this use all sorts of funny tricks to
 figure out you and to make it hard to automate what you're trying to do.
 They tend to use javascripts for redirects and for fiddling with cookies
 just to make sure you have a javascript and cookie enabled browser. So you
 need to work hard(er) when trying this with non-browsers.

 It's certainly still possible, even without using the browser to get the
 first cookie file. But it may take some effort.

I have not been able to retrieve a page with wget as if I were logged
in using --load-cookies and Micah's suggestion about 'Accept-Encoding'
(there was a typo in his message -- it's 'Accept-Encoding', not
'Accept-Encodings'). I did install livehttpheaders and tried
--no-cookies and --header cookie info from livehttpheaders and that
did work. Some of the cookie info sent by Firefox was a mystery,
because it's not in the cookie file. Perhaps that's the crucial
difference -- I'm speculating that wget isn't sending quite the same
thing as Firefox when --load-cookies is used, because Firefox is
adding stuff that isn't in the cookie file. Just a guess. Is there a
way to ask wget to print the headers it sends (ala livehttpheaders)?
I've looked through the options on the man page and didn't see
anything, though I might have missed it.


 --

  / daniel.haxx.se

Re: Wget and Yahoo login?

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Donald Allen wrote:
On Tue, Sep 9, 2008 at 3:14 AM, Daniel Stenberg [EMAIL PROTECTED] wrote:
On Mon, 8 Sep 2008, Donald Allen wrote:

The page I get is what would be obtained if an un-logged-in user went to
the specified url. Opening that same url in Firefox *does* correctly
indicate that it is logged in as me and reflects my customizations.
First, LiveHTTPHeaders is the Firefox plugin everyone who tries these stunts
need. Then you read the capure and replay them as closely as possible using
your tool.

As you will find out, sites like this use all sorts of funny tricks to
figure out you and to make it hard to automate what you're trying to do.
They tend to use javascripts for redirects and for fiddling with cookies
just to make sure you have a javascript and cookie enabled browser. So you
need to work hard(er) when trying this with non-browsers.

It's certainly still possible, even without using the browser to get the
first cookie file. But it may take some effort.

I have not been able to retrieve a page with wget as if I were logged
in using --load-cookies and Micah's suggestion about 'Accept-Encoding'
(there was a typo in his message -- it's 'Accept-Encoding', not
'Accept-Encodings'). I did install livehttpheaders and tried
--no-cookies and --header cookie info from livehttpheaders and that
did work.

That's how I did it as well (except I got the headers from tcpdump); I'm
using Firefox 3, so don't have access to FF's new sqllite-based cookies
file (apart from the patch at
http://wget.addictivecode.org/FrontPage?action=AttachFiledo=viewtarget=wget-firefox3-cookie.patch).

Some of the cookie info sent by Firefox was a mystery,
because it's not in the cookie file. Perhaps that's the crucial
difference -- I'm speculating that wget isn't sending quite the same
thing as Firefox when --load-cookies is used, because Firefox is
adding stuff that isn't in the cookie file. Just a guess.

Probably there are session cookies involved, that are sent in the first
page, that you're not sending back with the form submit.
- --keep-session-cookies and --save-cookies=foo.txt make a good
combination.

Is there a
way to ask wget to print the headers it sends (ala livehttpheaders)?
I've looked through the options on the man page and didn't see
anything, though I might have missed it.

- --debug

- --
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIxqL77M8hyUobTrERAovFAJ9yagS2xW+2wFG65BwiFkJNfTMylgCfYaq7
1vOmTDimFg8E7Cn+Q+HGZn8=
=JKXH
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

On Tue, Sep 9, 2008 at 12:23 PM, Micah Cowan [EMAIL PROTECTED] wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Donald Allen wrote:
On Tue, Sep 9, 2008 at 3:14 AM, Daniel Stenberg [EMAIL PROTECTED] wrote:
On Mon, 8 Sep 2008, Donald Allen wrote:

The page I get is what would be obtained if an un-logged-in user went
to
the specified url. Opening that same url in Firefox *does* correctly
indicate that it is logged in as me and reflects my customizations.
First, LiveHTTPHeaders is the Firefox plugin everyone who tries these
stunts
need. Then you read the capure and replay them as closely as possible
using
your tool.

As you will find out, sites like this use all sorts of funny tricks to
figure out you and to make it hard to automate what you're trying to do.
They tend to use javascripts for redirects and for fiddling with cookies
just to make sure you have a javascript and cookie enabled browser. So
you
need to work hard(er) when trying this with non-browsers.

It's certainly still possible, even without using the browser to get the
first cookie file. But it may take some effort.

That's how I did it as well (except I got the headers from tcpdump); I'm
using Firefox 3, so don't have access to FF's new sqllite-based cookies
file (apart from the patch at

http://wget.addictivecode.org/FrontPage?action=AttachFiledo=viewtarget=wget-firefox3-cookie.patch
).

Is there a
way to ask wget to print the headers it sends (ala livehttpheaders)?
I've looked through the options on the man page and didn't see
anything, though I might have missed it.

- --debug

Well, I rebuilt my wget with the 'debug' use flag and ran it on the yahoo
test page (after having logged in to yahoo with firefox, of course) with
--load-cookies and the accept-encoding header item, with --debug. Very
useful. wget is sending every cookie item in firefox's cookies.txt. But
firefox sends three additional cookie items in the header that wget does not
send. Those items are *not* in firefox's cookies.txt so wget has no way of
knowing about them. Is it possible that firefox is not writing session
cookies to the file?

The result of this test, just to be clear, was a page that indicated yahoo
thought I was not logged in. Those extra items firefox is sending appear to
be the difference, because I included them (from the livehttpheaders output)
when I tried sending the cookies manually with --header, I got the same page
back with wget that indicated that yahoo knew I was logged in and formatted
with page with my preferences.

/Don

iD8DBQFIxqL77M8hyUobTrERAovFAJ9yagS2xW+2wFG65BwiFkJNfTMylgCfYaq7
1vOmTDimFg8E7Cn+Q+HGZn8=
=JKXH
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Donald Allen wrote:
 The result of this test, just to be clear, was a page that indicated
 yahoo thought I was not logged in. Those extra items firefox is sending
 appear to be the difference, because I included them (from the
 livehttpheaders output) when I tried sending the cookies manually with
 --header, I got the same page back with wget that indicated that yahoo
 knew I was logged in and formatted with page with my preferences.

Perhaps you missed this in my last message:

 Probably there are session cookies involved, that are sent in the first
 page, that you're not sending back with the form submit.
 --keep-session-cookies and --save-cookies=foo.txt make a good
 combination.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIxrJ17M8hyUobTrERAvdsAJ9XEwMfimHXRUXKtV66P+YsG+tA7gCfWKbq
nCqAmXJfU3kTncMQkKk0JZo=
=17Yr
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

On Tue, Sep 9, 2008 at 1:29 PM, Micah Cowan [EMAIL PROTECTED] wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Donald Allen wrote:
  The result of this test, just to be clear, was a page that indicated
  yahoo thought I was not logged in. Those extra items firefox is sending
  appear to be the difference, because I included them (from the
  livehttpheaders output) when I tried sending the cookies manually with
  --header, I got the same page back with wget that indicated that yahoo
  knew I was logged in and formatted with page with my preferences.

 Perhaps you missed this in my last message:

  Probably there are session cookies involved, that are sent in the first
  page, that you're not sending back with the form submit.
  --keep-session-cookies and --save-cookies=foo.txt make a good
  combination.


I think we're mis-communicating, easily my fault, since I know just enough
about this stuff to be dangerous.

I am doing the yahoo session login with firefox, not with wget, so I'm using
the first and easier of your two suggested methods. I'm guessing you are
thinking that I'm trying to login to the yahoo session with wget, and thus
--keep-session-cookies and --save-cookies=foo.txt would make perfect sense
to me, but that's not what I'm doing (yet -- if I'm right about what's
happening here, I'm going to have to resort to this). But using firefox to
initiate the session, it looks to me like wget never gets to see the session
cookies because I don't think firefox writes them to its cookie file (which
actually makes sense -- if they only need to live as long as the session,
why write them out?).

/Don





 - --
 Micah J. Cowan
 Programmer, musician, typesetting enthusiast, gamer.
 GNU Maintainer: wget, screen, teseq
 http://micah.cowan.name/
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFIxrJ17M8hyUobTrERAvdsAJ9XEwMfimHXRUXKtV66P+YsG+tA7gCfWKbq
 nCqAmXJfU3kTncMQkKk0JZo=
 =17Yr
 -END PGP SIGNATURE-

Re: Wget and Yahoo login?

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Donald Allen wrote:
 I am doing the yahoo session login with firefox, not with wget, so I'm
 using the first and easier of your two suggested methods. I'm guessing
 you are thinking that I'm trying to login to the yahoo session with
 wget, and thus --keep-session-cookies and --save-cookies=foo.txt would
 make perfect sense to me, but that's not what I'm doing (yet -- if I'm
 right about what's happening here, I'm going to have to resort to this).
 But using firefox to initiate the session, it looks to me like wget
 never gets to see the session cookies because I don't think firefox
 writes them to its cookie file (which actually makes sense -- if they
 only need to live as long as the session, why write them out?).

Yes, and I understood this; the thing is, that if session cookies are
involved (i.e., cookies that are marked for immediate expiration and are
not meant to be saved to the cookies file), then I don't see how you
have much choice other than to use the harder method, or else to fake
the session cookies by manually inserting them to your cookies file or
whatnot (not sure how well that may be expected to work). Or, yeah, add
an explicit --header 'Cookie: ...'.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIxrVD7M8hyUobTrERAt19AJ9bmmczCKjzMtGCoXb8B5g25uMLRQCeK8qh
M57W3Reqj+/pO8GuDwb9Nok=
=ajp/
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED] wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Donald Allen wrote:
  I am doing the yahoo session login with firefox, not with wget, so I'm
  using the first and easier of your two suggested methods. I'm guessing
  you are thinking that I'm trying to login to the yahoo session with
  wget, and thus --keep-session-cookies and --save-cookies=foo.txt would
  make perfect sense to me, but that's not what I'm doing (yet -- if I'm
  right about what's happening here, I'm going to have to resort to this).
  But using firefox to initiate the session, it looks to me like wget
  never gets to see the session cookies because I don't think firefox
  writes them to its cookie file (which actually makes sense -- if they
  only need to live as long as the session, why write them out?).

 Yes, and I understood this; the thing is, that if session cookies are
 involved (i.e., cookies that are marked for immediate expiration and are
 not meant to be saved to the cookies file), then I don't see how you
 have much choice other than to use the harder method, or else to fake
 the session cookies by manually inserting them to your cookies file or
 whatnot (not sure how well that may be expected to work). Or, yeah, add
 an explicit --header 'Cookie: ...'.


Ah, the misunderstanding was that the stuff you thought I missed was
intended to push me in the direction of Plan B -- log in to yahoo with wget.
I understand now. I'll look at trying to make this work. Thanks for all the
help, though I can't guarantee that you are done yet :-) But, hopefully,
this exchange will benefit others.

/Don



 - --
 Micah J. Cowan
 Programmer, musician, typesetting enthusiast, gamer.
 GNU Maintainer: wget, screen, teseq
 http://micah.cowan.name/
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFIxrVD7M8hyUobTrERAt19AJ9bmmczCKjzMtGCoXb8B5g25uMLRQCeK8qh
 M57W3Reqj+/pO8GuDwb9Nok=
 =ajp/
 -END PGP SIGNATURE-

Re: Wget and Yahoo login?

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Donald Allen wrote:
 
 
 On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED]
 mailto:[EMAIL PROTECTED] wrote:
 
 Donald Allen wrote:
 I am doing the yahoo session login with firefox, not with wget,
 so I'm
 using the first and easier of your two suggested methods. I'm
 guessing
 you are thinking that I'm trying to login to the yahoo session with
 wget, and thus --keep-session-cookies and
 --save-cookies=foo.txt would
 make perfect sense to me, but that's not what I'm doing (yet --
 if I'm
 right about what's happening here, I'm going to have to resort to
 this).
 But using firefox to initiate the session, it looks to me like wget
 never gets to see the session cookies because I don't think firefox
 writes them to its cookie file (which actually makes sense -- if they
 only need to live as long as the session, why write them out?).
 
 Yes, and I understood this; the thing is, that if session cookies are
 involved (i.e., cookies that are marked for immediate expiration and are
 not meant to be saved to the cookies file), then I don't see how you
 have much choice other than to use the harder method, or else to fake
 the session cookies by manually inserting them to your cookies file or
 whatnot (not sure how well that may be expected to work). Or, yeah, add
 an explicit --header 'Cookie: ...'.
 
 
 Ah, the misunderstanding was that the stuff you thought I missed was
 intended to push me in the direction of Plan B -- log in to yahoo with
 wget.

Yes; and that's entirely my fault, as I didn't explicitly say that.

 I understand now. I'll look at trying to make this work. Thanks
 for all the help, though I can't guarantee that you are done yet :-)
 But, hopefully, this exchange will benefit others.

I was actually surprised you kept going after I pointed out that it
required the Accept-Encoding header that results in gzipped content.
This behavior is a little surprising to me from Yahoo!. It's not
surprising in _general_, but for a site that really wants to be as
accessible as possible (I would think?), insisting on the latest
browsers seems ill-advised.

Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape,
visit a site, and get a server-generated page that's empty other than
the phrase You're not using Internet Explorer. :p

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIxreZ7M8hyUobTrERAslyAJwKfirhzth9ACgdunxp/rfQlR86mQCcClik
3HbbATyqnrm0hAJXqNTqpl4=
=3XD/
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

On Tue, Sep 9, 2008 at 1:51 PM, Micah Cowan [EMAIL PROTECTED] wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Donald Allen wrote:
 
 
  On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED]
  mailto:[EMAIL PROTECTED] wrote:
 
  Donald Allen wrote:
  I am doing the yahoo session login with firefox, not with wget,
  so I'm
  using the first and easier of your two suggested methods. I'm
  guessing
  you are thinking that I'm trying to login to the yahoo session with
  wget, and thus --keep-session-cookies and
  --save-cookies=foo.txt would
  make perfect sense to me, but that's not what I'm doing (yet --
  if I'm
  right about what's happening here, I'm going to have to resort to
  this).
  But using firefox to initiate the session, it looks to me like wget
  never gets to see the session cookies because I don't think firefox
  writes them to its cookie file (which actually makes sense -- if they
  only need to live as long as the session, why write them out?).
 
  Yes, and I understood this; the thing is, that if session cookies are
  involved (i.e., cookies that are marked for immediate expiration and are
  not meant to be saved to the cookies file), then I don't see how you
  have much choice other than to use the harder method, or else to fake
  the session cookies by manually inserting them to your cookies file or
  whatnot (not sure how well that may be expected to work). Or, yeah, add
  an explicit --header 'Cookie: ...'.
 
 
  Ah, the misunderstanding was that the stuff you thought I missed was
  intended to push me in the direction of Plan B -- log in to yahoo with
  wget.

 Yes; and that's entirely my fault, as I didn't explicitly say that.


No problem.



  I understand now. I'll look at trying to make this work. Thanks
  for all the help, though I can't guarantee that you are done yet :-)
  But, hopefully, this exchange will benefit others.

 I was actually surprised you kept going after I pointed out that it
 required the Accept-Encoding header that results in gzipped content.


That didn't faze me because the pages I'm after will be processed by a
python program, so having to gunzip would not require a manual step.


 This behavior is a little surprising to me from Yahoo!. It's not
 surprising in _general_, but for a site that really wants to be as
 accessible as possible (I would think?), insisting on the latest
 browsers seems ill-advised.

 Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape,
 visit a site, and get a server-generated page that's empty other than
 the phrase You're not using Internet Explorer. :p


And taking it one step further, I'm greatly enjoying watching Microsoft
thrash around, trying to save themselves, which I don't think they will.
Perhaps they'll re-invent themselves, as IBM did, but their cash cow is not
going to produce milk too much longer. I've just installed the Chrome beta
on the Windows side of one of my machines (I grudgingly give it 10 Gb on
each machine; Linux gets the rest), and it looks very, very nice. They've
still got work to do, but they appear to be heading in a very good
direction. These are smart people at Google. All signs seem to be pointing
towards more and more computing happening on the server side in the coming
years.

/Don




 - --
 Micah J. Cowan
 Programmer, musician, typesetting enthusiast, gamer.
 GNU Maintainer: wget, screen, teseq
 http://micah.cowan.name/
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFIxreZ7M8hyUobTrERAslyAJwKfirhzth9ACgdunxp/rfQlR86mQCcClik
 3HbbATyqnrm0hAJXqNTqpl4=
 =3XD/
 -END PGP SIGNATURE-

Re: Wget and Yahoo login?

After surprisingly little struggle, I got Plan B working -- logged into
yahoo with wget, saved the cookies, including session cookies, and then
proceeded to fetch pages using the saved cookies. Those pages came back
logged in as me, with my customizations. Thanks to Tony, Daniel, and Micah
-- you all provided critical advice in solving this problem.

/Don

On Tue, Sep 9, 2008 at 2:21 PM, Donald Allen [EMAIL PROTECTED] wrote:



 On Tue, Sep 9, 2008 at 1:51 PM, Micah Cowan [EMAIL PROTECTED] wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Donald Allen wrote:
 
 
  On Tue, Sep 9, 2008 at 1:41 PM, Micah Cowan [EMAIL PROTECTED]
  mailto:[EMAIL PROTECTED] wrote:
 
  Donald Allen wrote:
  I am doing the yahoo session login with firefox, not with wget,
  so I'm
  using the first and easier of your two suggested methods. I'm
  guessing
  you are thinking that I'm trying to login to the yahoo session with
  wget, and thus --keep-session-cookies and
  --save-cookies=foo.txt would
  make perfect sense to me, but that's not what I'm doing (yet --
  if I'm
  right about what's happening here, I'm going to have to resort to
  this).
  But using firefox to initiate the session, it looks to me like wget
  never gets to see the session cookies because I don't think firefox
  writes them to its cookie file (which actually makes sense -- if they
  only need to live as long as the session, why write them out?).
 
  Yes, and I understood this; the thing is, that if session cookies are
  involved (i.e., cookies that are marked for immediate expiration and are
  not meant to be saved to the cookies file), then I don't see how you
  have much choice other than to use the harder method, or else to fake
  the session cookies by manually inserting them to your cookies file or
  whatnot (not sure how well that may be expected to work). Or, yeah, add
  an explicit --header 'Cookie: ...'.
 
 
  Ah, the misunderstanding was that the stuff you thought I missed was
  intended to push me in the direction of Plan B -- log in to yahoo with
  wget.

 Yes; and that's entirely my fault, as I didn't explicitly say that.


 No problem.



  I understand now. I'll look at trying to make this work. Thanks
  for all the help, though I can't guarantee that you are done yet :-)
  But, hopefully, this exchange will benefit others.

 I was actually surprised you kept going after I pointed out that it
 required the Accept-Encoding header that results in gzipped content.


 That didn't faze me because the pages I'm after will be processed by a
 python program, so having to gunzip would not require a manual step.


 This behavior is a little surprising to me from Yahoo!. It's not
 surprising in _general_, but for a site that really wants to be as
 accessible as possible (I would think?), insisting on the latest
 browsers seems ill-advised.

 Ah, well. At least the days are _mostly_ gone when I'd fire up Netscape,
 visit a site, and get a server-generated page that's empty other than
 the phrase You're not using Internet Explorer. :p


 And taking it one step further, I'm greatly enjoying watching Microsoft
 thrash around, trying to save themselves, which I don't think they will.
 Perhaps they'll re-invent themselves, as IBM did, but their cash cow is not
 going to produce milk too much longer. I've just installed the Chrome beta
 on the Windows side of one of my machines (I grudgingly give it 10 Gb on
 each machine; Linux gets the rest), and it looks very, very nice. They've
 still got work to do, but they appear to be heading in a very good
 direction. These are smart people at Google. All signs seem to be pointing
 towards more and more computing happening on the server side in the coming
 years.

 /Don




 - --
 Micah J. Cowan
 Programmer, musician, typesetting enthusiast, gamer.
 GNU Maintainer: wget, screen, teseq
 http://micah.cowan.name/
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFIxreZ7M8hyUobTrERAslyAJwKfirhzth9ACgdunxp/rfQlR86mQCcClik
 3HbbATyqnrm0hAJXqNTqpl4=
 =3XD/
 -END PGP SIGNATURE-

Re: Wget and Yahoo login?

2008-09-08 Thread Donald Allen

2008/9/8 Tony Godshall [EMAIL PROTECTED]:
 I haven't done this but I can speculate that you need to
 have wget identify itself as firefox.

When I read this, I thought it looked promising, but it doesn't work.
I tried sending exactly the user-agent string firefox is sending and
still got a page from yahoo that clearly indicates yahoo thinks I'm
not logged in.

/Don


 Quote from man wget...

   -U agent-string
   --user-agent=agent-string
   Identify as agent-string to the HTTP server.

   The HTTP protocol allows the clients to identify themselves
 using a User-Agent header field.  This enables distinguishing the
 WWW software,
   usually for statistical purposes or for tracing of protocol
 violations.  Wget normally identifies as Wget/version, version being
 the current ver‐
   sion number of Wget.

   However, some sites have been known to impose the policy of
 tailoring the output according to the User-Agent-supplied
 information.  While this
   is not such a bad idea in theory, it has been abused by
 servers denying information to clients other than (historically)
 Netscape or, more fre‐
   quently, Microsoft Internet Explorer.  This option allows
 you to change the User-Agent line issued by Wget.  Use of this
 option is discouraged,
   unless you really know what you are doing.


 On Mon, Sep 8, 2008 at 12:25 PM, Donald Allen [EMAIL PROTECTED] wrote:
 There was a recent discussion concerning using wget to obtain pages
 from yahoo logged into yahoo as a particular user. Micah replied to
 Rick Nakroshis with instructions describing two methods for doing
 this. This information has also been added by Micah to the wiki.

 I just tried the simpler of the two methods -- logging into yahoo with
 my browser (Firefox 2.0.0.16) and then downloading a page with

 wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies my home
 directory/.mozilla/firefox/id2dmo7r.default/cookies.txt
 'http://yahoo url'

 The page I get is what would be obtained if an un-logged-in user went
 to the specified url. Opening that same url in Firefox *does*
 correctly indicate that it is logged in as me and reflects my
 customizations.

 wget -V:
 GNU Wget 1.11.1

 I am running a reasonably up-to-date Gentoo system (updated within the
 last month) on a Thinkpad X61.

 Have I missed something here? Any help will be appreciated. Please
 include my personal address in your replies as I am not (yet) a
 subscriber to this list.

 Thanks --
 /Don Allen




 --
 Best Regards.
 Please keep in touch.
 This is unedited.
 P-)

Re: Wget and Yahoo login?

2008-09-08 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Donald Allen wrote:
 There was a recent discussion concerning using wget to obtain pages
 from yahoo logged into yahoo as a particular user. Micah replied to
 Rick Nakroshis with instructions describing two methods for doing
 this. This information has also been added by Micah to the wiki.
 
 I just tried the simpler of the two methods -- logging into yahoo with
 my browser (Firefox 2.0.0.16) and then downloading a page with
 
 wget --output-document=/tmp/yahoo/yahoo.htm --load-cookies my home
 directory/.mozilla/firefox/id2dmo7r.default/cookies.txt
 'http://yahoo url'
 
 The page I get is what would be obtained if an un-logged-in user went
 to the specified url. Opening that same url in Firefox *does*
 correctly indicate that it is logged in as me and reflects my
 customizations.

Are you signing into the main Yahoo! site?

When I try to do so, whether I use the cookies or no, I get a message
about update your browser to something more modern or the like. The
difference appears to be a combination of _both_ User-Agent (as you've
done), _and_ --header Accept-Encodings: gzip,deflate. This plus
appropriate cookies gets me a decent logged-in page, but of course it's
gzip-compressed.

Since Wget doesn't currently support gzip-decoding and the like, that
makes the use of Wget in this situation cumbersome. Support for
something like this probably won't be seen until 1.13 or 1.14, I'm afraid.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIxdw77M8hyUobTrERAi/QAJ0atPMeUQ/0YCNwAP+XiH4nDyvclwCcDxYo
obud0CjpATBYDvA0eS3ZHGY=
=vv4R
-END PGP SIGNATURE-

Re: [wget-notify] add a new option

2008-09-02 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

houda hocine wrote:
  Hi,

Hi houda.

This message was sent to the wget-notify, which was not the proper
forum. Wget-notify is reserved for bug-change and (previously) commit
notifications, and is not intended for discussion (though I obviously
haven't blocked discussions; the original intent was to be able to
discuss commits, but I'm not sure I need to allow discussions any more,
so it may be disallowed soon).

The appropriate list would be wget@sunsite.dk, to which this discussion
has been redirected.

 we create a new format for archiviving (. warc), and we want to ensure
 that wget generate directly this format from the input url .
 You can help me by some ideas  to achieve this new option?
 The format is (warc -wget url)
 I am in the process of trying to understand the source code to add this
 new option.  Which .c  file fallows me to do this?

Doing this is not likely to be a trivial undertaking: the current
file-output interface isn't really abstracted enough to allow this, so
basically you'll need to modify most of the existing .c files. We are
hoping at some future point to allow for a more generic output format,
for direct output to (for instance) tarballs and .mhtml archives. At
that point, it'd probably be fairly easy to write extensions to do what
you want.

In the meantime, though, it'll be a pain in the butt. I can't really
offer much help; the best way to understand the source is to read and
explore it. However, on the general topic of adding new options to Wget,
Tony Lewis has written the excellent guide at
http://wget.addictivecode.org/OptionsHowto. Hope that helps!

Please note that I won't likely be entertaining patches to Wget to make
it output to non-mainstream archive formats, and even once generic
output mechanisms are supported, the mainstream archive formats will
most likely be supported as extension plugins or similar, and not as
built-in support within Wget.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIvbyf7M8hyUobTrERApl8AJwNvWOdDd0Z//wbNzN/jyZFqKI5iQCfQOx4
3zlxPGaVqjsPhwa7ZwB4wrs=
=Zy+N
-END PGP SIGNATURE-

Re: Wget function

2008-08-25 Thread karlito

 Hello,

 First of all i would thank you for your great tool

 I have a request

 i use this function to save url with absolute link so it's very good

 wget -k http://www.google.fr/

 but i want to save this file as other name than index.html like for
 example  google-is-good.html

 i have try this

 wget -k –output-document=google-is-good.html http://www.google.fr/

 is work except i lost absolute link and it's terrible

 i don't know how to fix this problem wich combinaison i have to made for
 use wget - k  with another name ??

 can you help me i don't find the solution  also where i can find last
 verssion for windows

 thank you for your time


 regard carlos .

Re: Wget function

2008-08-25 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

karlito wrote:
 
 
 Hello,
  
 First of all i would thank you for your great tool
  
 I have a request
  
 i use this function to save url with absolute link so it's very good
  
 wget -k http://www.google.fr/
  
 but i want to save this file as other name than index.html like for
 example  google-is-good.html
  
 i have try this
  
 wget -k –output-document=google-is-good.html http://www.google.fr/
  
 is work except i lost absolute link and it's terrible

Yeah. Conversions won't work with --output-document, which behaves
rather like a shell redirection.

 i don't know how to fix this problem wich combinaison i have to made
 for use wget - k  with another name ??

You could always rename it afterwards.

In your specific case, the current development sources (which will
become Wget 1.12) have a --default-page=google-is-good.html option for
specifying the default page name, thanks to Joao Ferreira. It's not yet
available in any release.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIsv3N7M8hyUobTrERAskoAJ4lHZK+VEBWYuFzOtbd57wEEvYm0wCdEVSK
el6v3e0TkKpQtOG2b5ZiHcI=
=/+sB
-END PGP SIGNATURE-

Re: WGET :: [Correction de texte]

2008-08-25 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tom wrote:
 Téléchargement récursif:
   -r,  --recursive  spécifer un téléchargement récursif.
   -l,  --level=NOMBRE   _*profondeeur*_ maximale de récursion (inf
 ou 0 pour infini).
 
 Juste un e à enlever de profondeeur, et ca sera réglé !

This issue appears to have been fixed with the latest French
translation. It will be released with Wget 1.12.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIswBE7M8hyUobTrERAufeAKCIl4ghMvo2JolNfsSAYCTd92v9OwCfS89O
iT3urRXKctZuucXnOn9tGLc=
=v5SC
-END PGP SIGNATURE-

Re: Wget function

2008-08-25 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Please keep the list in the replies.

karlito wrote:
 hi thank you for the reply my problem can be fixed on the next  verssion ?
  
 because it's for batch
  
 i have more 1000 url to made so is that why i need to find a solution
  
 also when you mean rename
  
 what is the function to rename with wget ?

I mean, just use the mv or rename command on your operating system.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIswfR7M8hyUobTrERAubkAJ0VL2UPnNQtD27waPVwFkeUwbUp9wCfXerh
dZBr4e7ZBKcEE5Kzrjv1mi8=
=GoKL
-END PGP SIGNATURE-

Re: wget and wiki crawling

2008-08-22 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

asm c wrote:
 I've recently been using wget, and got it working for the most part, but
 there's one issue that's really been bugging me. One of the parameters I
 use is '-R *action=*,*oldid=*' (side note on the platform: ZSH on
 NetBSD on the SDF public access unix system, although I've also used it
 on windows with the same result). The purpose of this parameter is so
 that, when wget crawls a mid-sized wiki I'd like to have a local copy
 of, it doesn't bother with all the history pages, edit pages, and so
 forth. Not downloading these would save me an enormous amount of time.
 Unfortunately, the parameter is ignored until after the php page is
 downloaded. So, because it waits until it's downloaded to delete it,
 using the param doesn't really help at all.
 
 Does anyone know how I can stop wget from even downloading matching pages?

Well, you don't mention it, but I'll assume that those patterns occur in
the query string portion of the URL: that is, they follow a question
mark (?) that appears at some point.

Unfortunately, the -R and -A options only apply to the filename
portion of the URL: that is, whatever falls between the first question
mark, and the first preceding slash (/). Confusingly, it is also then
applied _after_ files are downloaded, to determine whether they should
be deleted after the fact: so Wget probably downloads those files you
really wish it wouldn't, and then deletes them afterwards anyway.

Worse, there's no way around this, currently. This is part of a suite of
problems that are currently slated to be addressed soon. The most
pertinent to your problem, though, is the need for a way to match
against query strings. I'm very much hoping to get around to this before
the next major Wget release, version 1.12. It's being tracked here:

https://savannah.gnu.org/bugs/index.php?22089

If you add yourself to the Cc list, you'll be able to follow along on
its progress.

- --
Cheers!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIr55d7M8hyUobTrERAu4KAJsHmDTZ46ioEGOTprdE/aTGrj853QCfet84
+c+npJnPwC/86/rLpn5rB8s=
=abdv
-END PGP SIGNATURE-

RE: Wget and Yahoo login?

2008-08-21 Thread Tony Lewis

Micah Cowan wrote:

 The easiest way to do what you want may be to log in using your browser,
 and then tell Wget to use the cookies from your browser, using

Given the frequency of the login and then download a file use case , it
should probably be documented on the wiki. (Perhaps it already is. :-)

Also, it would probably be helpful to have a shell script to automate this.

Tony

Re: Wget and Yahoo login?

2008-08-21 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
 Micah Cowan wrote:
 
 The easiest way to do what you want may be to log in using your browser,
 and then tell Wget to use the cookies from your browser, using
 
 Given the frequency of the login and then download a file use case , it
 should probably be documented on the wiki. (Perhaps it already is. :-)

Yeah, at
http://wget.addictivecode.org/FrequentlyAskedQuestions#password-protected

I think you missed the final sentence of my how-to:

 (I'm going to put this up on the Wgiki Faq now, at
 http://wget.addictivecode.org/FrequentlyAskedQuestions)

:)

(Back to you:)
 Also, it would probably be helpful to have a shell script to automate this.

I filed the following issue some time ago:
https://savannah.gnu.org/bugs/index.php?22561

The report is low on details; but I was envisioning something that would
spew out forms and their fields, accept values for fields in one form,
and invoke the appropriate Wget command to do the submission.

I don't know if it could be _completely_ automated, since it's not 100%
possible for the script to know which form fields are the ones it should
be filling out.

OTOH, there are some damn good heuristics that could be done: I imagine
that the right form (in the event of more than one) can usually be
guessed by seeing which one has a password-type input (assuming
there's also only one of those). If that form has only one text-type
input, then we've found the username field as well. Name-based
heuristics (with pass, user, uname, login, etc) could also help.

If someone wants to do this, that'd be terrific. Could probably reuse
the existing HTML parser code from Wget. Otherwise, it'd probably be a
while before I could get to it, since I've got higher priorities that
have been languishing.

Such a tool might also be an appropriate place to add FF3 sqllite
cookies support.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIrb0s7M8hyUobTrERAlVXAJ9YnAM7JiQrxrB/KclA1FXDnoVswgCdGO7t
Vaa98nhNRuEY4aLMx2BFXm0=
=ScoA
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

2008-08-11 Thread Rick Nakroshis


At 04:27 PM 8/10/2008, you wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Rick Nakroshis wrote:
 Micah,

 If you will excuse a quick question about Wget, I'm trying to find out
 if I can use it to download a page from Yahoo that requires me to be
 logged in using my Yahoo profile name and password.  It's a display of a
 CSV file, and the only wrinkle is trying to get past the Yahoo login.

 Try as I may, I just can't seem to find anything about Wget and Yahoo.
 Any suggestions or pointers?

Hi Rick,

In the future, it's better if you post questions to the mailing list at
wget@sunsite.dk; I don't always have time to respond.

The easiest way to do what you want may be to log in using your browser,
and then tell Wget to use the cookies from your browser, using
- --load-cookies=path-to-browser's-cookies. Of course, this only works
if your browser saves its cookies in the standard text format (Firefox
prior to version 3 will do this), or can export to that format (note
that someone contributed a patch to allow Wget to work with Firefox 3
cookies; it's linked from http://wget.addictivecode.org/, it's
unoffocial so I can't vouch for its quality).

Otherwise, you can perform the login using Wget, saving the cookies to a
file of your choice, using --post-data=..., --save-cookies=cookies.txt,
and probably --keep-session-cookies. This will require that you know
what data to place in --post-data, which generally requires that you dig
around in the HTML to find the right form field names, and where to post
them.

For instance, if you find a form like the following within the page
containing the log-in form:

form action=/doLogin.php method=POST
  input type=text name=s-login
  input type=password name=s-pass
/form

then you need to do something like:

  $ wget --post-data='s-login=USERNAMEs-pass=PASSWORD' \
--save-cookies=my-cookies.txt --keep-session-cookies \
http://HOSTNAME/doLogin.php

(Note that you _don't_ necessarily send the information to the page that
 had the login page: you send it to the spot mentioned in the action
attribute of the password form.)

Once this is done, you _should_ be able to perform further operations
with Wget as if you're logged in, by using

  $ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
--keep-session-cookies ...

(I'm going to put this up on the Wgiki Faq now, at
http://wget.addictivecode.org/FrequentlyAskedQuestions)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIn09A7M8hyUobTrERAu04AJ9EgRoBBhvNCDwOt87f91p+HpWktACdFgMM
KEfliBtfrPBbh/XdvusEPiw=
=qlGZ
-END PGP SIGNATURE-



Micah,

Thank you for taking the time to answer so thoroughly, and doing so 
promptly, too.  You've given me a great boost forward, and I appreciate it.


Thank you, sir!


Rick

Re: WGET :: [Correction de texte]

2008-08-11 Thread Julien

Bonjour Tom,

Merci de cette information.
Mais pourrais tu nous préciser de quelle version de Wget il s'agit?
Tu peux obtenir cette information avec: wget --version
Je te recommande la dernière version de Wget, disponible ici:
http://wget.addictivecode.org/FrequentlyAskedQuestions#download
Aussi, la langue de cette liste de diffusion est l'Anglais.

Merci,
Julien.


Hi Tom,

Thanks for this information.
But, could you tell us what version of Wget are you using?
You can see that using: wget --version
I advise you to try the last version, available here:
http://wget.addictivecode.org/FrequentlyAskedQuestions#download
Moreover, the language of this mailing list is English.

Thanks,
Julien.

2008/8/11 Tom [EMAIL PROTECTED]:
 Bonjour !

 Je souhaite vous informer d'une touche restée appuyée un quart de seconde
 trop longtemps semble-t-il !

 Dans l'aide de Wget (wget --help), nous trouvons en effet :


 Téléchargement récursif:
   -r,  --recursive  spécifer un téléchargement récursif.
   -l,  --level=NOMBRE   profondeeur maximale de récursion (inf ou 0 pour
 infini).



 Juste un e à enlever de profondeeur, et ca sera réglé !

 Comme il était indiqué Transmettre toutes anomalies ou suggestions à
 [EMAIL PROTECTED]., je me suis permis de vous le signaler !

 Merci pour cet outil, et bonne continuation !

 Cordialement,

 Tom

Re: WGET :: [Correction de texte]

2008-08-11 Thread Saint Xavier

* Tom ([EMAIL PROTECTED]) wrote:
 Bonjour !

bonjour,

 Je souhaite vous informer d'une touche restée appuyée un quart de seconde
 trop longtemps semble-t-il !
...
 Téléchargement récursif:
   -r,  --recursive  spécifer un téléchargement récursif.
   -l,  --level=NOMBRE   *profondeeur* maximale de récursion (inf ou 0
 Juste un e à enlever de profondeeur, et ca sera réglé !

En effet, merci !

Micah, instead of profondeeur it should be profondeur.
Where do you forward that info, French GNU translation team ?
(./po/fr.po around line 1472)

Saint Xavier.

Re: WGET :: [Correction de texte]

2008-08-11 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Saint Xavier wrote:
 * Tom ([EMAIL PROTECTED]) wrote:
 Bonjour !
 
 bonjour,
 
 Je souhaite vous informer d'une touche restée appuyée un quart de seconde
 trop longtemps semble-t-il !
 ...
 Téléchargement récursif:
   -r,  --recursive  spécifer un téléchargement récursif.
   -l,  --level=NOMBRE   *profondeeur* maximale de récursion (inf ou 0
 Juste un e à enlever de profondeeur, et ca sera réglé !
 
 En effet, merci !
 
 Micah, instead of profondeeur it should be profondeur.
 Where do you forward that info, French GNU translation team ?
 (./po/fr.po around line 1472)

Yup. The mailing address for the French translation team is at
[EMAIL PROTECTED] The team page is
http://translationproject.org/team/fr.html; other translation teams are
listed at http://translationproject.org/team/index.html

Looks like it's still present in the latest fr.po file at
http://translationproject.org/latest/wget/fr.po

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIoIl77M8hyUobTrERApRkAJsGUybOJEDvYidFXc9OWLJ7gIX66QCeL8we
UsjynplN9Um1gmmWUcyZMbU=
=lqbw
-END PGP SIGNATURE-

Re: Wget and Yahoo login?

2008-08-10 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Rick Nakroshis wrote:
 Micah,
 
 If you will excuse a quick question about Wget, I'm trying to find out
 if I can use it to download a page from Yahoo that requires me to be
 logged in using my Yahoo profile name and password.  It's a display of a
 CSV file, and the only wrinkle is trying to get past the Yahoo login.
 
 Try as I may, I just can't seem to find anything about Wget and Yahoo. 
 Any suggestions or pointers?

Hi Rick,

In the future, it's better if you post questions to the mailing list at
wget@sunsite.dk; I don't always have time to respond.

The easiest way to do what you want may be to log in using your browser,
and then tell Wget to use the cookies from your browser, using
- --load-cookies=path-to-browser's-cookies. Of course, this only works
if your browser saves its cookies in the standard text format (Firefox
prior to version 3 will do this), or can export to that format (note
that someone contributed a patch to allow Wget to work with Firefox 3
cookies; it's linked from http://wget.addictivecode.org/, it's
unoffocial so I can't vouch for its quality).

Otherwise, you can perform the login using Wget, saving the cookies to a
file of your choice, using --post-data=..., --save-cookies=cookies.txt,
and probably --keep-session-cookies. This will require that you know
what data to place in --post-data, which generally requires that you dig
around in the HTML to find the right form field names, and where to post
them.

For instance, if you find a form like the following within the page
containing the log-in form:

form action=/doLogin.php method=POST
  input type=text name=s-login
  input type=password name=s-pass
/form

then you need to do something like:

  $ wget --post-data='s-login=USERNAMEs-pass=PASSWORD' \
--save-cookies=my-cookies.txt --keep-session-cookies \
http://HOSTNAME/doLogin.php

(Note that you _don't_ necessarily send the information to the page that
 had the login page: you send it to the spot mentioned in the action
attribute of the password form.)

Once this is done, you _should_ be able to perform further operations
with Wget as if you're logged in, by using

  $ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
--keep-session-cookies ...

(I'm going to put this up on the Wgiki Faq now, at
http://wget.addictivecode.org/FrequentlyAskedQuestions)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIn09A7M8hyUobTrERAu04AJ9EgRoBBhvNCDwOt87f91p+HpWktACdFgMM
KEfliBtfrPBbh/XdvusEPiw=
=qlGZ
-END PGP SIGNATURE-

Re: WGET Date-Time

2008-08-07 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Andreas Weller wrote:
 Hi!
 I use wget to download files from a ftp server in a bash script.
 For example:
 touch last.time
 wget -nc ftp://[]/*.txt .
 find -newer last.time
 
 This fails if the files on the FTP server are older than my last.time. So I 
 want
 wget to set file date/time to the local creation time not the server's...
 
 How to do this?

You can't, currently. This behavior is intended to support Wget's
timestamping (-N) functionality.

However, I'd accept a patch for an option that disables this.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIm2si7M8hyUobTrERAi9AAJ0f8TUv7TJR6tFsgc4k174rqH6OlgCghCzz
xpemaFdQhODIm0SGp7rJSRA=
=vDKD
-END PGP SIGNATURE-

Re: Wget scriptability

2008-08-03 Thread Dražen Kačar

Micah Cowan wrote:

 Okay, so there's been a lot of thought in the past, regarding better
 extensibility features for Wget. Things like hooks for adding support
 for traversal of new Content-Types besides text/html, or adding some
 form of JavaScript support, or support for MetaLink. Also, support for
 being able to filter results pre- and post-processing by Wget: for
 example, being able to do some filtering on the HTML to change how Wget
 sees it before parsing for links, but without affecting the actual
 downloaded version; or filtering the links themselves to alter what Wget
 fetches.

 However, another thing that's been vaguely itching at me lately, is the
 fact that Wget's design is not particularly unix-y. Instead of doing one
 thing, and doing it well, it does a lot of things, some well, some not.

It does what various people needed. It wasn't an excercise in writing a
unixy utility. It was a program that solved real problems for real
people.

 But the thing everyone loves about Unix and GNU (and certainly the thing
 that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
 paradigm,

I have always hated that. With a passion.

  - The tools themselves, as much as possible, should be written in an
 easily-hackable scripting language. Python makes a good candidate. Where
 we want efficiency, we can implement modules in C to do the work.

At the time Wget was conceived, that was Tcl's mantra. It failed
miserably. :-)

How about concentrating on the problems listed in your first paragraph
(which is why I quoted it)? Could you show us how would a buch of shell
tools solve them? Or how would a librarized Wget solve them? Or how
would any other paradigm or architecture or whatever solve them?

-- 
 .-.   .-.Yes, I am an agent of Satan, but my duties are largely
(_  \ /  _)   ceremonial.
 |
 |[EMAIL PROTECTED]

Re: Wget scriptability

2008-08-03 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Dražen Kačar wrote:
 Micah Cowan wrote:
 
 Okay, so there's been a lot of thought in the past, regarding better
 extensibility features for Wget. Things like hooks for adding support
 for traversal of new Content-Types besides text/html, or adding some
 form of JavaScript support, or support for MetaLink. Also, support for
 being able to filter results pre- and post-processing by Wget: for
 example, being able to do some filtering on the HTML to change how Wget
 sees it before parsing for links, but without affecting the actual
 downloaded version; or filtering the links themselves to alter what Wget
 fetches.
 
 However, another thing that's been vaguely itching at me lately, is the
 fact that Wget's design is not particularly unix-y. Instead of doing one
 thing, and doing it well, it does a lot of things, some well, some not.
 
 It does what various people needed. It wasn't an excercise in writing a
 unixy utility. It was a program that solved real problems for real
 people.

 But the thing everyone loves about Unix and GNU (and certainly the thing
 that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
 paradigm,
 
 I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command utilities
(that is, to pipeline). The power and flexibility of pipelines is
extremely well-established in the Unix world; I feel no need whatsoever
to waste breath arguing for it, particularly when you haven't provided
the reasons you hate it.

For my part, I'm not exaggerating that it's single-handedly responsible
for why I'm a Unix/GNU user at all, and why I continue to highly enjoy
developing on it.

  find -name '*.html' -exec sed -i \
's#http://oldhost/#http://newhost/#g' \;

  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  pic | tbl | eqn | eff-ing | troff -ms

Each one of these demonstrates the enormously powerful technique of
using distinct tools with distinct feature domains, together to form a
cohesive solution for the need. The best part is (with the possible
exception of the troff pipeline), each of these components are
immediately available for use in some other pipeline that does some
other completely different function.

Note, though, that I don't intend that using Piped-Wget would actually
mean the user types in a special pipeline each time he wants to do
something with it. The primary driver would read in some config file
that would tell wget how it should do the piping. You just tweak the
config file when you want to add new functionality.

  - The tools themselves, as much as possible, should be written in an
 easily-hackable scripting language. Python makes a good candidate. Where
 we want efficiency, we can implement modules in C to do the work.
 
 At the time Wget was conceived, that was Tcl's mantra. It failed
 miserably. :-)

Are you claiming that Tcl's failure was due to the ability to integrate
it with C, rather than its abysmal inadequacy as a programming language
(changing it from an ability to integrate with C, to an absolute
requirement to do so in order to get anything accomplished)?

 How about concentrating on the problems listed in your first paragraph
 (which is why I quoted it)? Could you show us how would a buch of shell
 tools solve them? Or how would a librarized Wget solve them? Or how
 would any other paradigm or architecture or whatever solve them?

It should be trivially obvious: you plug them in, rather than wait for
the Wget developers to get around to implementing it.

The thing that both library-ized Wget and pipeline-ized Wget would offer
is the same: extreme flexibility. It puts the users in control of what
Wget does, rather than just perpetually hearing, sorry, Wget can't do
it: you could hack the source, though. :p

The difference between the two is that a pipelined Wget offers this
flexibility to a wider range of users, whereas a library Wget offers it
to C programmers.

Or how would you expect to do these things without a library-ized (at
least) Wget? Implementing them in the core app (at least by default) is
clearly wrong (scope bloat). Giving Wget a plugin architecture is good,
but then there's only as much flexibility as there are hooks.
Libraryizing Wget is equivalent to providing everything as hooks, and
puts the program using it in the driver's seat (and, naturally, there'd
be a wrapper implementation, like curl for libcurl). A suite of
interconnected utilities does the same, but is more accessible to
greater numbers of people. Generally at some expense to efficiency
(aren't all flexible architectures?); but Wget isn't CPU-bound, it's
network-bound.

As mentioned in my original post, this would be a separate project from
Wget. Wget would not be going away (though it seems likely to me that it
would quickly reach a primarily

Re: wget does not like this URL

2008-07-31 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Kevin O'Gorman wrote:
 Is there a reason i get this:
 [EMAIL PROTECTED] Pending $ wget -O foo 
 http://www.littlegolem.net/jsp/info/player_game_list_txt.jsp?plid=1107gtid=hex;
 Cannot specify -r, -p or -N if -O is given.
 Usage: wget [OPTION]... [URL]...
 [EMAIL PROTECTED] Pending $
 
 While I do have -O, I don't have the ones it seems to think I've specified.
 
 Without the -O foo it works fine, but of course puts the results in
 a different place.
 I get the same error message if I use the long-form parameter.

You most likely have timestamping=on in your wgetrc. -N and -O were
disallowed for version 1.11, but were re-enabled for 1.11.3 (I think)
with a warning. The latest version of wget is 1.11.4.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIkf9U7M8hyUobTrERAtkfAJ9g84lMEkzSeLn24cWQA805HZmE8wCfV2Ck
bB5RK4lRlcBbwOSiU4jPwxM=
=K9cv
-END PGP SIGNATURE-

RE: wget-1.11.4 bug

2008-07-26 Thread kuang-cheng chao


Micah Cowan wrote: The thing is, though, those two threads should be running 
wgets under separate processes
 
Yes, the two threads are running wgets under seperate processes with system.
 What operating system are you running? Vista?mipsel-linux with kernel v2.4 
 built from gcc v3.3.5 
 
Best regards,
K.C. Chao
_
Discover the new Windows Vista
http://search.msn.com/results.aspx?q=windows+vistamkt=en-USform=QBRE

Re: wget-1.11.4 bug

2008-07-25 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

kuang-cheng chao wrote:
 Dear Micah:
  
 Thanks for your work of wget.
  
 There is a question about two wgets run simultaneously.
 In method resolve_bind_address, wget assumes that this is called once.
 However, this will cause two domain name with the same ip if two wgets
 run the same method concurrently.

Have you reproduced this, or is this in theory? If the latter, what has
led you to this conclusion? I don't see anything in the code that would
cause this behavior.

Also, please use the mailing list for discussions about Wget. I've added
it to the recipients list.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIiYKF7M8hyUobTrERAr7fAJ0TnkLdEVOMy6wJA3Z1kIYC7dQoMACfZ9hb
x5K6MTzhgVRCdKJwUGnbSRw=
=EcFC
-END PGP SIGNATURE-

RE: wget-1.11.4 bug

2008-07-25 Thread kuang-cheng chao


Micah Cowan wrote:
 Have you reproduced this, or is this in theory? If the latter, what has led 
 you to this conclusion? I don't see anything in the code that would cause 
 this behavior.
I reproduce this. But I can't make sure the really problem is in 
resolve_bind_address.
In the attached message, both api.yougotphogo.com and farm1.static.flickr.com 
get the same ip(74.124.203.218).
The two wget are called from two threads of a program.
 
Best regards,
k.c. chao
 
p.s. 
 
The log is follworing:
 
wget -4 -t 6 
http://api.yougotphoto.com/device/?action=get_device_new_photoapi=2.2api_key=f10df554a958fd10050e2d305241c7a3device_class=2serial_no=000E2EE5676Furl_no=24616cksn=44fe191d6cb4e7807f75938b5d72f07c;
 -O /tmp/webii/ygp_new_photo_list.txt--1999-11-30 00:04:21--  
http://api.yougotphoto.com/device/?action=get_device_new_photoapi=2.2api_key=f10df554a958fd10050e2d305241c7a3device_class=2serial_no=000E2EE5676Furl_no=24616cksn=44fe191d6cb4e7807f75938b5d72f07cResolving
 api.yougotphoto.com... wget -4 -t 6 
http://farm1.static.flickr.com/33/49038824_e4b04b7d9f_b.jpg; -O 
/tmp/webii/24616 74.124.203.218Connecting to 
api.yougotphoto.com|74.124.203.218|:80... --1999-11-30 00:04:22--  
http://farm1.static.flickr.com/33/49038824_e4b04b7d9f_b.jpgResolving 
farm1.static.flickr.com... 74.124.203.218Connecting to 
farm1.static.flickr.com|74.124.203.218|:80... connected. 
_
Discover the new Windows Vista
http://search.msn.com/results.aspx?q=windows+vistamkt=en-USform=QBRE

Re: wget-1.11.4 bug

2008-07-25 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

k.c. chao wrote:
 Micah Cowan wrote:
  Have you reproduced this, or is this in theory? If the latter, what has
  led you to this conclusion? I don't see anything in the code that would
  cause this behavior.

 I reproduce this. But I can't make sure the really problem is in
 resolve_bind_address. In the attached message, both
 api.yougotphogo.com and farm1.static.flickr.com get the same
 ip(74.124.203.218).  The two wget are called from two threads of a
 program.

Yeah, I get 68.142.213.135 for the flickr.com address, currently.

The thing is, though, those two threads should be running wgets under
separate processes (I'm not sure how they couldn't be, but if they
somehow weren't that would be using Wget other than how it was designed
to be used).

This problem sounds much more like an issue with the OS's API than an
issue with Wget, to me. But we'd still want to work around it if it were
feasible.

What operating system are you running? Vista?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIirT17M8hyUobTrERAjsuAJ0crMPYIQficu1csou8Tt0jDFKvpQCeNYk3
1FhXl3uUYj2IA53qI1oOJ8A=
=DvdG
-END PGP SIGNATURE-

Re: Wget

2008-07-22 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hor Meng Yoong wrote:
 Hi:
 
   I understand that you are a very busy person. Sorry to disturb you.

Hi; please use the mailing list for support requests. I've copied the
list in my response.

   I am using wget to mirror (using ftp://) a user home directory from a
 unix machine. Wget default to the user's home directory. However, I also
 need to get /etc folder. So, I tried to use ../../../etc. It works but
 the result of the ftpped files are in %2E%2E/ %2E%2E/ %2E%2E
 
 Any means to overcome this, or rename the directory.

Try the -nd option (you may also need -nH). You might prefer to fetch
/etc in a separate invocation from the other things; perhaps with the -P
option to specify a directory name.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIhi5O7M8hyUobTrERAl+YAJ9xaX5NivhEfzJLHKD5T3qs0nZuOACgg0eC
IqFZMlz8obK+loKyQ6vXCWo=
=gNqH
-END PGP SIGNATURE-

Re: WGET bug...

2008-07-11 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

HARPREET SAWHNEY wrote:
 Hi,
 
 I am getting a strange bug when I use wget to download a binary file
 from a URL versus when I manually download.
 
 The attached ZIP file contains two files:
 
 05.upc --- manually downloaded
 dum.upc--- downloaded through wget
 
 wget adds a number of ascii characters to the head of the file and seems
 to delete a similar number from the tail.
 
 So the file sizes are the same but the addition and deletion renders
 the file useless.
 
 Could you please direct me on if I should be using some specific
 option to avoind this problem?

In the future, it's useful to mention which version of Wget you're using.

The problem you're having is that the server is adding the extra HTML at
the front of your session, and then giving you the file contents anyway.
It's a bug in the PHP code that serves the file.

You're getting this extra content because you are not logged in when
you're fetching it. You need to have Wget send a cookie with an
login-session information, and then the server will probably stop
sending the corrupting information at the head of the file. The site
does not appear to use HTTP's authentication mechanisms, so the
[EMAIL PROTECTED] bit in the URL doesn't do you any good. It uses
Forms-and-cookies authentication.

Hopefully, you're using a browser that stores its cookies in a text
format, or that is capable of exporting to a text format. In that case,
you can just ensure that you're logged in in your browser, and use the
- --load-cookies=cookies.txt option to Wget to use the same session
information.

Otherwise, you'll need to use --save-cookies with Wget to simulate the
login form post, which is tricky and requires some understanding of HTML
Forms.

- --
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFId9Vy7M8hyUobTrERAjCWAJ9niSjC5YdBDNcAbnBFWZX6D8AO7gCeM8nE
i8jn5i5Y6wLX1g3Q2hlDgcM=
=uOke
-END PGP SIGNATURE-

Re: WGET bug...

2008-07-11 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

HARPREET SAWHNEY wrote:
 Hi,
 
 Thanks for the prompt response.
 
 I am using
 
 GNU Wget 1.10.2
 
 I tried a few things on your suggestion but the problem remains.
 
 1. I exported the cookies file in Internet Explorer and specified
 that in the Wget command line. But same error occurs.
 
 2. I have an open session on the site with my username and password.
 
 3. I also tried running wget while I am downloading a file from the
 IE session on the site, but the same error.

Sounds like you'll need to get the appropriate cookie by using Wget to
login to the website. This requires site-specific information from the
user-login form page, though, so I can't help you without that.

If you know how to read some HTML, then you can find the HTML form used
for posting username/password stuff, and use

wget --keep-session-cookies --save-cookies=cookies.txt \
- --post-data='username=foopassword=bar' ACTION

Where ACTION is the value of the form's action field, USERNAME and
PASSWORD (and possibly further required values) are field names from the
HTML form, and FOO and BAR is the username/password.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFId+w97M8hyUobTrERAmLsAJ91231iGeO/albrgRuuUCRp8zFcnwCgiX3H
fDp2J2oTBKlxW17eQ2jaCAA=
=Khmi
-END PGP SIGNATURE-

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-24 Thread Tony Lewis

Coombe, Allan David (DPS) wrote:

 However, the case of the files on disk is still mixed - so I assume that
 wget is not using the URL it originally requested (harvested from the
 HTML?) to create directories and files on disk.  So what is it using? A
 http header (if so, which one??).

I think wget uses the case from the HTML page(s) for the file name; your
proxy would need to change the URLs in the HTML pages to lower case too.

Tony

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-24 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
 Coombe, Allan David (DPS) wrote:
 
 However, the case of the files on disk is still mixed - so I assume
 that wget is not using the URL it originally requested (harvested
 from the HTML?) to create directories and files on disk.  So what
 is it using? A http header (if so, which one??).
 
 I think wget uses the case from the HTML page(s) for the file name;
 your proxy would need to change the URLs in the HTML pages to lower
 case too.

My understanding from David's post is that he claimed to have been doing
just that:

 I modified the response from the web site to lowercase the urls in
 the html (actually I lowercased the whole response) and the data that
 wget put on disk was fully lowercased - problem solved - or so I
 thought.

My suspicion is it's not quite working, though, as otherwise
where would Wget be getting the mixed-case URLs?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIYVyq7M8hyUobTrERAo6mAJ4ylEi5qUZqE7DR8xL2XjWOSfuurACePrIz
Vl7REl1hNVNqdBrLqoygrcE=
=jlBN
-END PGP SIGNATURE-

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-24 Thread Coombe, Allan David (DPS)

Sorry Guys - just an ID 10 T error on my part.

I think I need to change 2 things in the proxy server.

1.  URLs in the HTML being returned to wget - this works OK
2.  The Content-Location header used when the web server reports a
301 Moved Permanently response - I think this works OK.

When I reported that it wasn't working I hadn't done both at the same
time.

Cheers

Allan

-Original Message-
From: Micah Cowan [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 25 June 2008 6:44 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensetivity and URLs


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
 Coombe, Allan David (DPS) wrote:
 
 However, the case of the files on disk is still mixed - so I assume 
 that wget is not using the URL it originally requested (harvested 
 from the HTML?) to create directories and files on disk.  So what is 
 it using? A http header (if so, which one??).
 
 I think wget uses the case from the HTML page(s) for the file name; 
 your proxy would need to change the URLs in the HTML pages to lower 
 case too.

My understanding from David's post is that he claimed to have been doing
just that:

 I modified the response from the web site to lowercase the urls in the

 html (actually I lowercased the whole response) and the data that wget

 put on disk was fully lowercased - problem solved - or so I thought.

My suspicion is it's not quite working, though, as otherwise where would
Wget be getting the mixed-case URLs?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIYVyq7M8hyUobTrERAo6mAJ4ylEi5qUZqE7DR8xL2XjWOSfuurACePrIz
Vl7REl1hNVNqdBrLqoygrcE=
=jlBN
-END PGP SIGNATURE-

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-22 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Coombe, Allan David (DPS) wrote:
 OK - now I am confused.
 
 I found a perl based http proxy (named http::proxy funnily enough)
 that has filters to change both the request and response headers and
 data.  I modified the response from the web site to lowercase the urls
 in the html (actually I lowercased the whole response) and the data that
 wget put on disk was fully lowercased - problem solved - or so I thought.
 
 However, the case of the files on disk is still mixed - so I assume that
 wget is not using the URL it originally requested (harvested from the
 HTML?) to create directories and files on disk.  So what is it using? A
 http header (if so, which one??).

I think you're missing something on your end; I couldn't begin to tell
you what. Running with --debug will likely be informative.

Wget uses the URL that successfully results in a file download. If the
files on disk have mixed case, then it's because it was the result of a
mixed-case request from Wget (which, in turn, must have either resulted
from an explicit argument, or from HTML content).

The only exception to the above is when you explicitly enable
- --content-disposition support, in which case Wget will use any filename
specified in a Content-Disposition header. Those are virtually never
issued, except for CGI-based downloads (and you have to explicitly
enable it).

- --
Good luck!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIXe0Z7M8hyUobTrERAkF5AJ9FOkx5XQJCx9vkTV9xr2zbYzp4jwCffrec
zhdtjp59GOwt07YgvtolM8o=
=FZ3m
-END PGP SIGNATURE-

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-21 Thread Coombe, Allan David (DPS)

OK - now I am confused.

I found a perl based http proxy (named http::proxy funnily enough)
that has filters to change both the request and response headers and
data.  I modified the response from the web site to lowercase the urls
in the html (actually I lowercased the whole response) and the data that
wget put on disk was fully lowercased - problem solved - or so I
thought.

However, the case of the files on disk is still mixed - so I assume that
wget is not using the URL it originally requested (harvested from the
HTML?) to create directories and files on disk.  So what is it using? A
http header (if so, which one??).

Any ideas??

Cheers
Allan

Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-20 Thread Michelle Konzack

Hello Stefan,

I have a question:

Am 2008-06-18 12:17:12, schrieb Stefan Nowak:
 wget \
 --page-requisites \
 --html-extension \
 --convert-links \
 --span-hosts \
 --no-check-certificate \
 --debug \
 https://help.ubuntu.com/community/MacBookPro/  log.txt

Why do you use

 log.txt
instead of
--output-file=log.txt
or
--append-output=log.txt

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
24V Electronic Engineer
Tamay Dogan Network
Debian GNU/Linux Consultant


-- 
Linux-User #280138 with the Linux Counter, http://counter.li.org/
# Debian GNU/Linux Consultant #
Michelle Konzack   Apt. 917  ICQ #328449886
+49/177/935194750, rue de Soultz MSN LinuxMichi
+33/6/61925193 67100 Strasbourg/France   IRC #Debian (irc.icq.com)


signature.pgp
Description: Digital signature

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread Coombe, Allan David (DPS)

Thanks averyone for the contributions.

Ultimately, our purpose is to process documents from the site into our
search database, so probably the most important thing is to limit the
number of files being processed.  The case of  the URLs in the html
probably wouldn't cause us much concern, but I could see that it might
be useful to convert a site for mirroring from a non-case sensetive
(windows) environment to a case sensetive (li|u)nix one - this would
need to include translation of urls in content as well as filenames on
disk.

In the meantime - does anyone know of a proxy server that could
translate urls from mixed case to lower case.  I thought that if we
downloaded using wget via such a proxy server we might get the
appropriate result.  

The other alternative we were thinking of was to post process the files
with symlinks for all mixed case versions of files and directories (I
think someone already suggested this - greate minds and all that...). I
assume that wget would correctly use the symlink to determine the
time/date stamp of the file for determining if it requires updating (or
would it use the time/date stamp of the symlink?). I also assume that if
wget downloaded the file it would overwrite the symlink and we would
have to run our convert files to symlinks process again.

Just to put it in perspective, the actual site is approximately 45gb
(that's what the administrator said) and wget downloaded  100gb
(463,000 files) when I did the first process.

Cheers
Allan

-Original Message-
From: Micah Cowan [mailto:[EMAIL PROTECTED] 
Sent: Saturday, 14 June 2008 7:30 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensetivity and URLs


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
 Micah Cowan wrote:
 
 Unfortunately, nothing really comes to mind. If you'd like, you could

 file a feature request at 
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option

 asking Wget to treat URLs case-insensitively.
 
 To have the effect that Allan seeks, I think the option would have to 
 convert all URIs to lower case at an appropriate point in the process.

 I think you probably want to send the original case to the server 
 (just in case it really does matter to the server). If you're going to

 treat different case URIs as matching then the lower-case version will

 have to be stored in the hash. The most important part (from the 
 perspective that Allan voices) is that the versions written to disk 
 use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
real cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook. Which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the external command hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
nVYivipui+0TRmmK04kD2JE=
=OMsD
-END PGP SIGNATURE-

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread mm w

a simple url-rewriting conf should fix the problem, wihout touch the file system
everything can be done server side

Best Regards

On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS)
[EMAIL PROTECTED] wrote:
 Thanks averyone for the contributions.

 Ultimately, our purpose is to process documents from the site into our
 search database, so probably the most important thing is to limit the
 number of files being processed.  The case of  the URLs in the html
 probably wouldn't cause us much concern, but I could see that it might
 be useful to convert a site for mirroring from a non-case sensetive
 (windows) environment to a case sensetive (li|u)nix one - this would
 need to include translation of urls in content as well as filenames on
 disk.

 In the meantime - does anyone know of a proxy server that could
 translate urls from mixed case to lower case.  I thought that if we
 downloaded using wget via such a proxy server we might get the
 appropriate result.

 The other alternative we were thinking of was to post process the files
 with symlinks for all mixed case versions of files and directories (I
 think someone already suggested this - greate minds and all that...). I
 assume that wget would correctly use the symlink to determine the
 time/date stamp of the file for determining if it requires updating (or
 would it use the time/date stamp of the symlink?). I also assume that if
 wget downloaded the file it would overwrite the symlink and we would
 have to run our convert files to symlinks process again.

 Just to put it in perspective, the actual site is approximately 45gb
 (that's what the administrator said) and wget downloaded  100gb
 (463,000 files) when I did the first process.

 Cheers
 Allan

 -Original Message-
 From: Micah Cowan [mailto:[EMAIL PROTECTED]
 Sent: Saturday, 14 June 2008 7:30 AM
 To: Tony Lewis
 Cc: Coombe, Allan David (DPS); 'Wget'
 Subject: Re: Wget 1.11.3 - case sensetivity and URLs


 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Tony Lewis wrote:
 Micah Cowan wrote:

 Unfortunately, nothing really comes to mind. If you'd like, you could

 file a feature request at
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option

 asking Wget to treat URLs case-insensitively.

 To have the effect that Allan seeks, I think the option would have to
 convert all URIs to lower case at an appropriate point in the process.

 I think you probably want to send the original case to the server
 (just in case it really does matter to the server). If you're going to

 treat different case URIs as matching then the lower-case version will

 have to be stored in the hash. The most important part (from the
 perspective that Allan voices) is that the versions written to disk
 use lower case characters.

 Well, that really depends. If it's doing a straight recursive download,
 without preexisting local files, then all that's really necessary is to
 do lookups/stores in the blacklist in a case-normalized manner.

 If preexisting files matter, then yes, your solution would fix it.
 Another solution would be to scan directory contents for the first name
 that matches case insensitively. That's obviously much less efficient,
 but has the advantage that the file will match at least one of the
 real cases from the server.

 As Matthias points out, your lower-case normalization solution could be
 achieved in a more general manner with a hook. Which is something I was
 planning on introducing perhaps in 1.13 anyway (so you could, say, run
 sed on the filenames before Wget uses them), so that's probably the
 approach I'd take. But probably not before 1.13, even if someone
 provides a patch for it in time for 1.12 (too many other things to focus
 on, and I'd like to introduce the external command hooks as a suite,
 if possible).

 OTOH, case normalization in the blacklists would still be useful, in
 addition to that mechanism. Could make another good addition for 1.13
 (because it'll be more useful in combination with the rename hooks).

 - --
 Micah J. Cowan
 Programmer, musician, typesetting enthusiast, gamer,
 and GNU Wget Project Maintainer.
 http://micah.cowan.name/
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.6 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
 nVYivipui+0TRmmK04kD2JE=
 =OMsD
 -END PGP SIGNATURE-




-- 
-mmw

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread mm w

without touching the file system

On Thu, Jun 19, 2008 at 9:23 AM, mm w [EMAIL PROTECTED] wrote:
 a simple url-rewriting conf should fix the problem, wihout touch the file 
 system
 everything can be done server side

 Best Regards

 On Thu, Jun 19, 2008 at 6:29 AM, Coombe, Allan David (DPS)
 [EMAIL PROTECTED] wrote:
 Thanks averyone for the contributions.

 Ultimately, our purpose is to process documents from the site into our
 search database, so probably the most important thing is to limit the
 number of files being processed.  The case of  the URLs in the html
 probably wouldn't cause us much concern, but I could see that it might
 be useful to convert a site for mirroring from a non-case sensetive
 (windows) environment to a case sensetive (li|u)nix one - this would
 need to include translation of urls in content as well as filenames on
 disk.

 In the meantime - does anyone know of a proxy server that could
 translate urls from mixed case to lower case.  I thought that if we
 downloaded using wget via such a proxy server we might get the
 appropriate result.

 The other alternative we were thinking of was to post process the files
 with symlinks for all mixed case versions of files and directories (I
 think someone already suggested this - greate minds and all that...). I
 assume that wget would correctly use the symlink to determine the
 time/date stamp of the file for determining if it requires updating (or
 would it use the time/date stamp of the symlink?). I also assume that if
 wget downloaded the file it would overwrite the symlink and we would
 have to run our convert files to symlinks process again.

 Just to put it in perspective, the actual site is approximately 45gb
 (that's what the administrator said) and wget downloaded  100gb
 (463,000 files) when I did the first process.

 Cheers
 Allan

 -Original Message-
 From: Micah Cowan [mailto:[EMAIL PROTECTED]
 Sent: Saturday, 14 June 2008 7:30 AM
 To: Tony Lewis
 Cc: Coombe, Allan David (DPS); 'Wget'
 Subject: Re: Wget 1.11.3 - case sensetivity and URLs


 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Tony Lewis wrote:
 Micah Cowan wrote:

 Unfortunately, nothing really comes to mind. If you'd like, you could

 file a feature request at
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option

 asking Wget to treat URLs case-insensitively.

 To have the effect that Allan seeks, I think the option would have to
 convert all URIs to lower case at an appropriate point in the process.

 I think you probably want to send the original case to the server
 (just in case it really does matter to the server). If you're going to

 treat different case URIs as matching then the lower-case version will

 have to be stored in the hash. The most important part (from the
 perspective that Allan voices) is that the versions written to disk
 use lower case characters.

 Well, that really depends. If it's doing a straight recursive download,
 without preexisting local files, then all that's really necessary is to
 do lookups/stores in the blacklist in a case-normalized manner.

 If preexisting files matter, then yes, your solution would fix it.
 Another solution would be to scan directory contents for the first name
 that matches case insensitively. That's obviously much less efficient,
 but has the advantage that the file will match at least one of the
 real cases from the server.

 As Matthias points out, your lower-case normalization solution could be
 achieved in a more general manner with a hook. Which is something I was
 planning on introducing perhaps in 1.13 anyway (so you could, say, run
 sed on the filenames before Wget uses them), so that's probably the
 approach I'd take. But probably not before 1.13, even if someone
 provides a patch for it in time for 1.12 (too many other things to focus
 on, and I'd like to introduce the external command hooks as a suite,
 if possible).

 OTOH, case normalization in the blacklists would still be useful, in
 addition to that mechanism. Could make another good addition for 1.13
 (because it'll be more useful in combination with the rename hooks).

 - --
 Micah J. Cowan
 Programmer, musician, typesetting enthusiast, gamer,
 and GNU Wget Project Maintainer.
 http://micah.cowan.name/
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.6 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

 iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
 nVYivipui+0TRmmK04kD2JE=
 =OMsD
 -END PGP SIGNATURE-




 --
 -mmw




-- 
-mmw

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread Tony Lewis

mm w wrote:

 a simple url-rewriting conf should fix the problem, wihout touch the file 
 system
 everything can be done server side

Why do you assume the user of wget has any control over the server from which 
content is being downloaded?

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread mm w

not al, but in this particular case I pretty sure they have

On Thu, Jun 19, 2008 at 10:42 AM, Tony Lewis [EMAIL PROTECTED] wrote:
 mm w wrote:

 a simple url-rewriting conf should fix the problem, wihout touch the file 
 system
 everything can be done server side

 Why do you assume the user of wget has any control over the server from which 
 content is being downloaded?





-- 
-mmw

Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Valentin

Dear Stefan,

If you take a look at the source of the page, you'll see this:

meta name=robots content=index,nofollow

Simply add -e robots=off to your arguments and wget will ignore any
robots.txt files or tags. With that it should download everything you
want. (I did not find this myself, credits go to sxav for pointing this
out. ;)

Cheers,

Valentin


-- 
The last time someone listened to a Bush, a bunch of people wandered in
the desert for 40 years.

Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Ryan Schmidt


On Jun 18, 2008, at 5:17 AM, Stefan Nowak wrote:


where do I set the locale of the CLI environment of MacOSX?


You should set the LANG environment variable to the desired locale,  
and one which is supported on your system; you can look at the  
directories in /usr/share/locale to see what locales are available.


For example, if you want American English, set LANG to en_US.

In the Bash shell, you can type export LANG=en_US

In the Tcsh shell, you can type setenv LANG en_US

To find out which shell you use, type echo $SHELL

Re: wget doesn't load page-requisites from a) dynamic web page b) through https

2008-06-18 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ryan Schmidt wrote:
 For example, if you want American English, set LANG to en_US.
 
 In the Bash shell, you can type export LANG=en_US
 
 In the Tcsh shell, you can type setenv LANG en_US
 
 To find out which shell you use, type echo $SHELL

FYI: It's not in any current release, but current mainline has support
for the special [EMAIL PROTECTED] for LANGUAGE (still may need to set
LANG=en_US or something). This causes all quoted strings to be rendered
in boldface, using terminal escape sequences. I've found it pleasant to
use that setting for my own purposes.

The [EMAIL PROTECTED] LANGUAGE setting is also supported (converts to proper
left/right-quotemarks, but no terminal sequences); but I've rigged
LANG=en_US to have the same effect ([EMAIL PROTECTED] is copied to en_US.po).

Again, this is only in the mainline repo, and not in any release.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIWXvT7M8hyUobTrERAmedAJ44nMxqJCyIBox1LDv/FOibkCslIACeLoS3
Beb0toZwvx29J4Sa3AZk62k=
=Sreb
-END PGP SIGNATURE-

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-16 Thread mm w

On Sat, Jun 14, 2008 at 4:30 PM, Tony Lewis [EMAIL PROTECTED] wrote:
 mm w wrote:

 Hi, after all, after all it's only my point of view :D
 anyway,

 /dir/file,
 dir/File, non-standard
 Dir/file, non-standard
 and /Dir/File non-standard

 According to RFC 2396: The path component contains data, specific to the 
 authority (or the scheme if there is no authority component), identifying the 
 resource within the scope of that scheme and authority.

 In other words, those names are well within the standard when the server 
 understands them. As far as I know, there is nothing in Internet standards 
 restricting mixed case paths.

:) read again, nobody does except some punk-head folks

 that's it, if the server manages non-standard URL, it's not my
 concern, for me it doesn't exist

 Oh. I see. You're writing to say that wget should only implement features 
 that are meaningful to you. Thanks for your narcissistic input.

no i'm not such a jerk, a simple grep/sed on the website source to
remove the malicious URL should be fine,
or an HTTP redirection when the  malicious non-standard URL is called

in other hand, if wget changes every links in lowercase, some people
should have the opposite problem
a golden rule: never distributing mixed-case URL (to your users), a
simple respect for them and everything in lower-case


 Tony





-- 
-mmw

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-14 Thread Tony Lewis

mm w wrote:

 Hi, after all, after all it's only my point of view :D
 anyway,
 
 /dir/file,
 dir/File, non-standard
 Dir/file, non-standard
 and /Dir/File non-standard

According to RFC 2396: The path component contains data, specific to the 
authority (or the scheme if there is no authority component), identifying the 
resource within the scope of that scheme and authority.

In other words, those names are well within the standard when the server 
understands them. As far as I know, there is nothing in Internet standards 
restricting mixed case paths.

 that's it, if the server manages non-standard URL, it's not my
 concern, for me it doesn't exist

Oh. I see. You're writing to say that wget should only implement features that 
are meaningful to you. Thanks for your narcissistic input.

Tony

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis

Micah Cowan wrote:

 Unfortunately, nothing really comes to mind. If you'd like, you could
 file a feature request at
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option
 asking Wget to treat URLs case-insensitively.

To have the effect that Allan seeks, I think the option would have to convert 
all URIs to lower case at an appropriate point in the process. I think you 
probably want to send the original case to the server (just in case it really 
does matter to the server). If you're going to treat different case URIs as 
matching then the lower-case version will have to be stored in the hash. The 
most important part (from the perspective that Allan voices) is that the 
versions written to disk use lower case characters.

Tony

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread mm w

standard: the URL are case-insensitive

you can adapt your software because some people don't respect standard,
we are not anymore in 90's, let people doing crapy things deal with
their crapy world

Cheers!

On Fri, Jun 13, 2008 at 2:08 PM, Tony Lewis [EMAIL PROTECTED] wrote:
 Micah Cowan wrote:

 Unfortunately, nothing really comes to mind. If you'd like, you could
 file a feature request at
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option
 asking Wget to treat URLs case-insensitively.

 To have the effect that Allan seeks, I think the option would have to convert 
 all URIs to lower case at an appropriate point in the process. I think you 
 probably want to send the original case to the server (just in case it really 
 does matter to the server). If you're going to treat different case URIs as 
 matching then the lower-case version will have to be stored in the hash. The 
 most important part (from the perspective that Allan voices) is that the 
 versions written to disk use lower case characters.

 Tony





-- 
-mmw

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Tony Lewis wrote:
 Micah Cowan wrote:
 
 Unfortunately, nothing really comes to mind. If you'd like, you
 could file a feature request at 
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an
 option asking Wget to treat URLs case-insensitively.
 
 To have the effect that Allan seeks, I think the option would have to
 convert all URIs to lower case at an appropriate point in the
 process. I think you probably want to send the original case to the
 server (just in case it really does matter to the server). If you're
 going to treat different case URIs as matching then the lower-case
 version will have to be stored in the hash. The most important part
 (from the perspective that Allan voices) is that the versions written
 to disk use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
real cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook. Which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the external command hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
nVYivipui+0TRmmK04kD2JE=
=OMsD
-END PGP SIGNATURE-

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Steven M. Schweda

   In the VMS world, where file name case may matter, but usually
doesn't, the normal scheme is to preserve case when creating files, but
to do case-insensitive comparisons on file names.

From Tony Lewis:

 To have the effect that Allan seeks, I think the option would have to
 convert all URIs to lower case at an appropriate point in the process.

   I think that that's the wrong way to look at it.  Implementation
details like name hashing may also need to be adjusted, but this
shouldn't be too hard.



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street(+1) 651-699-9818
   Saint Paul  MN  55105-2547

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis

mm w wrote:

 standard: the URL are case-insensitive

 you can adapt your software because some people don't respect standard,
 we are not anymore in 90's, let people doing crapy things deal with
 their crapy world

You obviously missed the point of the original posting: how can one 
conveniently mirror a site whose server uses case insensitive names onto a 
server that uses case sensitive names.

If the original site has the URI strings /dir/file, dir/File, Dir/file, 
and /Dir/File, the same local file will be returned. However, wget will treat 
those as unique directories and files and you wind up with four copies.

Allan asked if there is a way to have wget just create one copy and proposed 
one way that might accomplish that goal.

Tony

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis

Steven M. Schweda wrote:

 From Tony Lewis:
  To have the effect that Allan seeks, I think the option would have to
  convert all URIs to lower case at an appropriate point in the process.

   I think that that's the wrong way to look at it.  Implementation
 details like name hashing may also need to be adjusted, but this
 shouldn't be too hard.

OK. How would you normalize the names?

Tony

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread mm w

Hi, after all, after all it's only my point of view :D
anyway,

/dir/file,
dir/File, non-standard
Dir/file, non-standard
and /Dir/File non-standard

that's it, if the server manages non-standard URL, it's not my
concern, for me it doesn't exist


On Fri, Jun 13, 2008 at 3:12 PM, Tony Lewis [EMAIL PROTECTED] wrote:
 mm w wrote:

 standard: the URL are case-insensitive

 you can adapt your software because some people don't respect standard,
 we are not anymore in 90's, let people doing crapy things deal with
 their crapy world

 You obviously missed the point of the original posting: how can one 
 conveniently mirror a site whose server uses case insensitive names onto a 
 server that uses case sensitive names.

 If the original site has the URI strings /dir/file, dir/File, Dir/file, 
 and /Dir/File, the same local file will be returned. However, wget will 
 treat those as unique directories and files and you wind up with four copies.

 Allan asked if there is a way to have wget just create one copy and proposed 
 one way that might accomplish that goal.

 Tony





-- 
-mmw

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-12 Thread Matthias Vill


Hi list!

saddly I couldn't find the E-Mail of Allan (maybe because I'm atached by 
the news gateway) so this is a list-only-post.


Micah Cowan wrote:

Hi Allan,

You'll generally get better results if you post to the mailing list
(wget@sunsite.dk). I've added it to the recipients list.

Coombe, Allan David (DPS) wrote:

Hi Micah,



First some context
We are using wget 1.11.3 to mirror a web site so we can do some offline
processing on it.  The mirror is on a Solaris 10 x86 server.



The problem we are getting appears to be because the URLs in the HTML
pages that are harvested by wget for downloading have mixed case (the
site we are mirroring is running on a Windows 2000 server using IIS) and
the directory structure created on the mirror have 'duplicate'
directories because of the mixed case.



For example,  the URLs in HTML pages /Senate/committees/index.htm and
/senate/committees/index.htm refer to the same file but wget creates 2
different directory structures on the mirror site for these URLs.


Ok... at this point I need to ask whether you try to mirror or just 
backup the site.
The main problem is easy: the moment you want a working mirror you need 
those mixed-case files or rewrite the url to a unique casing.
At this point it seems to be most practical to either introduce a hook 
like --restrict-file-names to modify the name of the local copy and the 
links inside the downloaded files in the same way.
An other option is to create symlinks for the different directory cases. 
That would safe half the overhead, i guess.

To create such a symlink structure you could use the output of 
find /mirror/basedir -type d | sort -f
 hope that helps.

Matthias

Re: Wget 1.11.3 - case sensetivity and URLs

2008-06-11 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi Allan,

You'll generally get better results if you post to the mailing list
(wget@sunsite.dk). I've added it to the recipients list.

Coombe, Allan David (DPS) wrote:
 Hi Micah,
 
 First some context…
 We are using wget 1.11.3 to mirror a web site so we can do some offline
 processing on it.  The mirror is on a Solaris 10 x86 server.
 
 The problem we are getting appears to be because the URLs in the HTML
 pages that are harvested by wget for downloading have mixed case (the
 site we are mirroring is running on a Windows 2000 server using IIS) and
 the directory structure created on the mirror have 'duplicate'
 directories because of the mixed case.
 
 For example,  the URLs in HTML pages /Senate/committees/index.htm and
 /senate/committees/index.htm refer to the same file but wget creates 2
 different directory structures on the mirror site for these URLs.
 
 This appears to be a fairly basic thing, but we can't see any wget
 options that allow us to treat URLs case insensetively.
 
 We don't really want to post-process the site just to merge the files
 and directories with different case.

Unfortunately, nothing really comes to mind. If you'd like, you could
file a feature request at
https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option
asking Wget to treat URLs case-insensitively. Finding local files
case-insensitively, on a case-sensitive filesystem, would be a PITA; but
adding and looking up URLs in the internal blacklist hash wouldn't be
too hard. I probably wouldn't get to that for a while, though.

Another useful option might be to change the name of index files, so
that, for instance, you could have URLs like http://foo/ result in
foo/index.htm or foo/default.html, rather than foo/index.html.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUG937M8hyUobTrERAqq2AJ48mGvcFCSxnouTFqYTuRHzVgwYdgCeLegI
vkdzf3Lu+Vn5diCOHk5CRhc=
=IlG9
-END PGP SIGNATURE-

Re: wget 1.11.1 make test fails

2008-04-04 Thread Hrvoje Niksic

Alain Guibert [EMAIL PROTECTED] writes:

  On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote:

 Micah Cowan [EMAIL PROTECTED] writes:
 It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME

 The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code
 apparently intending to return FNM_NOMATCH on a slash. But this code
 seems to be rather broken.

Or it could be that you're picking up a different fnmatch.h that sets
up a different value for FNM_PATHNAME.  Do you have more than one
fnmatch.h installed on your system?

Re: wget 1.11.1 make test fails

2008-04-04 Thread Hrvoje Niksic

Alain Guibert [EMAIL PROTECTED] writes:

 Maybe you could put a breakpoint in fnmatch and see what goes wrong?

 The for loop intended to eat several characters from the string also
 advances the pattern pointer. This one reaches the end of the pattern,
 and points to a NUL. It is not a '*' anymore, so the loop exits
 prematurely. Just below, a test for NUL returns 0.

Thanks for the analysis.  Looking at the current fnmatch code in
gnulib, it seems that the fix is to change that NUL test to something
like:

  if (c == '\0')
{
  /* The wildcard(s) is/are the last element of the pattern.
 If the name is a file name and contains another slash
 this means it cannot match. */
  int result = (flags  FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
  if (flags  FNM_PATHNAME)
{
  if (!strchr (n, '/'))
result = 0;
}
  return result;
}

But I'm not at all sure that it covers all the needed cases.  Maybe we
should simply switch to gnulib-provided fnmatch?  Unfortunately that
one is quite complex and quite hard for the '**' extension Micah
envisions.  There might be other fnmatch implementations out there in
GNU which are debugged but still simpler than the gnulib/glibc one.


It's kind of ironic that while the various system fnmatches were
considered broken, the one Wget was using (for many years
unconditionally!) was also broken.

Re: wget 1.11.1 make test fails

2008-04-04 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hrvoje Niksic wrote:
 Alain Guibert [EMAIL PROTECTED] writes:
 
 Maybe you could put a breakpoint in fnmatch and see what goes wrong?
 The for loop intended to eat several characters from the string also
 advances the pattern pointer. This one reaches the end of the pattern,
 and points to a NUL. It is not a '*' anymore, so the loop exits
 prematurely. Just below, a test for NUL returns 0.
 
 Thanks for the analysis.  Looking at the current fnmatch code in
 gnulib, it seems that the fix is to change that NUL test to something
 like:
 
   if (c == '\0')
 {
   /* The wildcard(s) is/are the last element of the pattern.
  If the name is a file name and contains another slash
  this means it cannot match. */
   int result = (flags  FNM_PATHNAME) == 0 ? 0 : FNM_NOMATCH;
   if (flags  FNM_PATHNAME)
 {
   if (!strchr (n, '/'))
 result = 0;
 }
   return result;
 }
 
 But I'm not at all sure that it covers all the needed cases.

I'm thinking not: the loop still shouldn't be incrementing n, since that
forces each additional * to match at least one character, doesn't it?
Gnulib's version seems to handle that better.

 Maybe we
 should simply switch to gnulib-provided fnmatch?  Unfortunately that
 one is quite complex and quite hard for the '**' extension Micah
 envisions.  There might be other fnmatch implementations out there in
 GNU which are debugged but still simpler than the gnulib/glibc one.

Maybe. I'm not sure ** would be too hard to add to gnulib's fnmatch,
just have to toggle with the FNM_FILE_NAME tests within the '*' case, if
we see an immediate second '*'. But maybe ** as part of a *?**? sequence
is more complex. I don't think so, though.

The main thing is that we need it to support the invalid sequence stuff.

Hm; I'm not sure we'll ever want fnmatch() to be locale-aware, though.
User-specified match patterns should interpret characters based on the
locale; but the source strings may be in different encodings altogether.
If we solve this by transcoding to the current locale, we may find that
the user's locale doesn't support all of the characters that the
original string's encoding does. Probably we'll need to transcode both
to Unicode before comparison.

In the meantime, though, I think we want a simple byte-by-byte match.
Perhaps it's best to (a) use our custom matcher, ignoring the system's
(so we don't get locale specialness), and (b) fix it, providing as
thorough test coverage as possible.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9jWi7M8hyUobTrERAglwAKCDnpnDjr44Ovgh/oBuzkM4mu/gKACeNnN8
arvFSrCEBatNeO29fzHxuU4=
=QDMp
-END PGP SIGNATURE-

Re: wget 1.11.1 make test fails

2008-04-04 Thread Alain Guibert

 On Thursday, April 3, 2008 at 22:37:41 +0200, Hrvoje Niksic wrote:

 Or it could be that you're picking up a different fnmatch.h that sets
 up a different value for FNM_PATHNAME.  Do you have more than one
 fnmatch.h installed on your system?

I have only /usr/include/fnmatch.h installed, identical to the file in
the libc-5.4.33 tarball, and defining the same values as wget's
src/sysdep.h (even comments are identical). Just my fnmatch.h defines
two more flags, FNM_LEADING_DIR=8 and FNM_CASEFOLD=16, and defines an
FNM_FILE_NAME alias (commented as Preferred GNU name) to
FNM_PATHNAME=1 (the libc code uses only this alias). Anyway I had
noticed your comment about incompatible headers, and double-checked your
little test program also with explicit value 1: same results.


BTW everybody should be able to reproduce the make test failure, on any
system, just by #undefining SYSTEM_FNMATCH in src/sysdep.h


Alain.

Re: wget 1.11.1 make test fails

2008-04-04 Thread Alain Guibert

 On Thursday, April 3, 2008 at 9:14:52 -0700, Micah Cowan wrote:

 Are you certain you rebuilt cmpt.o? This seems pretty unlikely, to me.

Certain: make test after touching src/sysdep.h rebuilds both cmpt.o, the
normal in src/ and the one in tests/. And both those cmpt.o become
784 bytes bigger without SYSTEM_FNMATCH.


Alain.

Re: wget 1.11.1 make test fails

2008-04-03 Thread Hrvoje Niksic

Alain Guibert [EMAIL PROTECTED] writes:

 This old system does HAVE_WORKING_FNMATCH_H (and thus
 SYSTEM_FNMATCH).  When #undefining SYSTEM_FNMATCH, the test still
 fails at the very same line. And then it also fails on modern
 systems. I guess this points at the embedded src/cmpt.c:fnmatch()
 replacement?

Well, it would point to a problem with both the fnmatch replacement
and the older system fnmatch.  Our fnmatch (coming from an old
release of Bash, but otherwise very well-tested, both in Bash and
Wget) is careful to special-case '/' only if FNM_PATHNAME is
specified.

Maybe you could put a breakpoint in fnmatch and see what goes wrong?

Re: wget 1.11.1 make test fails

2008-04-03 Thread Alain Guibert

 On Wednesday, April 2, 2008 at 23:09:52 +0200, Hrvoje Niksic wrote:

 Micah Cowan [EMAIL PROTECTED] writes:
 It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME

The libc 5.4.33 fnmatch() supports FNM_PATHNAME, and there is code
apparently intending to return FNM_NOMATCH on a slash. But this code
seems to be rather broken.


| printf(%d\n, fnmatch(foo*, foo/bar, FNM_PATHNAME));
 It should print a non-zero value.

Zero on the old system, FNM_NOMATCH on a recent one.


Alain.

Re: wget fails using proxy with https-protocol

2008-04-03 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Micah Cowan wrote:
 The log shows that:
 
   1. Wget still doesn't wait for the Proxy to ask for authentication,
 before sending Proxy-Authorization headers with its first request.
   2. Apparently, when going through a proxy, Wget now correctly waits to
 receive a challenge from the destination server (as I intended), but
 then _doesn't_ respond to the challenge with an Authorization header,
 instead just treating the (first) 401 as a final header.

Slava, could you perhaps download and install Wget 1.11.1, and try it
with the --auth-no-challenge option? That was added to support a case
when there was a genuine need for Wget's older, less secure
authentication behavior; it's intended to disable the new behavior. It
may or may not fix your problem, and I'd be interested to know which it
is. :)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9QzK7M8hyUobTrERAuOUAJ4ygaAyhihkeM/tG0j7hMexnHJZwwCeKhzi
r3OHfZk8bDZu0DnQljyP7vU=
=6i/0
-END PGP SIGNATURE-

Re: wget 1.11.1 make test fails

2008-04-03 Thread Alain Guibert

 On Thursday, April 3, 2008 at 11:08:27 +0200, Hrvoje Niksic wrote:

 Well, it would point to a problem with both the fnmatch replacement
 and the older system fnmatch.  Our fnmatch (coming from an old
 release of Bash

The fnmatch()es in libc 5.4.33 and in Wget are twins. They differ on
some minor details like FNM_CASEFOLD support, and cosmetic things like
parenthesis around return(code). The part dealing with * in pattern is
functionaly identical.


 Maybe you could put a breakpoint in fnmatch and see what goes wrong?

The for loop intended to eat several characters from the string also
advances the pattern pointer. This one reaches the end of the pattern,
and points to a NUL. It is not a '*' anymore, so the loop exits
prematurely. Just below, a test for NUL returns 0.

The body of the loop, returning FNM_NOMATCH on a slash, is not executed
at all. That isn't moderately broken, is it?


Alain.

fnmatch [Re: wget 1.11.1 make test fails]

2008-04-03 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alain Guibert wrote:
 The for loop intended to eat several characters from the string also
 advances the pattern pointer. This one reaches the end of the pattern,
 and points to a NUL. It is not a '*' anymore, so the loop exits
 prematurely. Just below, a test for NUL returns 0.
 
 The body of the loop, returning FNM_NOMATCH on a slash, is not executed
 at all. That isn't moderately broken, is it?

I haven't stepped through it, but it sure looks broken to my eyes too. I
am tired at the moment, though, so may be missing something.

GNUlib has an fnmatch, which might be worth considering for use; but
AIUI it suffers from the same overly-locale-aware problem that system
fnmatches can suffer from (fnmatch fails when the string isn't encoded
properly for the current locale; we often don't even _know_ the original
encoding, especially for FTP, and mainly want * to match any arbitrary
string of byte values). They were looking for someone to address that issue:

http://lists.gnu.org/archive/html/bug-gnulib/2008-02/msg00019.html

Perhaps, if I'm motivated and somehow scrounge the time, I can fix the
problem in their code, and then use it in ours? :)

Or, if someone else with more time would like to tackle it, I'm sure
that'd also be welcome. :)

I responded to the message linked above with a note that Wget also had a
need for such functionality, along with some questions about the
approach, but hadn't received a response. Maybe I'll try again.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH9XBy7M8hyUobTrERAtReAJ94Ac0ClInQOE7qq7OQxon87zj7JACeOTz3
Lfafi0U2phRDnFqQ2IPSx+s=
=9yU/
-END PGP SIGNATURE-

Re: wget 1.11.1 make test fails

2008-04-02 Thread Alain Guibert

Hello Micah,

 On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:

 could you try to isolate which part of test_dir_matches_p is failing?

The only failing src/utils.c test_array[] line is:

| { { *COMPLETE, NULL, NULL }, foo/!COMPLETE, false },

I don't understand enough of dir_matches_p() and fnmatch() to guess what
is supposed to happen. But with false replaced by true, this test and
following succeed.

| ALL TESTS PASSED
| Tests run: 7

Of course this test then fails on newer systems.


Alain.

Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic

Alain Guibert [EMAIL PROTECTED] writes:

 Hello Micah,

  On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:

 could you try to isolate which part of test_dir_matches_p is failing?

 The only failing src/utils.c test_array[] line is:

 | { { *COMPLETE, NULL, NULL }, foo/!COMPLETE, false },

 I don't understand enough of dir_matches_p() and fnmatch() to guess
 what is supposed to happen. But with false replaced by true, this
 test and following succeed.

'*' is not supposed to match '/' in regular fnmatch.

It sounds like a libc problem rather than a gcc problem.  Try
#undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.

Re: wget 1.11.1 make test fails

2008-04-02 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hrvoje Niksic wrote:
 Alain Guibert [EMAIL PROTECTED] writes:
 
 Hello Micah,

  On Monday, March 31, 2008 at 11:39:43 -0700, Micah Cowan wrote:

 could you try to isolate which part of test_dir_matches_p is failing?
 The only failing src/utils.c test_array[] line is:

 | { { *COMPLETE, NULL, NULL }, foo/!COMPLETE, false },

 I don't understand enough of dir_matches_p() and fnmatch() to guess
 what is supposed to happen. But with false replaced by true, this
 test and following succeed.
 
 '*' is not supposed to match '/' in regular fnmatch.

Well, that's assuming you pass it the FNM_PATHNAME flag (which, for
dir_matches_p, we always do).

 It sounds like a libc problem rather than a gcc problem.  Try
 #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.

It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I
mean, don't most shells rely on this to handle file globbing and whatnot?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH86+L7M8hyUobTrERApHKAJsFbO8+PtAqFhHJ2Psv1AuKSy17YwCcDsi2
9WHcJ0Pzkc4XmNbcEUCXf6U=
=r8ZV
-END PGP SIGNATURE-

Re: wget fails using proxy with https-protocol

2008-04-02 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Micah Cowan wrote:
 Julien, I've CC'd you, in case you think this might be something you'd
 want to add to your GSoC proposal. If it _is_, it's probably something
 that should be done before the rest, so I can backport it into the 1.11
 branch for a 1.11.2 release (since this is an important regression),
 rather than make people wait for 1.12 to come out (which is where I
 expect the rest of the authorization improvements would go).

Er, on reflection, that's a terrible idea, given that coding for GSoC
doesn't even start until nearly June, and this is a serious regression
that should be fixed as soon as it can be got to.

Still, if you'd like to tackle it out-of-band, that'd be handy. :)

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH87mY7M8hyUobTrERAuyjAJ0XJ8ImAFZ/J49EGQlc+HWWNdxhQACgiK3U
bgyhQErH//V6bDkaeE9mLYM=
=3fn1
-END PGP SIGNATURE-

Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic

Micah Cowan [EMAIL PROTECTED] writes:

 It sounds like a libc problem rather than a gcc problem.  Try
 #undefing SYSTEM_FNMATCH in sysdep.h and see if it works then.

 It's hard for me to imagine an fnmatch that ignores FNM_PATHNAME: I
 mean, don't most shells rely on this to handle file globbing and
 whatnot?

The conventional wisdom among free software of the 90s was that
fnmatch() was too buggy to be useful.  For that reason all free shells
rolled their own fnmatch, as did other programs that needed it,
including Wget.  Maybe the conventional wisdom was right for the
reporter's system.

Another possibility is that something else is installing fnmatch.h in
a directory on the compiler's search path and breaking the system
fnmatch.  IIRC Apache was a known culprit that installed fnmatch.h in
/usr/local/include.  That was another reason why Wget used to
completely ignore system-provided fnmatch.

In any case, it should be easy enough to isolate the problem:

#include stdio.h
#include fnmatch.h
int main()
{
  printf(%d\n, fnmatch(foo*, foo/bar, FNM_PATHNAME));
  return 0;
}

It should print a non-zero value.

Re: wget 1.11.1 make test fails

2008-04-02 Thread Hrvoje Niksic

Micah Cowan [EMAIL PROTECTED] writes:

 I'm wondering whether it might make sense to go back to completely
 ignoring the system-provided fnmatch?

One argument against that approach is that it increases code size on
systems that do correctly implement fnmatch, i.e. on most modern
Unixes that we are targeting.  Supporting I18N file names would
require modifications to our fnmatch; but on the other hand, we still
need it for Windows, so we'd have to make those changes anyway.

Providing added value in our fnmatch implementation should go a long
way towards preventing complaints of code bloat.

 In particular, it would probably resolve the remaining issue with
 that one bug you reported about fnmatch() failing on strings whose
 encoding didn't match the locale.

It would.

 Additionally, I've been toying with the idea of adding something
 like a ** to match all characters, including slashes.

That would be great.  That kind of thing is known to zsh users anyway,
and it's a useful feature.

Re: Wget 1.11 build fails on old Linux

2008-03-31 Thread Alain Guibert

 On Monday, February 25, 2008 at 16:32:21 +0100, Alain Guibert wrote:

 On an old Debian Bo system (kernel 2.0.40, gcc 2.7.2.1, libc 5.4.33),
 building Wget 1.11 fails:

While wget 1.11.1 builds and works OK. Thank you very much, gentlemen!


Alain.

Re: wget 1.11.1 make test fails

2008-03-31 Thread Micah Cowan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alain Guibert wrote:
 Hello,
 
 With an old gcc 2.7.2.1 compiler, wget 1.11.1 make test fails:
 
 | gcc -I. -I. -I./../src  -DHAVE_CONFIG_H 
 -DSYSTEM_WGETRC=\/usr/local/etc/wgetrc\ 
 -DLOCALEDIR=\/usr/local/share/locale\ -O2 -Wall -DTESTING -c ../src/test.c
 | ../src/test.c: In function `all_tests':
 | ../src/test.c:51: parse error before `const'

snip

 The attached make-test.patch seems to fix this.

Yeah; that's invalid C90 code; declaration following statement. I'll fix
that.

 However later the 3rd
 test fails:
 
 | ./unit-tests
 | RUNNING TEST test_parse_content_disposition...
 | PASSED
 |
 | RUNNING TEST test_subdir_p...
 | PASSED
 |
 | RUNNING TEST test_dir_matches_p...
 | test_dir_matches_p: wrong result
 | Tests run: 3
 | make[1]: *** [run-unit-tests] Error 1
 | make[1]: Leaving directory `/tmp/wget-1.11.1/tests'
 | make: *** [test] Error 2

That's an interesting failure. I wonder if it's one of the new cases I
just added...

In any case, it runs through fine for me. This suggests a difference in
behavior between your system fnmatch function and mine (since that
should be the only bit of external code that dir_matches_p relies on).

Pity the tests don't give much clue as to the specifics of what
failed... there are about 10 tests for test_dir_matches_p, any of which
could have caused the problem.

The whole testing thing needs some serious rework; which is my current
top priority, when I find time for it (GSoC is eating everything, right
now).

make test isn't actually expected to work completely, right now; some
of the .px tests are known to be broken/missing. They're basically
provided as-is. I thought about removing them for the official
package; maybe I should have.

But if I had, I'd still be blissfully unaware of this potential problem.

If you know how, and don't mind, could you try to isolate which part of
test_dir_matches_p is failing? Perhaps augmenting the error message to
spit the match-list and string arguments...

- --
Thanks,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH8S/v7M8hyUobTrERAhrPAJ9N+XqLeVP0NN9HkLxO162Zf2uJnACeMwUo
kew/FkMA2GljqWiPG6IC+zs=
=fQSH
-END PGP SIGNATURE-

Re: wget aborts when file exists

2008-03-19 Thread Aleksandar Radulovic


(I am not subscribed to the bug-list)


**
On 12.03.2008 at 20:36 Charles wrote:

On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar
Radulovic
[EMAIL PROTECTED] wrote:

 (I am not subscribed to the bug-list)

Hello,

I use wget to retreive recurively images from
a site,
 which are randomly changed on a daily basis . I
wrote
 small batch which worked until sistem upgrade. Now
the
 new version of wget is installed but it aborts when
 any file already exists.

When I tried this in my wget, I got different behavior
with wget 1.11
alpha and wget 1.10.2

D:\wget --proxy=off -r -l 1 -nc -np
http://localhost/test/
File `localhost/test/index.html' already there; not
retrieving.


D:\wget110 --proxy=off -r -l 1 -nc -np
http://localhost/test/
File `localhost/test/index.html' already there; not
retrieving.

File `localhost/test/a.gif' already there; not
retrieving.

File `localhost/test/b.gif' already there; not
retrieving.

File `localhost/test/c.jpg' already there; not
retrieving.

FINISHED --20:31:41--
Downloaded: 0 bytes in 0 files

I think wget 1.10.2 behavior is more correct. Anyway
it did not abort
in my case.

---
Charles



It brokes and it is wget 1.10.2 . I really don't know
why, and I can't have influence in that because I am
not an administrator of the system, just an user.
However, it seems that this bug occures.

Aca





  

Never miss a thing.  Make Yahoo your home page. 
http://www.yahoo.com/r/hs

Re: wget aborts when file exists

2008-03-13 Thread Hrvoje Niksic

Charles [EMAIL PROTECTED] writes:

 On Thu, Mar 13, 2008 at 1:17 AM, Hrvoje Niksic [EMAIL PROTECTED] wrote:
   It assums, though, that the preexisting index.html corresponds to
   the one that you were trying to download; it's unclear to me how
   wise that is.

  That's what -nc does.  But the question is why it assumes that
  dependent files are also present.

 Because I repeated the command, and the files have all been downloaded
 before.

We know that, but Wget 1.11 doesn't seem to check it.  It only checks
index.html, but not the other dependent files.

Re: wget aborts when file exists

2008-03-12 Thread Charles

On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar Radulovic
[EMAIL PROTECTED] wrote:

 (I am not subscribed to the bug-list)

Hello,

I use wget to retreive recurively images from a site,
 which are randomly changed on a daily basis . I wrote
 small batch which worked until sistem upgrade. Now the
 new version of wget is installed but it aborts when
 any file already exists.

When I tried this in my wget, I got different behavior with wget 1.11
alpha and wget 1.10.2

D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/
File `localhost/test/index.html' already there; not retrieving.


D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
File `localhost/test/index.html' already there; not retrieving.

File `localhost/test/a.gif' already there; not retrieving.

File `localhost/test/b.gif' already there; not retrieving.

File `localhost/test/c.jpg' already there; not retrieving.

FINISHED --20:31:41--
Downloaded: 0 bytes in 0 files

I think wget 1.10.2 behavior is more correct. Anyway it did not abort
in my case.

---
Charles

Re: wget aborts when file exists

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Aleksandar Radulovic wrote:
 (I am not subscribed to the bug-list)
 
   Hello,
 
   I use wget to retreive recurively images from a site,
 which are randomly changed on a daily basis . I wrote
 small batch which worked until sistem upgrade. Now the
 new version of wget is installed but it aborts when
 any file already exists.
   I call it in the following way (I hid the URL and
 proxies):
 
 wget -np -nc -r -l 1  URL

I strongly suspect this issue does not appear in the current Wget
release (version 1.11). Please obtain a copy of that (or better yet, the
prerelease version at http://alpha.gnu.org/gnu/wget/, which fixes some
other issues), and see if you can reproduce it.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH1/5A7M8hyUobTrERAjmVAKCJ6P7eunjPaptm80rFc9si7lejWgCfQHoL
xTd0dLxEr2odzHcurg+5LqQ=
=VpCG
-END PGP SIGNATURE-

Re: wget aborts when file exists

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Charles wrote:
 On Wed, Mar 12, 2008 at 12:46 AM, Aleksandar Radulovic
 [EMAIL PROTECTED] wrote:
 (I am not subscribed to the bug-list)

Hello,

I use wget to retreive recurively images from a site,
 which are randomly changed on a daily basis . I wrote
 small batch which worked until sistem upgrade. Now the
 new version of wget is installed but it aborts when
 any file already exists.
 
 When I tried this in my wget, I got different behavior with wget 1.11
 alpha and wget 1.10.2
 
 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/
 File `localhost/test/index.html' already there; not retrieving.
 
 
 D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
 File `localhost/test/index.html' already there; not retrieving.
 
 File `localhost/test/a.gif' already there; not retrieving.
 
 File `localhost/test/b.gif' already there; not retrieving.
 
 File `localhost/test/c.jpg' already there; not retrieving.
 
 FINISHED --20:31:41--
 Downloaded: 0 bytes in 0 files
 
 I think wget 1.10.2 behavior is more correct. Anyway it did not abort
 in my case.

I think I like the 1.11 behavior (I'm assuming it's intentional). It
assums, though, that the preexisting index.html corresponds to the one
that you were trying to download; it's unclear to me how wise that is.
Hrvoje, are you aware of this change and its rationale?

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH1/7B7M8hyUobTrERAgY+AJ4+uSRpDzUnmgiaWSanFGsFET/BRACfcGnT
eRcfOIAHhDvibRn0/EQiAB4=
=GH8N
-END PGP SIGNATURE-

Re: wget aborts when file exists

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Charles wrote:
 On Wed, Mar 12, 2008 at 11:03 PM, Micah Cowan [EMAIL PROTECTED] wrote:
  I think I like the 1.11 behavior (I'm assuming it's intentional). It
  assums, though, that the preexisting index.html corresponds to the one
  that you were trying to download; it's unclear to me how wise that is.
  Hrvoje, are you aware of this change and its rationale?
 
 Hi,
 
 One drawback of this behavior is that when we mirror a website, and
 then we cancel it, but the server does not provide last modified
 header (because the content is dynamically generated, for example), we
 cannot continue from the point we cancel the download (all the files
 will have to be downloaded again).

If you didn't want that, you probably shouldn't specify -nc, and should
instead specify -c.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH2A8x7M8hyUobTrERAvqZAJ9JGvX60DJBheqB/BjiEQh9KIRpPgCbBccX
bD/mUv5ee+dRxFXPBZtGE+o=
=7fvu
-END PGP SIGNATURE-

Re: wget aborts when file exists

2008-03-12 Thread Hrvoje Niksic

Micah Cowan [EMAIL PROTECTED] writes:

 When I tried this in my wget, I got different behavior with wget 1.11
 alpha and wget 1.10.2
 
 D:\wget --proxy=off -r -l 1 -nc -np http://localhost/test/
 File `localhost/test/index.html' already there; not retrieving.
 
 
 D:\wget110 --proxy=off -r -l 1 -nc -np http://localhost/test/
 File `localhost/test/index.html' already there; not retrieving.
 
 File `localhost/test/a.gif' already there; not retrieving.
 
 File `localhost/test/b.gif' already there; not retrieving.
 
 File `localhost/test/c.jpg' already there; not retrieving.
 
 FINISHED --20:31:41--
 Downloaded: 0 bytes in 0 files
 
 I think wget 1.10.2 behavior is more correct. Anyway it did not abort
 in my case.

 I think I like the 1.11 behavior (I'm assuming it's intentional).

Let me recap to see if I understand the difference.  From the above
output, it seems that 1.10's -r descended into an HTML even if it was
downloaded.  1.11's -r assumes that if an HTML file is already there,
then so are all the other files it references.

If this analysis is correct, I don't see the benefit of the new
behavior.  If index.html happens to be present, it doesn't mean that
the files it references are also present.  I don't know if the change
was intentional, but it looks incorrect to me.

 It assums, though, that the preexisting index.html corresponds to
 the one that you were trying to download; it's unclear to me how
 wise that is.

That's what -nc does.  But the question is why it assumes that
dependent files are also present.

Re: wget aborts when file exists