Re: How to change the name of the output file

2007-12-06 Thread Mauro Tortonesi
On Wed, 05 Dec 2007 21:41:14 -0800
Micah Cowan [EMAIL PROTECTED] wrote:

 Mauro Tortonesi wrote:
  On Tuesday 20 November 2007 20:38:13 Micah Cowan wrote:
  
   Be advised, though, that -O doesn't simply mean "make the name of the
   downloaded result `filename'"; it means "act as if you're redirecting
   output to a file named `filename'". In particular, this means that such
   things as timestamping, and multiple URLs, may not work as you expect.
  
  micah,
  
  i believe this text is a good candidate for inclusion in the man page.
 
 Heh, I turned your suggestion into a bug... and then, now that I've had
 a chance to take a closer look, I've discovered I already added some
 clarifying text to the -O option's description. How's this look?
 
 http://hg.addictivecode.org/wget/1.11/rev/5e5eae3f8d9f
 
 
 Use of `-O' is _not_ intended to mean simply "use the name FILE instead
 of the one in the URL"; rather, it is analogous to shell redirection:
 `wget -O file http://foo' is intended to work like `wget -O - http://foo
 > file'; `file' will be truncated immediately, and _all_ downloaded
 content will be written there.
 
 Note that a combination with `-k' is only permitted when downloading a
 single document, and combination with any of `-r', `-p', or `-N' is not
 allowed.
 

looks perfect to me. hopefully, now we'll have fewer complaints from users who 
try -O for multiple downloads ;-)
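
for the record, a quick sketch of what the redirection-like semantics mean in practice (hypothetical URLs, not from the original discussion):

# both documents end up concatenated in the single file `page.html';
# this single-output behaviour is also why combining -O with -N or -r is not allowed
wget -O page.html http://example.com/a.html http://example.com/b.html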

-- 
Mauro Tortonesi [EMAIL PROTECTED]


Re: wget2

2007-11-30 Thread Mauro Tortonesi
On Friday 30 November 2007 14:48:07 David Ginger wrote:

  what do you think?

 Python.

i was asking what you guys think of my "write a prototype using a dynamic 
language, then incrementally rewrite everything in C" proposal, not trying 
to start yet another programming language flame war ;-)

i believe that for sheer application prototyping purposes, ruby and python are 
equally good. in addition, i know and like both of them. so, in case micah is 
actually evaluating ruby and python, i don't really care which one of them he 
will finally choose to adopt.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: wget2

2007-11-30 Thread Mauro Tortonesi
On Friday 30 November 2007 11:59:45 Hrvoje Niksic wrote:
 Mauro Tortonesi [EMAIL PROTECTED] writes:
  I vote we stick with C. Java is slower and more prone to environmental
  problems.
 
  not really. because of its JIT compiler, Java is often as fast as
  C/C++, and sometimes even significantly faster.

 Not if you count startup time, which is crucial for a program like
 Wget.  Memory use is also incomparable.

right. i was not suggesting to implement wget2 in Java, anyway ;-) 

but we could definitely make good use of dynamic languages such as Ruby (my 
personal favorite) or Python, at least for rapid prototyping purposes. both 
Ruby and Python support event-driven I/O (http://rubyeventmachine.com for 
Ruby, and http://code.google.com/p/pyevent/ for Python) and asynch DNS 
(http://cares.rubyforge.org/ for Ruby and 
http://code.google.com/p/adns-python/ for Python) and both are relatively 
easy to interface with C code. 

writing a small prototype for wget2 in Ruby or Python first, and then 
incrementally rewriting it in C, would save us a lot of development time, IMVHO.

what do you think?

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: wget2

2007-11-30 Thread Mauro Tortonesi
On Friday 30 November 2007 03:29:05 Josh Williams wrote:
 On 11/29/07, Alan Thomas [EMAIL PROTECTED] wrote:
  Sorry for the misunderstanding.  Honestly, Java would be a great language
  for what wget does.  Lots of built-in support for web stuff.  However, I
  was kidding about that.  wget has a ton of great functionality, and I am
  a reformed C/C++ programmer (or a recent Java convert).  But I love using
  wget!

 I vote we stick with C. Java is slower and more prone to environmental
 problems.

not really. because of its JIT compiler, Java is often as fast as C/C++, and 
sometimes even significantly faster.

 Wget needs to be as independent as we can possibly make it. 
 A lot of the systems that wget is used on (including mine) do not even
 have Java installed. That would be a HUGE requirement for many people.

i totally agree.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: How to change the name of the output file

2007-11-29 Thread Mauro Tortonesi
On Tuesday 20 November 2007 20:38:13 Micah Cowan wrote:

 Be advised, though, that -O doesn't simply mean "make the name of the
 downloaded result `filename'"; it means "act as if you're redirecting output
 to a file named `filename'". In particular, this means that such things as
 timestamping, and multiple URLs, may not work as you expect.

micah,

i believe this text is a good candidate for inclusion in the man page.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: .1, .2 before suffix rather than after

2007-11-29 Thread Mauro Tortonesi
On Sunday 04 November 2007 22:54:24 Hrvoje Niksic wrote:
 Micah Cowan [EMAIL PROTECTED] writes:
  Christian Roche has submitted a revised version of a patch to modify
  the unique-name-finding algorithm to generate names in the pattern
  foo-n.html rather than foo.html.n. The patch looks good, and
  will likely go in very soon.

 foo.html.n has the advantage of simplicity: you can tell at a glance
 that foo.n is a duplicate of foo.  Also, it is trivial to remove
 the unwanted files by removing foo.*.  Why change what worked so
 well in the past?

i totally agree with hrvoje here. also note that changing wget's 
unique-name-finding algorithm can potentially break lots of wget-based 
scripts out there. i think we should leave this kind of change for wget2 - 
or wget-on-steroids or whatever you want to call it ;-)

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


HEAD request logic summary

2007-08-17 Thread Mauro Tortonesi
hi to everybody,

here is a table summarizing the behaviour of the current wget version (soon to be 
1.11) and of wget 1.10.2 regarding HTTP HEAD requests. i hope the table will be 
useful to determine whether the currently implemented logic is correct.

please notice that micah recently changed the behaviour of the 
--no-content-disposition option, turning it on by default. that is, by default 
wget will not consider the Content-Disposition header in HTTP resource retrieval.


  -N  | --no-content-disposition | Content-Disposition | Preliminary HEAD | Preliminary HEAD  | Test name
      |                          | header present      | request in 1.11  | request in 1.10.2 |
 -----+--------------------------+---------------------+------------------+-------------------+----------------------------------------
  no  | no                       | no                  | yes              | no                | Test-noop
  no  | no                       | yes                 | yes              | no                | Test-HTTP-Content-Disposition
  no  | yes                      | no                  | no               | N/A               | Test--no-content-disposition-trivial
  no  | yes                      | yes                 | no               | N/A               | Test--no-content-disposition
  yes | no                       | no                  | yes              | no                | Test-N
  yes | no                       | yes                 | yes              | no                | Test-N-HTTP-Content-Disposition
  yes | yes                      | no                  | no               | N/A               | Test-N--no-content-disposition-trivial
  yes | yes                      | yes                 | no               | N/A               | Test-N--no-content-disposition


  -O  | --no-content-disposition | Content-Disposition | Preliminary HEAD | Preliminary HEAD  | Test name
      |                          | header present      | request in 1.11  | request in 1.10.2 |
 -----+--------------------------+---------------------+------------------+-------------------+----------------------------------------
  no  | no                       | no                  | yes              | no                | Test-noop
  no  | no                       | yes                 | yes              | no                | Test-HTTP-Content-Disposition
  no  | yes                      | no                  | no               | N/A               | Test--no-content-disposition-trivial
  no  | yes                      | yes                 | no               | N/A               | Test--no-content-disposition
  yes | no                       | no                  | yes              | no                | Test-O
  yes | no                       | yes                 | yes              | no                | Test-O-HTTP-Content-Disposition
  yes | yes                      | no                  | no               | N/A               | Test-O--no-content-disposition-trivial
  yes | yes                      | yes                 | no               | N/A               | Test-O--no-content-disposition


  --spider | -r  | --no-content-disposition | Content-Disposition | Preliminary HEAD | Preliminary HEAD  | Test name
           |     |                          | header present      | request in 1.11  | request in 1.10.2 |
 ----------+-----+--------------------------+---------------------+------------------+-------------------+------------------------------------------------
  yes      | no  | no                       | no                  | yes              | yes               | Test--spider
  yes      | no  | no                       | yes                 | yes              | yes               | Test--spider-HTTP-Content-Disposition
  yes      | no  | yes                      | no                  | yes              | N/A               | Test--spider--no-content-disposition-trivial
  yes      | no  | yes                      | yes                 | yes              | N/A               | Test--spider--no-content-disposition
  yes      | yes | no                       | no                  | yes              | N/A*              | Test--spider-r
  yes      | yes | no                       | yes                 | yes              | N/A*              | Test--spider-r-HTTP-Content-Disposition
  yes      | yes | yes                      | no                  | yes              | N/A*              | Test--spider-r--no-content-disposition-trivial
  yes      | yes | yes                      | yes                 | yes              | N/A*              | Test--spider-r--no-content-disposition

*) recursive spider mode is broken in 1.10.2


-- 
Mauro Tortonesi [EMAIL PROTECTED]


Re: wget bug?

2007-07-09 Thread Mauro Tortonesi
On Mon, 9 Jul 2007 15:06:52 +1200
[EMAIL PROTECTED] wrote:

 wget under win2000/win XP
 I get "No such file or directory" error messages when using the following 
 command line.
 
 wget -s --save-headers "http://www.nndc.bnl.gov/ensdf/browseds.jsp?nuc=%1&class=Arc"
 
 %1 = 212BI
 Any ideas?

hi nikolaus,

in windows, you're supposed to use %VARIABLE_NAME% for variable substitution. 
try using %1% instead of %1.

-- 
Mauro Tortonesi [EMAIL PROTECTED]


Re: New wget maintainer

2007-06-27 Thread Mauro Tortonesi
On Tue, 26 Jun 2007 13:33:35 -0700
Micah Cowan [EMAIL PROTECTED] wrote:

hi micah,

 The GNU Project has appointed me as the new maintainer for wget, to fill
 the shoes that Mauro Tortonesi is leaving. I am very excited to be able
 to take part in the development of such a terrific and useful tool. I've
 certainly found it very helpful on many occasions.

congratulations on your appointment as the new wget maintainer. i hope you'll 
have more time to dedicate to wget than i have had so far, and i am sure you'll 
bring a lot of enthusiasm and new energy to the wget community.


 I have had the opportunity to go over most of the wget source code, and
 the last couple of years' worth of mailing list archives. This has given
 me a fairly good sense of where the project is, and where it could be
 going. I already have some ideas of some of the things I would like to
 see happen; many of them are already in the current TODO file. I've also
 assigned rough priorities (my own) to things I've seen in the TODO file,
 or bugs that have been reported on-list. Ideally, I'd like to start
 using a bug tracker to handle these; reading from the list, I know that
 this was Mauro's desire as well. Has consideration been given to using
 Savannah for this purpose?

yes, we definitely need a bug tracker.


 Being that we seem to be very close to a release, I do not want to make
 a bunch of sudden changes, either to current processes or to the current
 plans for the imminent release. However, there are a couple of small
 items that I feel should absolutely be resolved before 1.11 is released
 officially:
 
   - Wget should not be attempting basic authentication before it
 receives a challenge (which could be digest or what have you). This is a
 security issue.

i am not so sure this is a critical point. as hrvoje pointed out, basic 
authentication is definitely the most widely used authentication mechanism on the 
web, so changing the current policy to perform digest authentication first and use 
basic authentication as a fallback might result in a performance penalty. in 
addition, both basic and digest authentication are meant to be used over https 
only. in fact, while digest authentication does not send the password in clear 
text over the wire, it certainly does not protect from man-in-the-middle attacks.

wrt digest authentication, it would be nice to have it work for proxy 
connections as well. so far, wget supports only basic authentication for HTTP 
proxies (no NTLM authentication either).


   - There was a report to the mailing list that user:pass information
 was being sent in the Referer header. I didn't see any further activity
 on that thread, and haven't yet had the opportunity to confirm this; it
 may be an old, fixed issue. However, if it's true, I would consider this
 to be a show-stopper.

yes, we need to check that.

 
 I expect that both of these issues would require very small effort to
 resolve.

don't be so sure about it ;-)


 Also, GNU maintainers have been asked to move all packages to version 3
 of the GPL, which will be released on Friday the 29th. Ideally,
 maintainers have been asked to coincide releases with the license
 updates with the release of GPLv3; I don't think this is feasible in our
 case. Barring that, we have been asked to get such a release out by
 end-of-July. I'm not certain whether 1.11 will be ready in time; in that
 case, we could probably issue a 1.10.3 with only the licensing change.

IMVHO, the code in the trunk is ready to be released.


-- 
Mauro Tortonesi [EMAIL PROTECTED]


Re: Automate ul/dl using wget

2007-06-21 Thread Mauro Tortonesi
On Mon, 18 Jun 2007 10:13:48 -0700 (PDT)
Joe Kopra [EMAIL PROTECTED] wrote:

 Please forgive my ignorance if this question is misdirected, if you know a 
 better tool to do what I am attempting, please tell me.
 
 I am trying to upload a file from a unix script to a website (that is 
 interactive) and get the resulting .html back to the unix box.
 
 I am including a sample .mup file for use with the website and of course the 
 site itself.

 I believe IBM has found a vulnerability using their programmatic utility 
 and as such has shut it down, so I am trying this as a workaround, please see 
 below:

 http://www14.software.ibm.com/webapp/set2/mds/fetch?page=mds.html

 see "Upload a data file" and "About programmatic upload of survey files".

 .mup file included for testing purposes.
 
 If there is no solution with wget, please recommend anything you think might 
 help.

hi joe,

wget does not natively support multipart uploads at the moment. but you might 
be able to do what you need using this shell script:


#!/bin/bash

BOUNDARY=AaB03x

# build the multipart body: opening boundary plus the form-data part headers...
echo -ne "--$BOUNDARY\r\nContent-disposition: form-data; name=\"mdsData\"\r\nContent-Type: text/plain\r\n\r\n" > tmpfile
# ...then the file itself, followed by the closing boundary
cat "$1" >> tmpfile
echo -ne "\r\n--$BOUNDARY--\r\n" >> tmpfile

wget --header="Content-Type: multipart/form-data; boundary=$BOUNDARY" \
     --post-file=tmpfile \
     http://www14.software.ibm.com/webapp/set2/mds/mds

rm -f tmpfile

# end of script


the usage, of course, is:

sh scriptname filetoupload


let me know if this solved your problem. but, please, let's continue this 
conversation on the wget ml.

-- 
Mauro Tortonesi [EMAIL PROTECTED]


website updated

2007-01-09 Thread Mauro Tortonesi


hi to everybody,

i have just updated wget's website on sunsite.dk:

http://wget.sunsite.dk/

please take a look at it and tell me what you think about it. i suck at 
css, so the site graphics are very lean and mean (i would say practically 
nonexistent). if any of you guys wants to work on a more attractive 
layout for the website, you're more than welcome to do it.


i am planning to rewrite the development page. the current one is just a 
placeholder. in particular, i have some ideas about a new feature 
wishlist section which i think could be very interesting for our users. 
i am also open to any kind of suggestions about changing current web 
pages or adding some new content to the website.


from now on, i'll do my best to keep the new website up-to-date.


in addition, i have recently installed a new bugzilla bug tracker for wget:

https://ds.ing.unife.it/bugzilla/

because of several technical problems, the previous bug tracker 
installation was never actually used. however, i expect the new bug 
tracker to be very useful in the bug fixing process.


please notice that the adoption of the new bug tracker will not change 
the current bug reporting procedure from the users' perspective. wget 
users will keep sending bug reports via email to the address 
bug-wget_AT_gnu.org. the bug tracker will only be used by wget 
developers to keep track of reported bugs and of their status. since the 
information in the bug tracker will be accessible to everyone, our users 
will be able to better monitor the status of their bug reports as well.



bottom line: i know i have not been very active recently. i have 
had my share of problems with my job and, on top of that, i just moved 
to a new house. i expect to have more time to work on wget from now on. 
however, i realize i'll never have enough time to dedicate to wget in 
order to do a good job as a maintainer. for this reason, i intend to 
step down from my wget maintainer position. don't worry, i'll keep 
working on wget as a normal developer, so this won't hurt the 
development of wget at all. on the contrary, i expect that my decision 
will significantly help in making the development of wget more agile.


i've just informed the FSF about this. i am sure they will be able to 
find a skilled developer who has more time than i have to work on wget 
and is eager to face the challenge of being the new maintainer.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: problem with no-parent option

2007-01-05 Thread Mauro Tortonesi

Piotr Stankiewicz wrote:

Hello!

I'm using wget for windows version 1.10.2.

I'm trying to download the contents of my photography site. To do that I
created the following command:

wget --wait 2 --random-wait -r -l7 -H -p --convert-links --html-extension \
  -Dpbase.com --exclude-domains forum.pbase.com,search.pbase.com \
  --no-parent -e robots=off http://www.pbase.com/piotrstankiewicz

(I had to use the -H option as the photos are placed on other servers than
www.pbase.com)

Unfortunately wget seems to ignore the --no-parent option, as it starts to
download also www.pbase.com/index.html and www.pbase.com/help.html and other
documents placed in the main directory. I have the impression it's
some kind of bug, although I'm definitely not a wget expert. Could you try to
verify it please?


hi piotr,

both the url you specified:

http://www.pbase.com/piotrstankiewicz

and the urls you don't want to retrieve:

http://www.pbase.com/help.html
http://www.pbase.com/index.html

reside in the same directory, so the --no-parent option can't help you.

you should probably try to append '/' to the first url:

wget --wait 2 --random-wait -r -l7 -H -p --convert-links --html-extension \
  -Dpbase.com --exclude-domains forum.pbase.com,search.pbase.com \
  --no-parent -e robots=off http://www.pbase.com/piotrstankiewicz/

this command should work.


Additionally I tried to use the -R option to exclude those files. In such a
case wget downloads those files and deletes them afterwards, but it follows the
links from those files (which is unwanted by me). I found the information
that it's by design. 


correct. in recursive mode wget retrieves undesired html files to parse 
them for other urls to download, and deletes them after parsing.


But what about introducing another option specifying whether the links from 
the unwanted documents (specified with -R) should be followed or not (in 
some cases it's not welcome)?


i agree. users should be able to tell wget not to retrieve undesired 
html files at all.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Stripping unnecessary ../../ in relative links

2007-01-03 Thread Mauro Tortonesi

Sylvain wrote:

I forgot to add I was using
# wget --version
GNU Wget 1.10.2

and

# uname -sr
Linux 2.6.19.1

Wget has been compiled with ssl and nls support, that's all.


hi sylvain,

could you please try the current version of wget from our subversion 
repository:


http://www.gnu.org/software/wget/wgetdev.html#development

?

this bug should be fixed in the new code.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget problem

2007-01-03 Thread Mauro Tortonesi

[EMAIL PROTECTED] wrote:

Dear Sir,
 
  I have installed wget 1.10.2 on HP-UX 11.23 from http://hpux.cs.utah.edu/hppd/hpux/Gnu/wget-1.10.2/.
Also I have installed the runtime dependency packages like libgcc, gettext, libiconv and openssl.
However, when I run it to get some test web content, the following error is prompted.
 
# wget http://10.1.1.15

--12:46:00--  http://10.1.1.15/
           => `index.html'
Connecting to 10.1.1.15:80... connected.
HTTP request sent, awaiting response... 200 OK
/usr/lib/hpux32/dld.so: Unsatisfied code symbol '__umodsi3' in load module 
'/usr/local/bin/wget'.
Killed

  Could you please help to tell me what's wrong on the issue? Thanks.


hi cheng,

i am not an expert of HP UX, but it seems you have a broken 
installation. are you sure you correctly installed all the required 
dependencies:


libgcc gettext libiconv openssl

(in particular libgcc)?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget - no files retained when 401/403 is received

2007-01-03 Thread Mauro Tortonesi

Chris Dunkle wrote:

Wget developers,

This may not be considered a bug, but it is unexpected behavior for me,
and thus I'm reporting it here. I'm using GNU Wget 1.10.2 with the
following options:

wget -r -x -p --save-headers 192.168.0.1

The web server requires a username and password for the default page,
and thus I receive a 401 Unauthorized response. I would expect that the
HTML data that was returned, including the HTTP headers, would be saved
in the file 192.168.0.1/index.html. But instead, there are no files
written, and the directory isn't even created. In most cases, it would
make sense that nobody would want to save this data, so I can understand
this behavior. But I would like to save whatever data is returned to me,
even if it may not be what I'm expecting. I'm receiving the same results
for a 403 Forbidden response, and this is probably the case for other
ones as well. The 401 outputs Authorization failed. and the 403
outputs xx:xx:xx ERROR 403: Forbidden. after execution.

Would this be considered a bug, or is this just an undocumented feature?
If it's not considered a bug, could a command line option be added that
saves the data from 4xx error code responses rather than just quitting?
Basically, if the connection is successful, and something is returned, I
want to keep it, no matter what it is.


hi chris,

wget currently does not save error messages. i am not sure if such a 
feature would be actually useful for our users, and i am not very keen 
on adding another very-rarely-used feature to wget.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Bug in 1.10.2 vs 1.9.1

2007-01-03 Thread Mauro Tortonesi

Juhana Sadeharju wrote:

Hello. Wget 1.10.2 has the following bug compared to version 1.9.1.
First, the bin/wgetdir is defined as
  wget -p -E -k --proxy=off -e robots=off --passive-ftp
  -o zlogwget`date +%Y%m%d%H%M%S` -r -l 0 -np -U Mozilla --tries=50
  --waitretry=10 $@

The download command is
  wgetdir http://udn.epicgames.com

Version 1.9.1 result: download ok
Version 1.10.2 result: only udn.epicgames.com/Main/WebHome downloaded
and other converted urls are of the form
  http://udn.epicgames.com/../Two/WebHome


hi juhana,

could you please try the current version of wget from our subversion 
repository:


http://www.gnu.org/software/wget/wgetdev.html#development

?

this bug should be fixed in the new code.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget problem

2007-01-03 Thread Mauro Tortonesi
[EMAIL PROTECTED] wrote:
 Dear Mauro,
  
   Yes, we have installed those prerequisite packages but it still failed. 
 We have tried the PA-RISC depot and it works, although we are using 
 the Itanium platform. We have tried another development machine and
 the result is the same. So I suspect the depot information may
 be incorrect.

hi cheng,

do you have a compiler on your machine? maybe you should just try to
install wget from sources.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget in 1.11 beta 1 found

2006-12-27 Thread Mauro Tortonesi

[EMAIL PROTECTED] wrote:

Hi, dear developers!

  When using the -P or --directory-prefix command-line switches in v1.11 Beta 1 and 
later v1.11 Beta 1 (with spider patch), wget does not pay attention to either of them.
It saves files in the current directory. Such incorrect behaviour appears only 
if the server's http answer contains a Content-disposition tag. Wget v1.10.2 worked right.
  Hope this bug won't live long :). 


hi denis,

could you please try the current wget version by downloading sources 
from our svn repository:


http://www.gnu.org/software/wget/wgetdev.html#development

?

i've just committed a patch that should fix this problem:

http://article.gmane.org/gmane.comp.web.wget.patches/1925


thanks,
mauro

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: read file name from HTTP Header

2006-12-01 Thread Mauro Tortonesi

Rares Vernica wrote:

Hi,

Is it possible for wget to read from the HTTP headers the name of the 
file in which to write the output?


For example:
  wget -d http://something.com/A
...
---response begin---
HTTP/1.0 200 OK
...
Content-Type: something/something; name=B
Content-disposition: attachment; filename=B
...
---response end---
200 OK

It will save the downloaded content into file A. I would prefer that the 
content is saved in file B as specified in Content-Type or 
Content-disposition.


the current version of wget has Content-disposition support.

you might want to try it:

http://www.gnu.org/software/wget/wgetdev.html#development

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: -P ignored by parse_content_disposition

2006-09-15 Thread Mauro Tortonesi

Ashley Bone wrote:

When wget determines the local filename from Content-Disposition,
the -P (--directory-prefix) is ignored.  The file is always
downloaded to the current directory.  Looking at 
parse_content_disposition(),
I think this may be by design.  Does anyone know for sure?  


no, it's clearly a bug.


If not, I can submit a patch.


yes, please do it if you can.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Feature suggestion: change detection for wget -c

2006-09-15 Thread Mauro Tortonesi

John McCabe-Dansted wrote:

Wget has no way of verifying that the local file is
  really a valid prefix of the remote file

Couldn't wget redownload the last 4 bytes (or so) of the file?

For a few bytes per file we could detect changes to almost all
compressed files and the majority of uncompressed files.


reliable detection of changes in the resource to be downloaded would be 
a very interesting feature. but do you really think that checking the 
last X (< 100) bytes would be enough to be reasonably sure the resource 
was (not) modified? what about resources which are updated by appending 
information, such as log files?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: timestamp and backup

2006-09-15 Thread Mauro Tortonesi

Olav Mørkrid wrote:

hi

let's say i fetch 10 files from a server with wget.

then i want to download any modifications to these files.

HOWEVER, if a new version of a file is downloaded, i want a backup of
the old file (eg. write to filename.bak, or possibly filename.001
and .002 to keep a record of all versions of a file).

can wget do this?


yes. if file X is already present in your filesystem, by default wget 
downloads the new file and saves it as X.1.
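
for example (hypothetical URL, just to illustrate the default clobber-avoidance behaviour):

wget http://example.com/report.pdf     # saved as report.pdf
wget http://example.com/report.pdf     # saved as report.pdf.1, report.pdf is left untouched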



i tried to combine -N with -nc, which would seem logical (do timestamp
checking, and prevent overwriting), but wget protests that they are
mutually exclusive.

and if i use no options, then wget fetches a new file even though it's
not updated.


you should not use -nc, just -N.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 beta 1 released

2006-09-15 Thread Mauro Tortonesi

Oliver Schulze L. wrote:

Does this version have the connection cache code?


no, not yet. i have some preliminary code for connection caching, but i 
am not going to finish it and merge it into the trunk before wget 1.11 
is released.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: REST - error for files bigger than 4GB

2006-09-15 Thread Mauro Tortonesi

Steven M. Schweda wrote:


   Are you certain that the FTP _server_ can handle file offsets greater
than 4GB in the REST command?


i agree with steven here. it's very likely to be a server-side problem.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: one more thing.

2006-09-15 Thread Mauro Tortonesi

Tate Mitchell wrote:

If anyone could show me how to do this on the wget gui, that would be
appreciated, too.

http://www.jensroesner.de/wgetgui/


wget and wgetgui are related programs, but they are developed by two 
different teams. you should ask this question to the wgetgui authors.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: help downloading site

2006-09-15 Thread Mauro Tortonesi

Tate Mitchell wrote:


Would it be possible to download each lesson individually, so that as
lessons are added, or finished, I can download them w/out re-downloading 
the whole site? Could someone tell me how please? Or would it be possible to

download the whole thing and just re-download parts that have been added
since the previous download?


why don't you try something like:

wget -m -k -np 
http://www.ncsu.edu/project/hindi_lessons/Hindi.Less.01/index.html


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget: ignores Content-Disposition header

2006-09-15 Thread Mauro Tortonesi

Jochen Roderburg wrote:

Noèl Köthe wrote:

Hello,

I can reproduce the following with 1.10.2 and 1.11.beta1:

Wget ignores Content-Disposition header described in RFC 2616,
19.5.1 Content-Disposition.

an example URL is:

http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715;msg=5;att=1 


Sorry, I don't see any Content-Disposition header in this example URL  ;-)

Result of a HEAD request:

200 OK
Connection: close
Date: Fri, 15 Sep 2006 12:58:14 GMT
Server: Apache/1.3.33 (Debian GNU/Linux)
Content-Type: text/html; charset=utf-8
Last-Modified: Mon, 04 Aug 2003 21:18:10 GMT
Client-Date: Fri, 15 Sep 2006 12:58:14 GMT
Client-Response-Num: 1


My own experience is that the 1.11 alpha/beta versions (where this 
feature was introduced) worked fine with the examples I encountered.


Jochen is right:

[EMAIL PROTECTED]:~/tmp$ LANG=C ~/code/svn/wget/src/wget -S -d 
http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715;msg=5;att=1

DEBUG output created by Wget 1.10+devel on linux-gnu.

--16:58:52-- 
http://bugs.debian.org/cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715

Resolving bugs.debian.org... 140.211.166.43
Caching bugs.debian.org = 140.211.166.43
Connecting to bugs.debian.org|140.211.166.43|:80... connected.
Created socket 3.
Releasing 0x00556550 (new refcount 1).

---request begin---
GET /cgi-bin/bugreport.cgi/%252Ftmp%252Fupdate-grub.patch?bug=168715 
HTTP/1.0

User-Agent: Wget/1.10+devel
Accept: */*
Host: bugs.debian.org
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Date: Fri, 15 Sep 2006 14:54:55 GMT
Content-Type: text/html; charset=utf-8
Server: Apache/1.3.33 (Debian GNU/Linux)
Via: 1.1 proxy (NetCache NetApp/5.6.2R1)

---response end---

  HTTP/1.0 200 OK
  Date: Fri, 15 Sep 2006 14:54:55 GMT
  Content-Type: text/html; charset=utf-8
  Server: Apache/1.3.33 (Debian GNU/Linux)
  Via: 1.1 proxy (NetCache NetApp/5.6.2R1)
Length: unspecified [text/html]
Saving to: `%2Ftmp%2Fupdate-grub.patch?bug=168715'

[ <=> ] 20,018      32.6K/s   in 0.6s


Closed fd 3
16:58:54 (32.6 KB/s) - `%2Ftmp%2Fupdate-grub.patch?bug=168715' saved [20018]


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: --html-extension and --convert-links don't work together

2006-09-15 Thread Mauro Tortonesi

Ryan Barrett wrote:

hi wget developers! nicolas mizel reported a bug with --html-extension and
--convert-links about a year and a half ago. in a nutshell, 
--html-extension

appends .html to non-html filenames, but --convert-links doesn't use the
.html filenames when it converts links.

http://www.mail-archive.com/wget@sunsite.dk/msg07688.html

he reported it against 1.9.1, but it's still broken in 1.10.2. any 
chance it could be fixed in the next release?


in my opinion, this is a serious bug. we should fix it ASAP.

i have a lot on my plate right now, but if it'd help, i could probably 
whip up a patch in a few weeks or so...


that would be great. thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Bug

2006-09-15 Thread Mauro Tortonesi

Reece wrote:

Found a bug (sort of).

When trying to get all the images in the directory below:
http://www.netstate.com/states/maps/images/

It gives 403 Forbidden errors for most of the images even after
setting the agent string to firefox's, and setting -e robots=off

After a packet capture, it appears that the site will give the
forbidden error if the Referer is not exactly correct.  However,
since wget actually uses the domain www.netstate.com:80 instead of
the one without the port, it screws it all up.  I've been unable to find any
way to tell wget not to insert the port in the requesting url and
referrer url.

Here is the full command I was using:

wget -r -l 1 -H -U "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT
5.0)" -e robots=off -d -nh http://www.netstate.com/states/maps/images/


hi reece,

that's an interesting bug. i've just added it to my THINGS TO FIX list.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 beta 1 released

2006-08-28 Thread Mauro Tortonesi

Noèl Köthe wrote:

On Tuesday, 22.08.2006, at 17:00 +0200, Mauro Tortonesi wrote:

Hello, Mauro,


i've just released wget 1.11 beta 1:


Thanks.:)


you're very welcome to try it and report every bug you might encounter.


...
/usr/bin/make install 
DESTDIR=/home/nk/debian/wget/wget-experimental/wget-1.10.2+1.11.beta1/debian/wget
make[1]: Entering directory 
`/home/nk/debian/wget/wget-experimental/wget-1.10.2+1.11.beta1'
cd src && /usr/bin/make CC='gcc' CPPFLAGS='' DEFS='-DHAVE_CONFIG_H 
-DSYSTEM_WGETRC=\"/etc/wgetrc\" -DLOCALEDIR=\"/usr/share/locale\"' 
CFLAGS='-D_FILE_OFFSET_BITS=64 -g -Wall' LDFLAGS='' LIBS='-ldl -lrt  -lssl -lcrypto ' DESTDIR='' 
prefix='/usr' exec_prefix='/usr' bindir='/usr/bin' infodir='/usr/share/info' mandir='/usr/share/man' 
manext='1' install.bin
make[2]: Entering directory 
`/home/nk/debian/wget/wget-experimental/wget-1.10.2+1.11.beta1/src'
../mkinstalldirs /usr/bin
/usr/bin/install -c wget /usr/bin/wget
...

I set DESTDIR in line 1 to install it somewhere but in line 3 DESTDIR=''

The problem should be fixed by this:

--- Makefile.in.orig2006-08-25 19:53:41.0 +0200
+++ Makefile.in 2006-08-25 19:53:55.0 +0200
@@ -77,7 +77,7 @@
 # flags passed to recursive makes in subdirectories
 MAKEDEFS = CC='$(CC)' CPPFLAGS='$(CPPFLAGS)' DEFS='$(DEFS)' \
 CFLAGS='$(CFLAGS)' LDFLAGS='$(LDFLAGS)' LIBS='$(LIBS)' \
-DESTDIR='$(DESTDIR=)' prefix='$(prefix)' exec_prefix='$(exec_prefix)' \
+DESTDIR='$(DESTDIR)' prefix='$(prefix)' exec_prefix='$(exec_prefix)' \
 bindir='$(bindir)' infodir='$(infodir)' mandir='$(mandir)' \
 manext='$(manext)'


Fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 alpha1 [Fwd: Bug#378691: wget --continue doesn't workwith HTTP]

2006-08-28 Thread Mauro Tortonesi

Jochen Roderburg wrote:


I have now tested the new wget 1.11 beta1 on my Linux system and the above issue
is solved now. The "Remote file is newer" message now only appears when the
local file exists, and most of the other logic with time-stamping and
file-naming works as expected.


excellent.


I meanwhile found, however, another new problem with time-stamping, which mainly
occurs in connection with a proxy-cache, I will report that in a new thread.
Same for a small problem with the SSL configuration.


thank you very much for the useful bug reports you keep sending us ;-)

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Failing assertion in Wget 2187

2006-08-28 Thread Mauro Tortonesi

Stefan Melbinger wrote:

Hello everyone,

I'm having troubles with the newest trunk version of wget (revision 2187).

Command-line arguments:

wget
  --recursive
  --spider
  --no-parent
  --no-directories
  --follow-ftp
  --retr-symlinks
  --no-verbose
  --level='2'
  --span-hosts
  --domains='www.example.com,a.example.com,b.example.com'
  --user-agent='Example'
  --output-file='example.log'
  'www.euroskop.cz'

Results in:

wget: url.c:1934: getchar_from_escaped_string: Assertion `str && *str' 
failed.

Aborted

Can somebody reproduce this problem? Am I using illegal combinations of 
arguments? Any ideas?


(Worked before the newest patch.)


it's really weird. with this command:

wget -d --verbose --recursive --spider --no-parent --no-directories 
--follow-ftp --retr-symlinks --level='2' --span-hosts 
--user-agent='Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101' 
--domains='www.example.com,a.example.com,b.example.com' 
http://www.euroskop.cz/


i get:

---response begin---
HTTP/1.0 200 OK
Date: Mon, 28 Aug 2006 14:35:14 GMT
Content-Type: text/html
Expires: Mon, 28 Aug 2006 14:35:14 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, 
pre-check=0

Server: Apache/1.3.26 (Unix) Debian GNU/Linux CSacek/2.1.9 PHP/4.1.2
X-Powered-By: PHP/4.1.2
Pragma: no-cache
Set-Cookie: PHPSESSID=b8af8e220f5f1f7321b86ce0524f88b2; expires=Tue, 
29-Aug-06 14:35:14 GMT; path=/

Via: 1.1 proxy (NetCache NetApp/5.6.2R1)

---response end---
200 OK

Stored cookie www.euroskop.cz -1 (ANY) / permanent insecure [expiry 
2006-08-29 16:35:14] PHPSESSID b8af8e220f5f1f7321b86ce0524f88b2

Length: unspecified [text/html]
Closed fd 3
200 OK

index.html: No such file or directory

FINISHED --16:37:42--
Downloaded: 0 bytes in 0 files


it seems there is a weird interaction between cookies and the recursive 
spider algorithm that makes wget bail out. i'll have to investigate this.




PS: Just FYI, when I compile I get the following warnings:

http.c: In function `http_loop':
http.c:2425: warning: implicit declaration of function `nonexisting_url'

main.c: In function `main':
main.c:1009: warning: implicit declaration of function `print_broken_links'

recur.c: In function `retrieve_tree':
recur.c:279: warning: implicit declaration of function `visited_url'


fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 beta 1 released

2006-08-28 Thread Mauro Tortonesi

Christopher G. Lewis wrote:

I've updated the Windows binaries to include Beta 1, and included a
binary with Beta 1 + today's patches 2186 & 2187 for spider recursive
mode.

Available here: http://www.ChristopherLewis.com\wget 


thank you very much, chris. you're doing awesome work.


And sorry to those who have been having some problems downloading the
ZIPs from my site.  I had some weird IIS gzip compression issues.


we should plan to move the win32 binaries page to wget.sunsite.dk 
immediately after the 1.11 release. what do you think?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: DNS through proxy with wget

2006-08-28 Thread Mauro Tortonesi

Karr, David wrote:

Inside our firewall, we can't do simple DNS lookups for hostnames
outside of our firewall.  However, I can write a Java program that uses
commons-httpclient, specifying the proxy credentials, and my URL
referencing an external host name will connect to that host perfectly
fine, obviously resolving the DNS name under the covers.

If I then use wget to do a similar request, even if I specify the proxy
credentials, it fails to find the host.  If I instead plug in the IP
address instead of the hostname, it works fine.

I noticed that the command-line options for wget allow me to specify the
proxy user and password, but they don't have a way to specify the proxy
host and port.


right. you have to specify the hostname/IP address and port of your 
proxy in your .wgetrc, or by means of the -e option:


wget -e 'http_proxy = http://yourproxy:8080/' --proxy-user=user \
     --proxy-password=password -Y on http://someurl.com
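
the equivalent settings in ~/.wgetrc would look roughly like this (just a sketch; the proxy host, port and credentials are of course placeholders):

use_proxy = on
http_proxy = http://yourproxy:8080/
proxy_user = user
proxy_password = password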


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Failing assertion in Wget 2187

2006-08-28 Thread Mauro Tortonesi

Stefan Melbinger wrote:
By the way, as you might have noticed I wanted to replace the real 
domain names with example.com, but forgot to replace the last argument. :)


--domains='www.example.com,a.example.com,b.example.com'
--user-agent='Example'
--output-file='example.log'
'www.euroskop.cz'

So, just for the record, the real --domains value was 
'www.euroskop.cz,www2.euroskop.cz,rozcestnik.euroskop.cz'.


thanks.


In this case, that doesn't change the output, tho.


right.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


wget 1.11 beta 1 released

2006-08-22 Thread Mauro Tortonesi


hi to everybody,

i've just released wget 1.11 beta 1:

ftp://alpha.gnu.org/pub/pub/gnu/wget/wget-1.11-beta-1.tar.gz

you're very welcome to try it and report every bug you might encounter.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 alpha1 [Fwd: Bug#378691: wget --continue doesn't workwith HTTP]

2006-08-21 Thread Mauro Tortonesi

Jochen Roderburg wrote:

Quoting Jochen Roderburg [EMAIL PROTECTED]:


Quoting Hrvoje Niksic [EMAIL PROTECTED]:


Mauro, you will need to look at this one.  Part of the problem is that
Wget decides to save to index.html.1 although -c is in use.  That is
solved with the patch attached below.  But the other part is that
hstat.local_file is a NULL pointer when
stat(hstat.local_file, &st) is used to determine whether the file
already exists in the -c case.  That seems to be a result of your
changes to the code -- previously, hstat.local_file would get
initialized in http_loop.


This looks as if if could also be the cause for the problems which I reported
some weeks ago for the timestamping mode
(http://www.mail-archive.com/wget@sunsite.dk/msg09083.html)




Hello Mauro,

The timestamping issues I reported in the above-mentioned message are now also
repaired by the patch you mailed here last week.
Only the small *cosmetic* issue remains that it *always* says:
   Remote file is newer, retrieving.
even if there is no local file yet.


hi jochen,

i have been working on the problem you reported for the last couple of days. 
i've just committed a patch that should fix it for good. could you please try 
the new HTTP code and tell me if it works properly?


thank you very much for your help.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget 1.10.2 hangs

2006-08-21 Thread Mauro Tortonesi

Jonathan Abrahams wrote:


Any idea why this happens?


hi jonathan,

unfortunately i don't have a working cygwin environment at this time, so i won't 
be able to find out by myself what the problem is. maybe you can provide us with 
some debug output by turning on the -d command-line option?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: create single directory

2006-08-21 Thread Mauro Tortonesi

Kim Grillo wrote:

Does anyone know if it's possible to create a single directory with the URL
name instead of a directory tree when using wget?  For example, I don't want
to have to move through each directory to get to the file, I'd like the file
to be in a folder under a directory named after the URL.  I also don't want
to do a recursive wget.


you'll have to use a shell script to do that. for instance, something 
like this might work:


#!/bin/sh
for i
do
  # turn the URL into a flat directory name by replacing slashes with underscores
  dirname=`echo $i | tr / _`
  mkdir "$dirname"
  cd "$dirname"
  wget -nd "$i"
  cd ..
done
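
usage would then be something like (hypothetical URLs):

sh scriptname http://example.com/some/page.html http://example.org/other.html

each URL ends up in its own flat directory, named after the URL with slashes replaced by underscores.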

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Exit code

2006-08-21 Thread Mauro Tortonesi

Gerard Seibert wrote:
I wrote a script that downloads new 'dat' files for my AV program. I am using 
the '-N' option to only download a newer version of the file. What I need is 
for 'wget' to issue an exit code which would indicate whether a newer file 
was downloaded or not. Presently I have the script comparing the time of the 
existing file and then the time of the file after 'wget' has finished 
running. It would be simpler if 'wget' simply issued an exit code.


I have tried various methods but have not been successful in capturing one if 
it does actually issue it. Perhaps someone might have some further 
information on this?


hi gerard,

unfortunately, at the moment wget does not define a specific set of exit 
codes for the different program exit states. that's a major problem we'll 
have to fix in the next 1.12 release.
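
in the meantime, a minimal sketch of the timestamp-comparison workaround you describe (the file name and URL are hypothetical, and GNU stat is assumed):

# remember the mtime before the run, let -N fetch the file only if the remote
# copy is newer, then compare the mtime again to see whether anything was fetched
before=$(stat -c %Y avdefs.dat 2>/dev/null || echo 0)
wget -N http://example.com/av/avdefs.dat
after=$(stat -c %Y avdefs.dat 2>/dev/null || echo 0)
if [ "$after" != "$before" ]; then
    echo "new dat file downloaded"
fi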


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 alpha1 [Fwd: Bug#378691: wget --continue doesn't workwith HTTP]

2006-08-17 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Noèl Köthe [EMAIL PROTECTED] writes:



a wget -c problem report with the 1.11 alpha 1 version
(http://bugs.debian.org/378691):

I can reproduce the problem. If I have already 1 MB downloaded wget -c
doesn't continue. Instead it starts to download again:



Mauro, you will need to look at this one.  Part of the problem is that
Wget decides to save to index.html.1 although -c is in use.  That is
solved with the patch attached below.  But the other part is that
hstat.local_file is a NULL pointer when
stat(hstat.local_file, &st) is used to determine whether the file
already exists in the -c case.  That seems to be a result of your
changes to the code -- previously, hstat.local_file would get
initialized in http_loop.

The partial patch follows:

Index: src/http.c
===
--- src/http.c  (revision 2178)
+++ src/http.c  (working copy)
@@ -1762,7 +1762,7 @@
 
   return RETROK;

 }
-  else
+  else if (!ALLOW_CLOBBER)
 {
      char *unique = unique_name (hs->local_file, true);
      if (unique != hs->local_file)


you're right, of course. the patch included in attachment should fix the 
problem. since the new HTTP code supports Content-Disposition and delays the 
decision of the destination filename until it receives the response header, the 
best solution i could find to make -c work is to send a HEAD request to 
determine the actual destination filename before resuming download if -c is given.


please, let me know what you think.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it
Index: http.c
===
--- http.c  (revisione 2178)
+++ http.c  (copia locale)
@@ -1762,7 +1762,7 @@
 
   return RETROK;
 }
-  else
+  else if (!ALLOW_CLOBBER)
 {
   char *unique = unique_name (hs-local_file, true);
   if (unique != hs-local_file)
@@ -2231,6 +2231,7 @@
 {
   int count;
   bool got_head = false; /* used for time-stamping */
+  bool got_name = false;
   char *tms;
   const char *tmrate;
   uerr_t err, ret = TRYLIMEXC;
@@ -2264,7 +2265,10 @@
   hstat.referer = referer;
 
   if (opt.output_document)
+{
 hstat.local_file = xstrdup (opt.output_document);
+  got_name = true;
+}
 
   /* Reset the counter. */
   count = 0;
@@ -2309,13 +2313,16 @@
   /* Default document type is empty.  However, if spider mode is
  on or time-stamping is employed, HEAD_ONLY commands is
  encoded within *dt.  */
-  if ((opt.spider && !opt.recursive) || (opt.timestamping && !got_head))
+  if ((opt.spider && !opt.recursive)
+  || (opt.timestamping && !got_head)
+  || (opt.always_rest && !got_name))
 *dt |= HEAD_ONLY;
   else
 *dt &= ~HEAD_ONLY;
 
   /* Decide whether or not to restart.  */
   if (opt.always_rest
+      && got_name
       && stat (hstat.local_file, &st) == 0
       && S_ISREG (st.st_mode))
 /* When -c is used, continue from on-disk size.  (Can't use
@@ -2484,6 +2491,12 @@
   continue;
 }
   
+  if (opt.always_rest && !got_name)
+{
+  got_name = true;
+  continue;
+}
+  
   if ((tmr != (time_t) (-1))
       && (!opt.spider || opt.recursive)
       && ((hstat.len == hstat.contlen) ||
Index: ChangeLog
===
--- ChangeLog   (revisione 2178)
+++ ChangeLog   (copia locale)
@@ -1,3 +1,9 @@
+2006-08-16  Mauro Tortonesi  [EMAIL PROTECTED]
+
+   * http.c: Fixed bug which broke --continue feature. Now if -c is
+   given, http_loop sends a HEAD request to find out the destination
+   filename before resuming download.
+
 2006-08-08  Hrvoje Niksic  [EMAIL PROTECTED]
 
* utils.c (datetime_str): Avoid code repetition with time_str.


Re: wget 1.11 alpha1 [Fwd: Bug#378691: wget --continue doesn't workwith HTTP]

2006-08-17 Thread Mauro Tortonesi

Hrvoje Niksic ha scritto:

Mauro Tortonesi [EMAIL PROTECTED] writes:



you're right, of course. the patch included in attachment should fix
the problem. since the new HTTP code supports Content-Disposition
and delays the decision of the destination filename until it
receives the response header, the best solution i could find to make
-c work is to send a HEAD request to determine the actual
destination filename before resuming download if -c is given.

please, let me know what you think.


I don't like the additional HEAD request, but I can't think of a
better solution.


same for me. in order to avoid the overhead of the extra HEAD request, i had 
considered disabling Content-Disposition and using url_file_name to determine 
the destination filename in case -c is given. but i really didn't like that 
solution.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: RES: BUG

2006-08-16 Thread Mauro Tortonesi

Junior + Suporte ha scritto:

Dear Mauro,

Follow the -S output for my command... this user and password is just a test
account, no problems with obfuscation...



C:\Documents and Settings\Luiz Carlos\Desktopwget -S
http://www.tramauniversit
ario.com.br/tuv2/participe/login.jsp?rd=http://www.tramauniversitario.com.br
/tuv
2/enquete/cb/sul/arte.jsp[EMAIL PROTECTED]pass=123qweSubmit
.x=6
Submit.y=1
--12:06:46--
http://www.tramauniversitario.com.br/tuv2/participe/login.jsp?rd=h
ttp://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jspusername=80
2400
[EMAIL PROTECTED]pass=123qweSubmit.x=6Submit.y=1
   =
[EMAIL PROTECTED]
enquete%2Fcb%2Fsul%2Farte.jsp[EMAIL PROTECTED]pass=123qweSu
bmit
.x=6Submit.y=1'
Resolving www.tramauniversitario.com.br... 200.177.252.35, 200.177.252.36
Connecting to www.tramauniversitario.com.br|200.177.252.35|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Date: Tue, 11 Jul 2006 15:06:48 GMT
  Server: Apache/2.0.54 (Unix) mod_jk/1.2.14
  X-Powered-By: JSP/2.0
  Set-Cookie: JSESSIONID=F620EF2BED01FE4FD3900E05DB5A2B24; Path=/tuv2
  Set-Cookie: tu=661541|[EMAIL PROTECTED]; Expires=Fri, 22-Oct-2055
15:06:4
8 GMT; Path=
  Location:
http://www.tramauniversitario.com.br/servlet/login.jsp?username=8024
00391%40terra.com.brpass=123qwerd=http%3A%2F%2Fwww.tramauniversitario.com.
br%2
Ftuv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp
  Content-Length: 0
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Content-Type: text/html;charset=ISO-8859-1
Error in Set-Cookie, field `Path'Syntax error in Set-Cookie:
tu=661541|802400391
@TERRA.COM.BR; Expires=Fri, 22-Oct-2055 15:06:48 GMT; Path= at position 78.
Location:
http://www.tramauniversitario.com.br/servlet/login.jsp?username=802400
391%40terra.com.brpass=123qwerd=http%3A%2F%2Fwww.tramauniversitario.com.br
%2Ft
uv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp [following]
--12:06:47--
http://www.tramauniversitario.com.br/servlet/login.jsp?username=80
2400391%40terra.com.brpass=123qwerd=http%3A%2F%2Fwww.tramauniversitario.co
m.br
%2Ftuv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp
   =
[EMAIL PROTECTED]@terra.com.brpass=123qwerd=http%3A%
2F%2Fwww.tramauniversitario.com.br%2Ftuv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp'
Reusing existing connection to www.tramauniversitario.com.br:80.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Date: Tue, 11 Jul 2006 15:06:48 GMT
  Server: Apache/2.0.54 (Unix) mod_jk/1.2.14
  X-Powered-By: JSP/2.0
  Set-Cookie: JSESSIONID=F52D0A41E21B23C4CAE45AD3461A5817; Path=/servlet
  Location:
http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp
  Content-Length: 0
  Keep-Alive: timeout=15, max=99
  Connection: Keep-Alive
  Content-Type: text/html;charset=ISO-8859-1
Location: http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp
[fol
lowing]
--12:06:48--
http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp
   = `arte.jsp'
Reusing existing connection to www.tramauniversitario.com.br:80.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Tue, 11 Jul 2006 15:06:49 GMT
  Server: Apache/2.0.54 (Unix) mod_jk/1.2.14
  X-Powered-By: JSP/2.0
  Set-Cookie: JSESSIONID=1E3B1DF7F0C37BCDA33995A5E39AD0C4; Path=/tuv2
  Connection: close
  Content-Type: text/html;charset=ISO-8859-1
Length: unspecified [text/html]

[ = ] 3,416 --.--K/s

12:06:49 (47.32 MB/s) - `arte.jsp' saved [3416]

Luiz Carlos Zancanella Junior

-Mensagem original-
De: Mauro Tortonesi [mailto:[EMAIL PROTECTED] 
Enviada em: segunda-feira, 10 de julho de 2006 07:04

Para: Tony Lewis
Cc: 'Junior + Suporte'; [EMAIL PROTECTED]
Assunto: Re: BUG

Tony Lewis ha scritto:


Run the command with -d and post the output here.


in this case, -S can provide more useful information than -d. be careful to 
  obfuscate passwords, though!!!


hi junior,

unfortunately i can't reproduce the bug. here's what i get:


[EMAIL PROTECTED]:~/tmp/wgettest$ wget 
'http://www.tramauniversitario.com.br/tuv2/participe/login.jsp?rd=http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp[EMAIL PROTECTED]pass=123qweSubmit.x=6Submit.y=1'
--15:56:34-- 
http://www.tramauniversitario.com.br/tuv2/participe/login.jsp?rd=http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp[EMAIL PROTECTED]pass=123qweSubmit.x=6Submit.y=1
   = 
`login.jsp?rd=http:%2F%2Fwww.tramauniversitario.com.br%2Ftuv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp[EMAIL PROTECTED]pass=123qweSubmit.x=6Submit.y=1'
Risoluzione di www.tramauniversitario.com.br in corso... 200.177.252.35, 
200.177.252.36

Connessione a www.tramauniversitario.com.br|200.177.252.35:80... connesso.
HTTP richiesta inviata, aspetto la risposta... 404 /participe/login.jsp
15:56:35 ERRORE 404: /participe/login.jsp.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http

concurrent use of -O and -N options

2006-08-16 Thread Mauro Tortonesi


as some of you have noticed, i've recently disabled the use of -N in 
combination with -O in the soon-to-be-released wget 1.11. in fact, not only 
has concurrent use of -O and -N been broken since the dawn of time, but i 
believe it breaks the principle of least surprise and i don't think it is 
widely used.


let me clarify once again that the semantics of -O are intentionally 
similar to a unix shell output redirection. they were not meant to 
specify a custom naming pattern for downloaded resources (future 
versions of wget will likely have a dedicated command-line option for 
this). in this context, i believe that allowing -N to be used w/ -O 
could be very confusing.


Louis Gosselin (included in CC) asked me to reconsider my decision, as 
he believes the concurrent use of -O and -N options is actually very 
helpful. so, before i cross the point of no return and deprecate 
concurrent use of -O and -N, i would like to hear your opinions.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 alpha1 [Fwd: Bug#378691: wget --continue doesn't workwith HTTP]

2006-08-09 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Noèl Köthe [EMAIL PROTECTED] writes:



a wget -c problem report with the 1.11 alpha 1 version
(http://bugs.debian.org/378691):

I can reproduce the problem. If I have already 1 MB downloaded wget -c
doesn't continue. Instead it starts to download again:


Mauro, you will need to look at this one. 


i surely will. unfortunately, at the moment i am attending the winsys 
2006 research conference:


http://www.winsys.org

i'll take a look at the problem as soon as i get back to italy.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: WGet -O and -N timestamp options don't work together

2006-07-24 Thread Mauro Tortonesi

Louis Gosselin wrote:


WGet version 1.8.1

wget -N http://host/file.html

WGet only downloads and overwrites the ./file.html if the local file is 
older than the http copy.

As expected.


wget -O localfile.html -N http://host/remotefile.html

WGet will overwrite ./localfile.html if ./remotefile.html does not exist 
or is older than http://host/remotefile.html.
The expected behavior would be that ./localfile.html would be checked 
for timestamps instead of ./remotefile.html.


This is breaking my scripts because ./remotefile.html does not (and 
should not) exist, resulting in the file always downloading.




The workaround for now is:
1. Move or copy the ./localfile.html to ./tmp/remotefile.html
2. wget without the -O
3. Move or copy the ./tmp/remotefile.html to ./localfile


hi louis,

-O and -N were never meant to work together. in fact, the upcoming 1.11 
release of wget will forbid the use of -N if -O is given.
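
in the meantime, a script following the workaround you describe (filenames 
taken from your example) should keep working, something like:

  mkdir -p tmp
  cp -p localfile.html tmp/remotefile.html 2>/dev/null
  ( cd tmp && wget -N http://host/remotefile.html )
  cp -p tmp/remotefile.html localfile.html

note the -p on cp: preserving the timestamp is what makes the -N comparison 
meaningful.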


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget alpha: -r --spider, number of broken links

2006-07-20 Thread Mauro Tortonesi

Stefan Melbinger ha scritto:

I don't think that non-existing robots.txt-files should be reported as 
broken links (as long as they are not referenced by some page).


Current output, if spanning over 2 hosts (e.g., -D 
www.domain1.com,www.domain2.com):


-
Found 2 broken links.

http://www.domain1.com/robots.txt referred by:
(null)
http://www.domain2.com/robots.txt referred by:
(null)
-

What do you think?


hi stefan,

of course you're right. but you are also late ;-)

in fact, this bug is already fixed in the current version of wget, which 
you can retrieve from our source code repository:


http://www.gnu.org/software/wget/wgetdev.html#development

thank you very much for your report anyway.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget alpha: -r --spider downloading all files

2006-07-20 Thread Mauro Tortonesi

Stefan Melbinger ha scritto:

Hi,

As you might have noticed I was trying to use wget as a tool to check 
for dead links on big websites. The combination of -r and --spider is 
working in the new alpha version, however wget is still downloading ALL 
files (no matter if they are parseable for further links or not), 
instead of just getting the status response for files other than 
text/html or application/xhtml+xml.


I don't think that this makes very much sense; the files are deleted 
anyway and downloading a 300MB video is not useful if you just want to 
check links and see whether the video is there at all.


Could somebody suggest a quick hack to disable the downloading of 
non-parseable documents? I think it must be somewhere in the area of 
http.c, somewhere around gethttp() or maybe http_loop() - unfortunately, 
my knowledge of C and my knowledge of this project weren't enough to get 
any satisfying result.


you're absolutely right, stefan. i've just started working on it.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget alpha: -r --spider downloading all files

2006-07-20 Thread Mauro Tortonesi

Stefan Melbinger ha scritto:

By the way, FTP transfers shouldn't be downloaded as a whole either, in 
this mode.


well, the semantics of --spider for FTP are still not very clear to me.

at the moment, i am considering whether to simply perform an FTP listing 
when --spider is given, or to disable --spider for FTP URLs.


what do you think?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget Win32 Visual Studio project

2006-07-19 Thread Mauro Tortonesi

Christopher G. Lewis ha scritto:
Hi everyone - 


I've uploaded a working Visual Studio project file for the current TRUNK
in subversion. 


excellent. thank you very much, chris.


I'm pretty sure this is the 1.11 Alpha branch.


yes, it is.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Cannot retrieve files from ftp site.

2006-07-19 Thread Mauro Tortonesi

Spiros Melitsopoulos wrote:


Hi all

I am a newbie on wget, so there is a silly
question about its usage.
(If this list is not the place for such questions,
please let me know in order to avoid using it
for this kind of stuff.)

I try to dowlnload a directory with its contents
from an ftp site via proxy. Although the connection
is properly made, what i finaly get is the listing of
the contents of the directory in an .html page,
but none of its actual contents are downloaded.

what i used is wget -r -np -l 10 -w 10 --follow-ftp
while .wgetrc contains the proxy settings
and the following:

reclevel = 15
waitretry = 10
use_proxy = on
dirstruct = on
recursive = on
follow_ftp = on
glob = on
verbose = on
mirror = on
retr_symlinks = on

Can anybody contribute with a hint or advice?
I will be gratefull!


hi spiros,

recursive FTP through proxy has been broken for ages. fortunately, this 
bug was fixed in the recently released 1.11-alpha1 version of wget:


http://www.mail-archive.com/wget@sunsite.dk/msg09071.html

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Bug in wget 1.10.2 makefile

2006-07-17 Thread Mauro Tortonesi

Daniel Richard G. ha scritto:

Hello,

The MAKEDEFS value in the top-level Makefile.in also needs to include 
DESTDIR='$(DESTDIR)'.


fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Documentation (manpage) bug

2006-07-17 Thread Mauro Tortonesi

Linda Walsh ha scritto:

FYI:

On the manpage, where it talks about no-proxy, the manpage
says:
--no-proxy
  Don't use proxies, even if the appropriate *_proxy environment
  variable is defined.

  For more information about the use of proxies with Wget,
   ^
  -Q quota

Note -- the sentence referring to more information about the use of
proxies stops in the middle of saying anything and starts with -Q quota.


fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 alpha1 - content disposition filename

2006-07-17 Thread Mauro Tortonesi

Jochen Roderburg ha scritto:

Hi,

I was happy to see that a long-missed feature is now implemented in this alpha,
namely the interpretation of the filename in the Content-Disposition header.
Just recently I had hacked a little script together to achieve this, when I
wanted to download a greater number of files where this was used  ;-)

I had a few cases, however, which did not come out as expected, but I think the
error is this time in the sending web application and not in wget.

E.g., a file which was supposed to have the name B&W.txt came with the header:
Content-Disposition: attachment; filename=B&amp;W.txt;


the error is definitely in the web application. the correct header would be:

Content-Disposition: attachment; filename=B&W.txt;


All programs I tried (the new wget and several browsers and my own script ;-)
seemed to stop parsing at the first semicolon and produced the filename B&amp.

Any thoughts ??


i think that the filename parsing heuristics currently implemented in 
wget are fine. you really can't do much better in this case.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wishlist: support the file:/// protocol

2006-07-17 Thread Mauro Tortonesi

David wrote:


In replies to the post requesting support of the “file://” scheme, requests
were made for someone to provide a compelling reason to want to do this. 
Perhaps the following is such a reason.


hi david,

thank you for your interesting example. support for “file://” scheme 
will be very likely introduced in wget 1.12.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: login incorrect

2006-07-17 Thread Mauro Tortonesi

Hrvoje Niksic ha scritto:

Gisle Vanem [EMAIL PROTECTED] writes:


Kinda misleading that wget prints login incorrect here. Why
couldn't it just print the 530 message?


You're completely right. It was an ancient design decision made by me
when I wasn't thinking enough (or was thinking the wrong thing).


hrvoje, are you suggesting to extend ftp_login in order to return both 
an error code and an error message?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Using --spider to check for dead links?

2006-07-17 Thread Mauro Tortonesi

Stefan Melbinger ha scritto:

Hello,

I need to check whole websites for dead links, with output easy to parse 
for lists of dead links, statistics, etc... Does anybody have experience 
with that problem or has maybe used the --spider mode for this before 
(as suggested by some pages)?


If this should work, all HTML pages would have to be parsed completely, 
while pictures and other files should only be HEAD-checked for existence 
(in order to save bandwidth)...


Using --spider and --spider -r was not the right way to do this, I fear.

Any help is appreciated, thanks in advance!


hi stefan,

historically, wget never really supported recursive --spider mode. 
fortunately, this has been fixed in 1.11-alpha-1:


http://www.mail-archive.com/wget@sunsite.dk/msg09071.html

so, it will be included in the upcoming 1.11 release.
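
with 1.11-alpha-1, an invocation along these lines should do what you want 
(the URL is just an example); the broken links are summarized at the end of 
the log:

  wget --spider -r -o spider.log http://www.example.com/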

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Suggestion

2006-07-13 Thread Mauro Tortonesi

Kumar Varanasi ha scritto:

Hello there,

I am using WGET in my system to download http files. I see that there is no
option to download the file faster with multiple connections to the server.
Are you planning on a multi-threaded version of WGET to make downloads 
much faster?


no, there is no plan to implement parallel download at the moment.
however, please notice that it is highly unlikely that opening more than 
one connection with the same server will speed up the download process. 
parallel download makes sense only when more than one server is involved.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: mirror mode does not handle ../ properly

2006-07-13 Thread Mauro Tortonesi

Stefan Powell wrote:

In mirror mode (-m) accessing a page with relative links to parent
directories ( http://example.com/somepath/../somefile.html ) the two
dots are URL encoded.  The correct behavior is specified in section
5.2.4 of RFC3986.
(http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#relative-dot-segments)


hi stefan,

which version of wget are you using? 1.11-alpha1 should have fixed that 
problem.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it



Re: Wget

2006-07-13 Thread Mauro Tortonesi

John McGill ha scritto:

Is there a way of telling wget to download the image and increment the file number 


wget does that automatically. suppose that you're using:

wget http://yoyodine.com/somepath/somefilename.txt

if another file named somefilename.txt is present in the current 
directory, the new file is named somefilename.txt.1. if you call wget 
again, the next file will be named somefilename.txt.2, and so on.



or add the date/time stamp?


you have to use the -N option for this.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget

2006-07-13 Thread Mauro Tortonesi

Post, Mark K ha scritto:


You would want to use the -O option, and write a script to create a
unique file name to be passed to wget.


yes, something like this:

UNIQUE_FILENAME=`mktemp`
wget http://someserver.com/somepath/somefile.txt -O $UNIQUE_FILENAME

would probably work.
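
and if you want the date/time stamp john asked about embedded in the name, 
something like this should also work (the date format is just an example):

  wget http://someserver.com/somepath/somefile.txt -O somefile-$(date +%Y%m%d-%H%M%S).txt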

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: BUG

2006-07-10 Thread Mauro Tortonesi

Tony Lewis ha scritto:


Run the command with -d and post the output here.


in this case, -S can provide more useful information than -d. be careful to 
 obfuscate passwords, though!!!


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget build/debugging!! -debugger tool

2006-07-10 Thread Mauro Tortonesi

bruce ha scritto:


when you guys are building/testing wget, are you ever using any kind of IDE?


no, i only use vim:

http://www.vim.org


and while i can get it to build using Eclipse on my linux box, i can't seem
to figure out what i need to do within the settings to actually be able to
step into various functions once i'm in the main () function. and if you
can't step into/through functions.. debugging gets to be a pain!!!


not really. you can use gdb from command line:

http://www.gnu.org/software/gdb/

or a GUI front-end to GDB like insight:

http://sources.redhat.com/insight/

or ddd:

http://www.gnu.org/software/ddd/

but for network programming i have always found brian kernighan's approach 
(a well-placed printf is the best debugger) to be invaluable.
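
that said, if you still want to step into functions, a plain command-line gdb 
session against an in-tree build is usually enough (the arguments are just an 
example):

  cd src
  gdb --args ./wget -d http://www.example.com/
  (gdb) break main
  (gdb) run
  (gdb) step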


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Windows compler need (was RE: wget 1.11 alpha 1 released)

2006-07-10 Thread Mauro Tortonesi

Christopher G. Lewis ha scritto:

OK, the Win32 compile is working, I've got both the SVN Trunk and the 1.11
alpha branch from
ftp://alpha.gnu.org/pub/pub/gnu/wget/wget-1.11-alpha-1.tar.gz . We'll
obviously work through the warnings that are coming up, and re-address the
CL parameters to fit with the VS 2005 C Compiler.


excellent.


I think we should make this the default supported compiler for the 1.11
release if we can confirm that we compile with VC++ Express (which is free
from MS).


i agree. MSVC 14 (AKA Visual Studio 2005's C Compiler) should be the 
default supported compiler for the future Wget releases. fortunately, i 
have been able to setup a build environment w/ MSVC 14 on my laptop so from 
now on i'll be able to help w/ Win32-related problems.



We should also double-check on OpenSSL, since MS now includes
MASM as a free download for VC++ Express users. I'm going to have to spin up
a VM to test the VC++ Express compile - I should be able to do this sometime
this weekend.


does VC++ Express also provide nmake?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget - tracking urls/web crawling

2006-06-28 Thread Mauro Tortonesi

Tony Lewis wrote:
Bruce wrote: 



any idea as to who's working on this feature?


Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started working on the feature or not.


yes. i haven't started coding it yet, though. i am still working on the 
last fixes for recursive spider mode.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: License of wget.texi: suggest removal of invariant sections

2006-06-28 Thread Mauro Tortonesi

Noèl Köthe wrote:

Am Montag, den 12.06.2006, 15:17 -0700 schrieb Don Armstrong:

Hello Hrvoje and Mauro,


I understand and agree with the reasoning behind removing the GPL as
the invariant section; but why also remove the GFDL as an invariant
section?


That's just because having the GFDL as an invariant section is a null
op; the GFDL itself already requires that it be included, and no one
can change it anyway (save from going to a later version if your
copyright statement allows it.) If it were to stay as an invariant
section, I don't think it would cause a problem for Debian, but I
really don't see any reason from your perspective to do so.

I suggested removing them both because I figured if you were to modify
it at all, you may as well just modify it once.

Thanks for working with us on this issue!


I checked 1.11 alpha1 and svn trunk but both are still there. Do you
already have decided to remove GFDL from this section, too?


yes. i've just talked with hrvoje about it, and we reached consensus on 
changing both the GPL and the GFDL sections from invariant to normal.



Just for your info and not to hurry you:
Debian will start freezing in July and, because wget is an important part,
I want to have it resolved before then. :)


i'll do my best to release wget 1.11 ASAP. IIRC, fedora should freeze in 
July as well. it would be nice to have wget 1.11 included in both the 
new versions of debian and fedora.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget issues on Solaris

2006-06-19 Thread Mauro Tortonesi

Kommineni, Devendra wrote:


If invoked using the DNS alias for the test cluster, it fails after
several retries. (we have two servers in the cluster)


that's weird. it seems that the TCP connection is correctly established 
but is dropped after the HTTP request is sent. maybe there's something 
wrong at the HTTP level. perhaps you could turn on the -S option to 
examine HTTP requests and responses?



The problem does not occur on the linux nodes  ( with wget 1.10.1).


what do you mean? are you saying that the problem arises only with a 
specific version of wget (possibly prior to 1.10.1) or with a specific 
non-linux platform?



--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


wget 1.11 alpha 1 released

2006-06-13 Thread Mauro Tortonesi


hi to everybody,

i've just released wget 1.11 alpha 1:

ftp://alpha.gnu.org/pub/pub/gnu/wget/wget-1.11-alpha-1.tar.gz

you're very welcome to try it and report every bug you might encounter.


with this release, the development cycle for 1.11 officially enters the 
feature freeze state. wget 1.11 final will be released when all the 
following tasks are completed:


1) win32 fixes (setlocale, fork)
2) last fixes to -r and --spider
3) update documentation
4) return error/warning if multiple HTTP headers w/ same name are given
5) return error/warning if conflicting options are given
6) fix Saving to: output in case -O is given

unfortunately, this means that all the planned major changes (gnunet 
support, advanced URL filtering w/ regex, etc...) will have to wait until 
1.12. however, i think that the many important features and bugfixes 
recently commited into the trunk more than justify the new, upcoming 1.11 
release.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget 1.11 alpha 1 released

2006-06-13 Thread Mauro Tortonesi

Steven M. Schweda wrote:

From: Mauro Tortonesi


ftp://alpha.gnu.org/pub/pub/gnu/wget/wget-1.11-alpha-1.tar.gz


   I assume that it would be pointless to look for the VMS changes here,
but feel free to amaze me.


i promise we'll seriously talk about merging your VMS changes into wget 
at the beginning of the 1.12 development cycle.


you'll be very welcome to convince me about the soundness of your code 
and the need to merge VMS support into wget via your favorite IM tool:


http://www.tortonesi.com/contactme.shtml

however, for the moment i have to focus on the 1.11 release.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget and proxy (again)

2006-05-26 Thread Mauro Tortonesi

Leonardo wrote:

Hi all,
 I have a PC behind a proxy and I'm not able to wget files, nor
ping (although correct addresses are found : PING
www.l.google.com (66.249.85.99)...) , but I can surf if I set
the correct proxies settings.
I've read the F. wget manual and searched the net, but maybe
it's just I don't understand.

I do:  wget
ftp://ftp.gentoo.mesh-solutions.com/gentoo/snapshots/portage-20060525.tar.bz2.md5sum

and have the variables:
http_proxy=http://www-proxy.physi.uni-heidelberg.de:3128
ftp_proxy=ftp://www-proxy.physi.uni-heidelberg.de:3128

What I get is:
ftp://ftp.gentoo.mesh-solutions.com/gentoo/snapshots/portage-20060525.tar.bz2.md5sum
   = `portage-20060525.tar.bz2.md5sum'
Resolving www-proxy.physi.uni-heidelberg.de... 129.206.32.243
Connecting to
www-proxy.physi.uni-heidelberg.de|129.206.32.243|:3128...
connected.
Logging in as anonymous ... 
Error in server response, closing control connection.

Retrying.

and so on and on. I tried also 
passive_ftp = on and/or   use_proxy = on

on /etc/wgetrc  without result.

What do I do wrong?


hi leonardo,

could you please tell us which version of wget you are using and post 
the output of wget with the -S and -d options turned on?


it's impossible to figure out what the problem is from the limited output 
you provided.
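
i.e. something like this, reusing your original URL and writing everything to 
a log file you can post:

  wget -S -d -o wget-debug.log \
    ftp://ftp.gentoo.mesh-solutions.com/gentoo/snapshots/portage-20060525.tar.bz2.md5sum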


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Out of Memory Error

2006-05-25 Thread Mauro Tortonesi

[EMAIL PROTECTED] wrote:

I ran wget (1.9.1) on Debian GNU/Linux to find out how many links my site had, 
and after Queue count 66246, maxcount 66247 links, the wget process ran out of 
memory. Is there a way to set the persistent state to disk instead of memory so 
that all the system memory and cache is not slowly consumed until the process 
halts? My site may have 1 M to 2 M links.


hi oscar,

exactly how much memory does wget take? could you please check whether the most 
recent version of wget (1.10.2) gives you the same problem?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: WGET -O Help

2006-05-25 Thread Mauro Tortonesi

Steven M. Schweda wrote:


   But the real question is: If a Web page has links to other files, how
is Wget supposed to package all that stuff into _one_ file (which _is_
what -O will do), and still make any sense out of it?


even more, how is Wget supposed to properly postprocess the saved data, 
which can as well be a combination of HTML pages and binary files?


from my perspective the main problem with -O is that wget users seem not 
to understand its semantics. -O behaves like a stdio redirection in the 
shell (or a pipeline concatenation, as in "wget -O - | someothercommand"), 
and has some non-negligible limitations (e.g. in postprocessing of the 
saved data). -O was never meant to provide "rename saved files after 
download and postprocessing" semantics.
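
for instance, the kind of usage -O is designed for is something like this 
(the URL is just an example):

  wget -O - http://www.example.com/archive.tar.gz | tar xzf -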


perhaps we should make this clear in the manpage and provide an 
additional option which just renames saved files after download and 
postprocessing according to a given pattern. IIRC, hrvoje was just 
suggesting to do this some time ago. what do you guys think?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Exclude directorie

2006-05-19 Thread Mauro Tortonesi

Antoine Bonnefoy wrote:

Hy,
I found a bug with the option -X in recursive mode. When i use a wildcard
in the exclude string, it only works for a single path level.
For example:
for this directory architecture :
server:
  =level1
   = Data
   = level2
= Data

wget -X */Data -r http://server/level1/
works correctly for exclude directory Data
but don't exclude the Data directory in level2

The bug comes from the fnmatch function.
I fixed it for myself in the utils.c file by deactivating the flag
FNM_PATHNAME in the proclist() function.

Is that the right behaviour?

I hope this helps

Excuse me for my English


hi antoine,

could you please tell us which version of wget you are using? after the 
release of 1.10.2 i have merged a patch that fixed a few bugs in -X 
support, so you might want to try the current version of wget available 
from our subversion repository:


http://www.gnu.org/software/wget/wgetdev.html#development

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Redirect makes wget fetch another domain

2006-05-19 Thread Mauro Tortonesi

Equipe web wrote:

I've come across this annoying bug :
Even though wget is told not to span other hosts, it does when  
redirected !!!


This bug has been waiting for a fix for quite a long time :

http://www.mail-archive.com/wget@sunsite.dk/msg01675.html

I don't know how to make things change as I'm not a programmer myself...


hi luc,

thank you very much for your bug report. which version of wget are you 
using? i have recently merged a couple of patches that fixed a few bugs, 
so you might want to try the current version of wget available from our 
subversion repository:


http://www.gnu.org/software/wget/wgetdev.html#development

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: [Fwd: Bug#366434: wget: Multiple 'Pragma:' headers not supported]

2006-05-19 Thread Mauro Tortonesi

Noèl Köthe wrote:

Hello,

a forwarded report from http://bugs.debian.org/366434

could this behaviour be added to the doc/manpage?


i wonder if it makes sense to add generic support for multiple headers 
in wget, for instance by extending the --header option like this:


wget --header="Pragma: xxx" --header="dontoverride,Pragma: xxx2" someurl

as an alternative, we could choose to support multiple headers only for 
a few header types, like Pragma. however, i don't really like this 
second choice, as it would require hardcoding the above-mentioned 
header names in the wget sources, which IMVHO is a *VERY* bad practice.


what do you think?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: -O switch always overwrites output file

2006-05-19 Thread Mauro Tortonesi

Toni Casueps wrote:
I use Wget 1.10 for Linux. If I use -O and there is already a file in 
the current directory with the same name, it overwrites it, even if I use 
-nc. Is this a bug or intentional?


IMVHO, this is a bug. if hrvoje does not provide a rationale for this 
behavior, i will fix it before the release of wget 1.11 (which should be 
pretty soon).


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wrong exit code

2006-05-19 Thread Mauro Tortonesi

Lars Wilke wrote:

Hi,

first this is not a real bug and is more like a wishlist item.

So the problem:

When invoking wget to retrieve a file via ftp all is fine
if the file exists and wget is able to retrieve it. The return
code from wget is 0. If the file is not found on the server the
return code is 1. Good.

I expected that wget would behave the same when using file globbing.
If the file can be found via a pattern and can be downloaded
wget returns with 0. But if the file can not be found after
successfully retrieving a directory listing wget returns with 0, too!
IMHO here wget should exit with the same error code (1) as above.

I searched the docs to see if this behaviour is mentioned somewhere but
have not found it. Therefore i am sending this email.
Sorry if i missed this detail mentioned somewhere.


hi lars,

unfortunately one of wget's weak points is its lack of consistency for 
the returned error codes. after the release of wget 1.11, i am planning 
some major architectural change for wget. that will be the best time to 
 redesign the code which handles returned error values.
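
for reference, this is about all a script can do today (the URL and pattern 
are just examples), and the glob-no-match case you describe still exits 0:

  wget "ftp://ftp.example.com/pub/file-*.txt"
  if [ $? -ne 0 ]; then
    echo "wget reported a failure" >&2
  fi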


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Missing K/s rate on download of 13MB file

2006-05-19 Thread Mauro Tortonesi

J. Grant wrote:

Hi,

On 14/05/06 21:26, Hrvoje Niksic wrote:


J. Grant [EMAIL PROTECTED] writes:


Could an extra value be added which lists the average rate? average
rate: xx.xx K/s ?


Unfortunately it would have problems fitting on the line.


Perhaps the progress bar would be reduced?


i don't think that would be a good idea.


or the default changed to be the average rate?


i don't think that would be a good idea either. but...

or if neither of those are suitable, could a conf file setting be added 
so we can switch between average rate, and current rate?


...this is an interesting proposal. however, my todo list is already 
*HUGE* and grows larger every day. so i really doubt i will have time to 
implement this feature (at least for the next months). you're very 
welcome to proceed w/ the development of configurable average 
calculation code and send me a patch, though.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: recursive download

2006-05-19 Thread Mauro Tortonesi

Ajar Taknev wrote:

Hi,

I am trying to recursively download from an ftp site without success.
I am behind a squid proxy and I have set up the .wgetrc correctly. When
I do wget -r ftp://ftp.somesite.com/dir it fetches
ftp.somesite.com/dir/index.html and exits. It doesn't do a recursive
download. When I do the same thing from a machine which is not behind
a proxy the same command does a recursive download. proxy is
squid-2.5.STABLE9 and wget version is  1.10.2. Any ideas what the
problem could be?


recursive FTP retrieval through HTTP proxies has been broken for a long 
time. i have received a patch that should fix the problem some time ago, 
but i haven't been able to test it yet. however, this is one of the 
pending bugs that will be fixed before the upcoming 1.11 release.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: links to follow

2006-05-19 Thread Mauro Tortonesi

Andrea Rimicci wrote:

Hi all,
I'd like to retrieve a web document where some links are coded in
javascript calls, so I'd like to instruct wget that when something like
JSfunc('my/link/to/follow/') is matched, it should recognize
'my/link/to/follow/' as a link to follow.

Is there any way to accomplish this?
Maybe using regexps, to setup which patterns will trigger the link,
will be great.

TIA, Andrea

P.S. dunno if this was already discussed; I've not found any previous 
post with 'follow' in the subject.


hi andrea,

wget does not support parsing of javascript code at the moment, nor 
regexps on downloaded file content. however, we are planning to add 
support for regexps in wget 1.12, and possibly for external url parsers.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: fixed recursive ftp download over proxy and 1.10.3

2006-05-19 Thread Mauro Tortonesi

[EMAIL PROTECTED] wrote:

Hi,


I have been hampered by the ftp-over-http-proxy bug for quite a while: 1.5 years. 


I was very happy to learn that someone had developped a patch.
Happier to read that you would merge it shortly.

Do you know when you will be able to publish this 1.10.3 release ?


1.10.3 will never be released. the next version of wget will be 1.11, 
and i hope i will be able to release it by the end of june.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: [Fwd: Bug#366434: wget: Multiple 'Pragma:' headers not suppor ted]

2006-05-19 Thread Mauro Tortonesi

Herold Heiko wrote:

From: Mauro Tortonesi [mailto:[EMAIL PROTECTED]
i wonder if it makes sense to add generic support for 
multiple headers 
in wget, for instance by extending the --header option like this:


wget --header=Pragma: xxx --header=dontoverride,Pragma: 
xxx2 someurl



That could be a problem if you need to send a really weird custom header
named dontoverride,Pragma. Probability is near nil but with the whole big
bad internet waiting maybe separating switches (--header and --header-add)
would be better.


you're right. in fact, i like hrvoje's --append-header proposal better.
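
under that proposal, the example from my previous message would presumably 
become something like this (hypothetical syntax, nothing is implemented yet):

  wget --header="Pragma: xxx" --append-header="Pragma: xxx2" someurl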

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Wget mirror restrict asp variables

2006-05-08 Thread Mauro Tortonesi

MarK wrote:

Hi,

how can I mirror a set of pages of a site restricting one ore more variable 
defined in the URL?


For example:
http://www.thesite.com/page.apsx?f=123

I want to mirror the site starting from this page and all the linked pages 
only if they have f=123


I tried 
wget -m -k -E -A*f=123 http://www.thesite.com/page.aspx?f=123


but this only download that page. Removing the -A option wget download the 
whole site.


unfortunately, at the moment wget does not allow you to restrict the set 
of downloaded files according to a specific query value. this feature is 
planned for a next release, though.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget www.openbc.com post-data/cookie problem

2006-05-03 Thread Mauro Tortonesi

Erich Steinboeck wrote:

Mauro Tortonesi wrote:

this might be a problem with your server. could you please provide us 
with the output of wget with the -S option turned on?


[...]



---response begin---
HTTP/1.1 200 OK
Date: Tue, 02 May 2006 15:01:45 GMT
Server: Apache
Expires: Now
Pragma: no-cache
Cache-control: private
Connection: close
Content-Type: text/html; charset=UTF-8


hi erich,

as you can see, the problem is with the web server, which does not return 
a cookie (by means of the Set-Cookie header) to wget.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Feature Request: Metalink support (mirror list file)

2006-05-02 Thread Mauro Tortonesi

anthony l. bryan wrote:

Hi,

I realize this may be out of the scope of wget, so I hope I don't offend
anyone.


that's a very interesting proposal, actually. but is metalink a widely 
used format to describe resource availability from multiple URLs?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: Rtsp/mms support

2006-05-02 Thread Mauro Tortonesi

Ryan Golhar wrote:

I've seen several messages about this, but haven't determined if this
will be implemeneted or not

Will wget support rtsp (and/or mms)?  If not, I'd be intersted in
implementing it.


i am very interested in adding both rtsp and mms support to wget. 
however, since this might require significant changes and i am planning 
a major overhaul of wget's architecture, for the moment i think i will 
stick to my bugfixing and redesign tasks and leave rtsp/mms support for 
later.


however, you're very welcome to take care of realizing rtsp/mms support 
for wget. in case you will take this task, please let me know so that we 
can coordinate and decide together the best design for this improvement.



BTW, today i have taken a very brief look at the source code of mplayer. 
in the libmpdemux directory there is some code that we could borrow to 
speed up rtsp/mms development.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget www.openbc.com post-data/cookie problem

2006-05-02 Thread Mauro Tortonesi

Erich Steinboeck wrote:
Being new to wget (I'm using GNU Wget 1.10.2 for Windows) I'm trying to 
log into www.openbc.com.  It works perfectly with a browser, but I can't 
get it to work with wget.


...


Can anyone help?  What am I doing wrong here?  Thanks!!


this might be a problem with your server. could you please provide us 
with the output of wget with the -S option turned on?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget: no support for CSS-Background-images

2006-04-04 Thread Mauro Tortonesi

Michael Probst wrote:

Hi Folks,

thank you for all the work you have been spending on wget!

I found a little thing, though:
My version (GNU Wget 1.9+cvs-dev) will not support css background images.
Take a look at the example I'm sending along. If you put it into a httpdocs 
directory of a web server and try to get it via

wget -r http://localhost/_temp_css_wget_problem/
there will be no background image at the end.


hi micheal,

it's a known problem. wget does not parse css stylesheets or javascript 
code for urls.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: problems after upgrading to fedora core 5

2006-04-03 Thread Mauro Tortonesi

cliff wrote:

Good news for wget. Building from the source worked.
So for some reason, either my system is screwed or the binary with FC5 was
misbuilt. Seems hard to believe the latter, but this box was a pretty bare,
standard FC3 and was just a straight, easy upgrade to FC5.


that's very weird. i've just taken a look at fedora's wget-1.10.2-3.2.1 
RPM binary (i suppose that's the version you are using; could you please 
check with the rpm -q wget command?) and it does not include any 
patch that could modify wget's default behaviour.



In either case do you know of any settings that would affect this so I
might better know what to say/search on the fedora lists?


not really. the behaviour you described is very strange. the only thing 
i can think of is a misconfigured /etc/nsswitch.conf, with the

hosts: dns6

line.
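
(for comparison, a standard /etc/nsswitch.conf usually carries something like

hosts: files dns

on that line, so a lone dns6 entry, or anything else unusual there, is
worth checking first.)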

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: wget stopped working

2006-04-03 Thread Mauro Tortonesi

Jana Mccoy wrote:

wget stopped working after I downloaded the bc functions.


hi jana,

what exactly are the bc functions you're talking about? are they 
related to wget in any way?



What is an ERROR -1: Malformed status line?


it means wget failed to parse the HTTP response returned by the web server.
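
(the status line is the very first line of the server's reply, e.g.

HTTP/1.1 200 OK

if what comes back does not follow that HTTP-version / status-code /
reason-phrase format, wget cannot parse it and reports exactly this error.)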


Here's what I'm entering and the reply:

$ wget  http://www.yahoo.com
--21:51:21--  http://www.yahoo.com/
           => `index.html'
Resolving www.yahoo.com... 68.142.226.42, 68.142.226.48, 68.142.226.33, ...
Connecting to www.yahoo.com|68.142.226.42|:80... connected.
HTTP request sent, awaiting response... -1
21:51:21 ERROR -1: Malformed status line.


could you please tell us which version of wget you are using and send us 
the result of wget -v -d http://www.yahoo.com?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: error in the french help translation of wget help

2006-04-03 Thread Mauro Tortonesi

nicolas figaro wrote:

Hi,

there is a mistake in the french translation of wget --help (on linux 
redhat).


in english :
wget --help | grep spider
  --spider  don't download anything

was translated in french this way :
wget --help | grep spider
  --spider  ne pas télécharger n'importe quoi.

an english translation could be :
don't download anything weird

a correct translation could have been :
ne rien télécharger
ne télécharger aucun fichier

but with the recent french law, this message makes wget a very 
interesting and smart tool.


hi nicolas,

as wget's development webpage states:

http://www.gnu.org/software/wget/wgetdev.html

the coordination of translation efforts for GNU tools is done by the 
Translation Project:


http://www.iro.umontreal.ca/translation/

you should contact them (i have included them in CC) to report errors in 
the current french translation of wget.


thank you very much for your help.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Tony Lewis [EMAIL PROTECTED] writes:


I don't think ,r complicates the command that much. Internally,
the only additional work for supporting both globs and regular
expressions is a function that converts a glob into a regexp when
,r is not requested.  That's a straightforward transformation.


,r makes it harder to input regexps, which are the whole point of
introducing --filter.  Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.


i agree 100%. and don't forget that globs are already supported by the 
current filtering options.
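
for the record, the glob-to-regexp conversion tony mentions really is a
trivial transformation. a rough sketch (illustrative only, not code from
wget, and ignoring bracket expressions) could look like this:

/* Turn a shell-style glob into an anchored POSIX extended regexp.
   Handles only '*' and '?', and quotes regexp metacharacters.
   OUT must hold at least 2 * strlen (glob) + 3 bytes. */
static void
glob_to_regexp (const char *glob, char *out)
{
  char *p = out;
  *p++ = '^';
  for (; *glob; glob++)
    switch (*glob)
      {
      case '*': *p++ = '.'; *p++ = '*'; break;   /* any sequence  */
      case '?': *p++ = '.'; break;               /* any character */
      case '.': case '\\': case '+': case '(': case ')': case '{':
      case '}': case '[': case ']': case '^': case '$': case '|':
        *p++ = '\\'; *p++ = *glob; break;        /* quote metacharacters */
      default:  *p++ = *glob;
      }
  *p++ = '$';
  *p = '\0';
}

so the objection is not about implementation difficulty, but about having
two different syntaxes behind the same switch.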


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Curtis Hatter wrote:

On Friday 31 March 2006 06:52, Mauro Tortonesi:


while i like the idea of supporting modifiers like quick (short
circuit) and maybe i (case insensitive comparison), i think that (?i:)
and (?-i:) constructs would be overkill and rather hard to implement.


I figured that the (?i:) and (?-i:) constructs would be provided by the 
regular expression engine and that the --filter switch would simply be able 
to use any construct provided by that engine.


i know, that would be really nice.

If, as you said, this would be hard to implement or require extra effort by 
you that is above and beyond that required for the more standard constructs 
then I would say that they shouldn't be implemented; at least at first.


i agree.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: can't recurse if no index.html

2006-04-03 Thread Mauro Tortonesi

Dan Jacobson wrote:

I notice with server created directory listings, one can't recurse.
$ lynx -dump http://localhost/~jidanni/test|head
Index of /~jidanni/test
 Icon   [1]Name  [2]Last modified  [3]Size  [4]Description
  ___
 [DIR]  [5]Parent Directory -
 [TXT]  [8]cd.html 23-Feb-2006 20:55  931
$ wget --spider -S -r http://localhost/~jidanni/test/
localhost/~jidanni/test/index.html: No such file or directory


hi dan,

unfortunately, --spider is broken when used with -r. i am working on 
fixing this bug right now.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: problem with downloading when HREF has ../

2006-04-03 Thread Mauro Tortonesi

Vladimir Volovich wrote:

MT == Mauro Tortonesi writes:

  I addressed this bug in wget few months ago.  See the fix here:
  
  http://www.mail-archive.com/wget@sunsite.dk/msg08516.html


 MT hi frank,

 MT i am going to test and apply your patch later this week, as well
 MT as many other pending patches. unfortunately i am still working
 MT on my ph.d.  thesis at the moment, so i don't have much time to
 MT work on wget.  however, since i believe my thesis should be ready
 MT tomorrow or wednesday at most, i am planning to spend the rest of
 MT the week to catch up with wget.

are there any news on the wget update?


hrvoje fixed this problem more than one month ago. from the ChangeLog:


2006-02-27  Hrvoje Niksic  [EMAIL PROTECTED]

* url.c (path_simplify): Don't preserve .. at beginning of path.
Suggested by Frank McCown.
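
(in practice, if i read the change correctly, a leading ../ in a simplified
path is now dropped instead of being preserved, so a reference like
../foo.html found at the top level of a site resolves to foo.html rather
than producing a path that escapes the starting directory.)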


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


--spider and -r

2006-04-03 Thread Mauro Tortonesi


dan jacobson recently reported a bug with --spider and -r:

http://www.mail-archive.com/wget@sunsite.dk/msg08797.html

hrvoje confirms this bug has been in wget for a long time, mainly 
because the semantics of --spider and -r were never properly defined.


from my point of view, it makes sense that when a user specifies both 
--spider and -r, wget:


1) downloads resources according to -r, printing an error (with the
   non-existing url and referer) in case of non-existing URLs
2) parses downloaded resources for urls
3) deletes downloaded resources

do you think this is the correct semantics for --spider and -r? am i 
missing something here?


(notice that there are significant similarities with the behaviour of -r 
and --delete-after, with the only exception of printing errors in case 
of non existing URLs.)


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: problem with downloading when HREF has ../

2006-04-03 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Vladimir Volovich [EMAIL PROTECTED] writes:


MT == Mauro Tortonesi writes:

 are there any news on the wget update?

MT hrvoje fixed this problem more than one month ago. from the
MT ChangeLog:

i don't see the official source at ftp.gnu.org/gnu/wget/

that's what i'm asking about.


The fix will appear in the next release, 1.11.  Mauro's paragraph you
quoted (beginning with i am going to test and apply your patch later
this week) referred to applying the patch to the version control
repository, not to the timeframe of releasing 1.11.

It is my understanding that 1.11 will be released within the next
couple of months; Mauro might give a more precise date.


wget 1.11 will definitely be released in the next couple of months, but 
i can't be more precise at the moment. originally, i was thinking about 
adding regex and gnunet support and fixing gnutls support in that 
release. now i am considering whether to delay these new features until 
1.12 and focus instead on fixing the incredible number of recently 
reported bugs.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Scott Scriven wrote:

* Mauro Tortonesi [EMAIL PROTECTED] wrote:


wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match www.yoyodyne.com, www--.yoyodyne.com,
www---.yoyodyne.com, and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com.

It would most likely also match www---zyoyodyneXcom.  


yes.


Perhaps you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.


no. i was talking about regexps. they are more expressive and powerful 
than simple globs. i don't see the point in supporting both.



If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[:digit:] or [0-9].


i agree, but adding a dependency on PCRE to wget is asking for 
endless maintenance nightmares. and i don't know if we can simply 
bundle code from PCRE in wget, as it has a BSD license.



--filter=[+|-][file|path|domain]:REGEXP

is it consistent? is it flawed? is there a more convenient one?


It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a raw
type in addition to file, domain, etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.


do you mean you would like to have a regex class working on the content 
of downloaded files as well?



Below is the original message I sent to the wget list a few
months ago, about this same topic:

=
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any add to cart or buy links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a 
href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436&amp;g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_returnName=album"
 class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  allow is + for include or - for exclude.
  It defaults to + if omitted.

  flags, is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, i
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of /foo/i, it would be --filter=+i,:foo

  scope controls how much of the <a> or similar tag gets used
  as input to the regex.  Values include:
raw: use the entire tag and all contents (default)
 <a href="/path/to/foo.ext">bar</a>
domain: use only the domain name
 www.example.com
file: use only the file name
 foo.ext
path: use the directory, but not the file name
 /path/to
others...  can be added as desired

  : is required if allow or flags or scope is given

So, for example, to exclude the add to cart links in my
previous post, this could be used:

  --filter=-raw:'AddToCart|add to cart'
or
  --filter=-raw:AddToCart\|add\ to\ cart
or
  --filter=-:'AddToCart|add to cart'
or
  --filter=-i,raw:'add ?to ?cart'

Alternately, the --filter option could be split into two options:
one for including content, and one for excluding.  This would be
more consistent with wget's existing parameters, and would
slightly simplify the syntax.

I hope I haven't been too full of hot air.  This is a feature I've
wanted in wget for a long time, and I'm a bit excited that it
might happen soon.  :)


i don't like your raw proposal as it is HTML-specific. i would like 
instead to develop a mechanism which could work for all supported 
protocols.

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Mauro Tortonesi [EMAIL PROTECTED] writes:



Scott Scriven wrote:


* Mauro Tortonesi [EMAIL PROTECTED] wrote:



wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match www.yoyodyne.com, www--.yoyodyne.com,
www---.yoyodyne.com, and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com. 


Why not?


i may be wrong, but if - is not a special character, the previous 
expression should match only domains starting with www- and ending in 
[randomchar]yoyodyne[randomchar]com.
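
the quickest way to settle questions like this is to feed the pattern to a
regexp engine directly. a throwaway tester along these lines (plain POSIX
extended regexps, nothing wget-specific) will do:

/* Compile PATTERN as a POSIX extended regexp and report whether each
   candidate string matches.  Build with: cc test.c -o test */
#include <stdio.h>
#include <regex.h>

int
main (void)
{
  const char *pattern = "www-*.yoyodyne.com";
  const char *candidates[] = { "www.yoyodyne.com",
                               "www-foo.yoyodyne.com",
                               "www---zyoyodyneXcom" };
  regex_t re;
  size_t i;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    {
      fprintf (stderr, "bad pattern\n");
      return 1;
    }
  for (i = 0; i < sizeof candidates / sizeof *candidates; i++)
    printf ("%-25s %s\n", candidates[i],
            regexec (&re, candidates[i], 0, NULL, 0) == 0
            ? "matches" : "does not match");
  regfree (&re);
  return 0;
}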


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Oliver Schulze L. wrote:

Hrvoje Niksic wrote:


 The regexp APIs found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.


My personal idea on this is to enable regex on Unix and disable it on 
Windows.


We all use Unix/Linux and regex is really useful. I think not having 
regex on Windows will not do any more harm than it is doing now (not 
having it at all).


for consistency and to avoid maintenance problems, i would like wget to 
have the same behavior on windows and unix. please notice that if we 
implemented regex support only on unix, windows binaries of wget built 
with cygwin would have regex support but native binaries wouldn't. that 
would be very confusing for windows users, IMHO.


I hope wget can get connection cache, 


this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing connection cache data structure.
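
just to give an idea of what i mean by connection cache: a minimal sketch
of such a data structure (purely illustrative, not the code i mentioned)
could be

/* Keep a small table of open sockets keyed by host/port/scheme, so
   consecutive downloads from the same server can reuse the connection. */
#include <string.h>
#include <strings.h>

#define CONN_CACHE_SIZE 4

struct cached_conn
{
  int  used;                    /* is this slot occupied?  */
  char host[256];
  int  port;
  int  ssl;                     /* https vs. plain http    */
  int  fd;                      /* the open socket         */
};

static struct cached_conn conn_cache[CONN_CACHE_SIZE];

/* Look up an open connection to HOST:PORT; return its fd, or -1. */
static int
conn_cache_get (const char *host, int port, int ssl)
{
  int i;
  for (i = 0; i < CONN_CACHE_SIZE; i++)
    if (conn_cache[i].used
        && conn_cache[i].port == port
        && conn_cache[i].ssl == ssl
        && !strcasecmp (conn_cache[i].host, host))
      {
        conn_cache[i].used = 0;   /* hand the socket back to the caller */
        return conn_cache[i].fd;
      }
  return -1;
}

/* Remember an open connection for later reuse (first free slot wins);
   if the cache is full, the caller should simply close FD. */
static void
conn_cache_put (const char *host, int port, int ssl, int fd)
{
  int i;
  for (i = 0; i < CONN_CACHE_SIZE; i++)
    if (!conn_cache[i].used)
      {
        conn_cache[i].used = 1;
        strncpy (conn_cache[i].host, host, sizeof conn_cache[i].host - 1);
        conn_cache[i].host[sizeof conn_cache[i].host - 1] = '\0';
        conn_cache[i].port = port;
        conn_cache[i].ssl = ssl;
        conn_cache[i].fd = fd;
        return;
      }
}

on top of this, the real thing would also have to cope with SSL contexts
and with servers closing idle connections.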


URL regex 


this is planned for wget 1.11. i've already started working on it.


and advanced mirror functions (sync 2 folders) in the near future.


this is very interesting.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it

