Jean-Marc MOLINA wrote:
Hello,

I want to archive an HTML page together with "all the files that are
necessary to properly display" it (Wget manual), plus all the linked
images (<a href="linked_image_url"><img src="inlined_image_url"></a>).
I have tried most options and features: recursive archiving, including
and excluding directories and file types... But I can't work out the
right options to archive only the "index.html" page from the following
hierarchy:

/pages/index.html          ; displays image_1.png, links to image_2.png
/pages/page_1.html         ; linked from index.html
/pages/images/image_1.png
/images/image_2.png

Think of image_2.png as a thumbnail of image_1.png; that's why it's so
important to archive it.

The archive I want to end up with:

/pages/index.html
/pages/images/image_1.png
/images/image_2.png

If I use -r -l1 (recursive retrieval, maximum depth 1) together with -p
(--page-requisites: the files needed to display the page), I also get
page_1.html, which I don't want. And it seems that excluding the /pages
directory, or restricting the download to "png" files, has no effect on
the behaviour of the -p option.
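
For reference, the command I have been trying looks roughly like this
(the URL is only a placeholder for my real site):

    wget -r -l1 -p http://example.com/pages/index.html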

How can I force Wget not to archive the page_1.html file? At the very
least, I would like it to clean the unwanted files out of the archive
when it finishes. Also note that the page I am trying to archive links
to many pages that I want to exclude, so I can't afford to clean the
archive manually.

JM.

I'm afraid wget won't do exactly what you want it to do. Future versions of wget may enable you to specify a wildcard to select which files you'd like to download, but I don't know when you can expect that behavior.

In the meantime, I'd recommend writing a script that runs wget -p on the index.html file, then pulls the image links out of the downloaded page and feeds them to wget for retrieval; a rough sketch follows.
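
Something along these lines might serve as a starting point. It is an
untested sketch: the URL and BASE values are placeholders, the .png
extension and the use of GNU grep's -o option are assumptions, and you
will need to adapt all of them to your site.

    #!/bin/sh
    # Rough sketch -- adjust URL/BASE to the real site.
    URL="http://example.com/pages/index.html"   # page to archive
    BASE="http://example.com/pages/"            # directory the page lives in

    # Step 1: fetch index.html plus the files needed to display it.
    #   -p (--page-requisites) pulls in inlined images, stylesheets, ...
    #   -x (--force-directories) keeps the server's directory layout.
    wget -p -x "$URL"

    # Step 2: pull the targets of <a href="..."> links that point at
    # images out of the downloaded page and fetch each one as well.
    # The grep/sed pair is deliberately crude; a real HTML parser would
    # cope better with unusual markup.
    grep -io 'href="[^"]*\.png"' example.com/pages/index.html |
      sed 's/^[Hh][Rr][Ee][Ff]="//; s/"$//' |
      while read -r link; do
        case "$link" in
          http://*) wget -x "$link" ;;                    # absolute URL
          /*)       wget -x "http://example.com$link" ;;  # site-absolute path
          *)        wget -x "$BASE$link" ;;               # relative to the page
        esac
      done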

Regards,
Frank
