Hello wget users,

This is not a bug report, but I understand that this mailing list may also be used for user questions.
I want to archive parts of a website (www.pokerstrategy.com) and make them available locally, including images, videos, PDFs etc. The site requires a login to access the content, but I have already figured that part out. The website exists in different languages, with a separate sub-domain for each language (e.g. de.pokerstrategy.com, fr.pokerstrategy.com), and I am only interested in one of them. The website is very big, so I don't want to download everything. Fortunately, the HTML documents of all pages I'm interested in live in one folder (or its subfolders); a small portion of the site that can be used to demonstrate my problem is www.pokerstrategy.com/strategy/live-poker. Unfortunately, the media files are spread across a limited number of other domains (static.pokerstrategycdn.com, peacock.pokerstrategy.com etc.) as well as a different folder on the same server (www.pokerstrategy.com/downloads).

So what I need to do is:

* from the start URL, descend into sub-folders (e.g. /strategy/live-poker -> /strategy/live-poker/1022), but never ascend to parent or sibling folders
* download CSS styles too
* download any media (jpg, jpeg, png, gif, flv, wmf, avi, mpg, mpeg, pdf etc.), even if it is located on a different domain
* do not follow any cross-domain links/references, EXCEPT those pointing to media files
* make everything completely available offline, including styles and media; only links to files/documents that were not downloaded should still point to the original URL
* adjust extensions if necessary
* use cookies.txt from the local folder

I tried different options for wget, but now I'm stuck. For example, I tried:

wget --tries=3 --retry-connrefused --no-clobber --load-cookies=cookies.txt --convert-links --page-requisites --adjust-extension --recursive --include-directories /strategy/live-poker,/download http://www.pokerstrategy.com/strategy/live-poker

This correctly downloads only the HTML documents I want and also gets the media files from the /download folder, but:

- it does not modify the HTML so that <img> tags point to the downloaded files (it does, however, modify <a> tags that link to local HTML documents)
- it does not get media files from other domains; if, for example, I add --span-hosts, it simply gets too much (documents from all the language versions of the website that I don't need)

Note: for the example URL provided here you won't need to log in, so the --load-cookies option can be omitted.

Any help would be greatly appreciated.

Kind regards,
Alexander
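P.S. In case it helps to see what I have in mind, this is roughly the direction I was thinking of trying next. The host names listed after --domains are simply the media domains mentioned above, and I am not at all sure the options interact the way I hope, in particular whether --include-directories also restricts the allowed paths on those other hosts:

  wget --tries=3 --retry-connrefused --no-clobber \
       --load-cookies=cookies.txt \
       --recursive --no-parent \
       --page-requisites --convert-links --adjust-extension \
       --span-hosts \
       --domains=www.pokerstrategy.com,static.pokerstrategycdn.com,peacock.pokerstrategy.com \
       --include-directories=/strategy/live-poker,/download \
       http://www.pokerstrategy.com/strategy/live-poker

If --include-directories is applied on every host, I guess it would also block the media paths on the CDN domains, which is exactly the restriction I don't know how to express.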
