cool1...@gmail.com writes:

> Here are some scripts; how do I put them together to create the script
> I want (to search an online document and download all the links in it)?
> P.S.: can I set a destination folder for the downloads?

You can use os.chdir to change into the desired folder before downloading
(create the folder first if it doesn't exist yet).
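For example (a minimal sketch; 'downloads' is just a placeholder name for
the destination folder):

import os

dest = 'downloads'           # hypothetical folder name, change as needed
if not os.path.isdir(dest):
    os.makedirs(dest)        # create the destination folder if it is missing
os.chdir(dest)               # relative open() calls now write into dest
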
>
> urllib.urlopen("http://....")
>
> possible_urls = re.findall(r'\S+:\S+', text)
>
> import urllib2
> response = urllib2.urlopen('http://www.example.com/')
> html = response.read()

If you insist on not using wget, here is a simple script with
BeautifulSoup (v4):

########################################################################
from bs4 import BeautifulSoup
from urllib2 import urlopen
from urlparse import urljoin
import os
import re

os.chdir('OUT')   # the OUT folder must already exist

def generate_filename(url):
    # Strip the URL scheme (http://, https://, ...) and turn the rest
    # into a flat filename by replacing slashes with underscores.
    url = re.sub('^[a-zA-Z0-9+.-]+:/*', '', url)
    return url.replace('/', '_')

URL = "http://www.example.com/"
soup = BeautifulSoup(urlopen(URL).read())

links = soup.select('a[href]')        # every anchor that has an href attribute
for link in links:
    url = urljoin(URL, link['href'])  # resolve relative links against the page URL
    print url
    html = urlopen(url).read()
    fn = generate_filename(url)
    with open(fn, 'wb') as outfile:
        outfile.write(html)
########################################################################

You should add a more intelligent filename generator, filter out mailto:
URLs (and possibly other non-HTTP schemes), and add exception handling for
HTTP errors.
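
A minimal sketch of what that filtering and error handling could look like,
using urlparse to skip non-HTTP schemes such as mailto: (the fetch helper
below is only an illustration, not part of the script above):

from urllib2 import urlopen, HTTPError, URLError
from urlparse import urlparse

def fetch(url):
    # Skip mailto:, javascript:, ftp: and other non-HTTP schemes.
    if urlparse(url).scheme not in ('http', 'https'):
        return None
    try:
        return urlopen(url).read()
    except (HTTPError, URLError) as e:
        print 'skipping %s: %s' % (url, e)
        return None

In the loop above you would then call fetch(url) and only write the file
when it returns something.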
-- 
Piet van Oostrum <p...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]