Re: Python Web Scrapping : Within href readonly those value that have href in it

Peter Otten Sat, 14 Jan 2017 00:48:13 -0800

[email protected] wrote:

> I am trying to scrape a webpage just for learning. In that webpage there
> are multiple "a" tags. consider the below code
> 
> <a href='\abc\def\jkl'> Something </a>
> 
> <a href ='http:\\www.google.com'> Something</a>


These are probably all forward slashes.

> Now i want to read only those href in which there is http. My Current code
> is
> 
> for link in soup.find_all("a"):
>     print link.get("href")
> 
> i would like to change it to read only http links.

You mean href values that start with "http://"?
While you can do that with a callback

def check_scheme(href):
    return href is not None and href.startswith("http://")

for a in soup.find_all("a", href=check_scheme):
    print(a["href"])

or a regular expression

import re

for a in soup.find_all("a", href=re.compile("^http://")):
    print(a["href"])

why not keep things simple and check before printing? Like

for a in soup.find_all("a"):
    href = a.get("href", "") # empty string if href is missing
    if href.startswith("http://"):
        print(href)
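
If https links should be picked up as well, note that str.startswith() also
accepts a tuple of prefixes. A minimal sketch of that variant follows; the
helper name is_web_link and the sample href values are made up for
illustration and do not come from the original page.

```python
# Hypothetical helper: True for href values starting with http:// or https://.
def is_web_link(href):
    # href is None when the <a> tag has no href attribute at all.
    # str.startswith() accepts a tuple and matches any of its prefixes.
    return href is not None and href.startswith(("http://", "https://"))

# Made-up sample values standing in for what a.get("href") might return.
hrefs = ["/abc/def/jkl", "http://www.google.com", "https://www.python.org", None]
web_links = [h for h in hrefs if is_web_link(h)]
print(web_links)

# With BeautifulSoup the same function should work as the filter callback:
#     for a in soup.find_all("a", href=is_web_link):
#         print(a["href"])
```

The check stays a plain function, so it can be tried on a few strings
before it is pointed at a real page.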


-- 
https://mail.python.org/mailman/listinfo/python-list
