On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote:
> Hi
>
> It seems that links on that Wikipedia page follow the structure :
> <a href="..." title="...">
>
> You could extract a list of link titles with something like :
> re.findall( r'\<a[^>]+title="(.+?)"', html )
>
> HTH,
>
> -Grobu-
>
>
> On 25/11/15 21:55, MRAB wrote:
> > On 2015-11-25 20:42, ryguy7272 wrote:
> >> Hello experts. I'm looking at this url:
> >> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
> >>
> >> I'm trying to figure out how to list all 'a title' elements. For
> >> instance, I see the following:
> >> <a title="Accident, Maryland"
> >> href="/wiki/Accident,_Maryland">Accident</a>
> >> <a class="new" title="Ala-Lemu (page does not exist)"
> >> href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
> >> <a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
> >> <a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse
> >> Peaks</a>
> >>
> >> So, I tried putting a script together to get 'title'. Here's my attempt.
> >>
> >> import requests
> >> import sys
> >> from bs4 import BeautifulSoup
> >>
> >> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> >> source_code = requests.get(url)
> >> plain_text = source_code.text
> >> soup = BeautifulSoup(plain_text)
> >> for link in soup.findAll('title'):
> >>     print(link)
> >>
> >> All that does is get the title of the page. I tried to get the links
> >> from that url, with this script.
> >>
> > A 'title' element has the form "<title ...>". What you should be looking
> > for are 'a' elements, those of the form "<a ...>".
> >
> >> import urllib2
> >> import re
> >>
> >> #connect to a URL
> >> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
> >>
> >>
> >> #read html code
> >> html = website.read()
> >>
> >> #use re.findall to get all the links
> >> links = re.findall('"((http|ftp)s?://.*?)"', html)
> >>
> >> print links
> >>
> >> That doesn't work either. Basically, I'd like to see this.
> >>
> >> Accident
> >> Ala-Lemu
> >> Alert
> >> Apocalypse Peaks
> >> Athol
> >> Å
> >> Barbecue
> >> Båstad
> >> Bastardstown
> >> Batman
> >> Bathmen (Battem), Netherlands
> >> ...
> >> Worms
> >> Yell
> >> Zigzag
> >> Zzyzx
> >>
> >> How can I do that?
> >> Thanks all!!

Thanks!! Is that a regex? Can you explain exactly what it is doing? It also seems to pick up a lot more than just the list I wanted, but that's OK, I can see why it does that.
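
For reference, here is one way to read the pattern Grobu suggested, piece by piece. This is only an illustrative sketch, not part of the original replies: it runs the pattern against the four sample anchors quoted above instead of the live page, and the names `html`, `PATTERN` and `VERBOSE_PATTERN` are made up for the example.

```python
import re

# Sample markup copied from the snippets quoted earlier in the thread
# (an assumption: the live page may differ).
html = '''
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
'''

# The pattern exactly as posted by Grobu.
PATTERN = r'\<a[^>]+title="(.+?)"'

# The same pattern spelled out with re.VERBOSE, which ignores whitespace
# and allows "#" comments inside the pattern.
VERBOSE_PATTERN = re.compile(r'''
    <a          # the literal characters "<a" that open an anchor tag
    [^>]+       # one or more characters that are not ">", so we stay inside the tag
    title="     # the literal text title="
    (.+?)       # capture group: as few characters as possible ...
    "           # ... up to the next double quote
''', re.VERBOSE)

# With a single capture group, findall returns only the captured text.
titles = re.findall(PATTERN, html)
verbose_titles = VERBOSE_PATTERN.findall(html)

for title in titles:
    print(title)
```

Note that the capture group holds the value of the `title` attribute ("Accident, Maryland"), not the visible link text ("Accident"). The pattern also matches any tag whose name merely starts with `a` (for example `<abbr title="...">`) and every titled link anywhere on the page, which would be consistent with it returning more than the place-name list.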