On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
> <anishchapag...@gmail.com> wrote:
>
> > Hi,
> > I was trying to extract wikipedia Infobox contents which is in format
> > like given below, from the opened URL page in Python.
> >
> > {{ Infobox Software
> > | name = Bash
> > | logo = [[Image:bash-org.png|165px]]
> > | screenshot = [[Image:Bash demo.png|250px]]
> > | caption = Screenshot of bash and [[Bourne shell|sh]]
> > sessions demonstrating some features
> > | developer = [[Chet Ramey]]
> > | latest release version = 4.0
> > | latest release date = {{release date|mf=yes|2009|02|20}}
> > | programming language = [[C (programming language)|C]]
> > | operating system = [[Cross-platform]]
> > | platform = [[GNU]]
> > | language = English, multilingual ([[gettext]])
> > | status = Active
> > | genre = [[Unix shell]]
> > | source model = [[Free software]]
> > | license = [[GNU General Public License]]
> > | website = [http://tiswww.case.edu/php/chet/bash/
> > bashtop.html Home page]
> > }} //upto this line
> >
> > I need to extract all data between {{ Infobox ...to }}
> >
> > Thank's if anyone can help,
> > am trying with
> >
> > s1='{{ Infobox'
> > s2=len(s1)
> > pos1=data.find("{{ Infobox")
> > pos2=data.find("\n",pos2)
> >
> > pat1=data.find("}}")
> >
> > but am ending up getting one line at top only.
>
> How are you getting your data? Assuming that you can arrange to get
> it one line at a time, here's a quick and dirty way to extract the
> infoboxes on a page.
>
> infoboxes = []
> infobox = []
> reading_infobox = False
>
> for line in feed_me_lines_somehow():
>     if line.startswith("{{ Infobox"):
>         reading_infobox = True
>     if reading_infobox:
>         infobox.append(line)
>         if line.startswith("}}"):
>             reading_infobox = False
>             infoboxes.append(infobox)
>             infobox = []
>
> You end up with 'infoboxes' containing a list of all the infoboxes
> on the page, each held as a list of the lines of their content.
> For safety's sake you really should be using regular expressions
> rather than 'startswith', but I leave that as an exercise for the
> reader :-)
>

I agree that startswith isn't the right option, but for matching two
constant characters, I don't think re is necessary. I'd just do:

    if '}}' in line:
        pass

Then, as the saying goes, you only have one problem.

Cheers,
Cliff
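
A note on both end-of-infobox checks: the sample infobox in the original post has a nested {{release date|...}} template that opens and closes on a single line, so a plain `'}}' in line` test would end the infobox at that line, while `startswith("}}")` only works if the closing braces happen to sit at the start of a line. One possible way around both, still without re, is to count opening and closing braces. This is only a rough sketch: the function name `extract_infoboxes` and its `prefix` parameter are made up for illustration, the "{{ Infobox" prefix is taken from the original post, and the code assumes the text arrives one line at a time and that an infobox always starts at the beginning of a line.

```python
def extract_infoboxes(lines, prefix="{{ Infobox"):
    """Return a list of infoboxes, each held as a list of its lines.

    Hypothetical helper: tracks how many "{{" are still unclosed so that
    nested templates inside the infobox do not end it early.
    """
    infoboxes = []
    current = []
    depth = 0  # number of "{{" not yet matched by "}}"

    for line in lines:
        if depth == 0 and not line.startswith(prefix):
            continue  # not inside an infobox and this line does not start one
        current.append(line.rstrip("\n"))
        depth += line.count("{{") - line.count("}}")
        if depth <= 0:
            # every brace pair opened since the infobox began is now closed
            infoboxes.append(current)
            current = []
            depth = 0

    return infoboxes


# Shortened version of the sample from the original post.
sample = """Some intro text
{{ Infobox Software
| name = Bash
| latest release date = {{release date|mf=yes|2009|02|20}}
| status = Active
}}
Trailing text
"""

boxes = extract_infoboxes(sample.splitlines())
for box in boxes:
    print("\n".join(box))
```

As in the quick and dirty version quoted above, the result is a list of infoboxes, each a list of lines. Counting braces per line is still a simplification: it will not cope with an infobox that starts in the middle of a line or with braces that are not template markup.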