On Wed, 2009-04-08 at 01:57 +0100, Rhodri James wrote:
> On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer
> <j...@sdf.lonestar.org> wrote:
>
> > On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
> >> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
> >> <anishchapag...@gmail.com> wrote:
> >>
> >> > Hi,
> >> > I was trying to extract wikipedia Infobox contents which is in format
> >> > like given below, from the opened URL page in Python.
> >> >
> >> > {{ Infobox Software
> >> > | name = Bash
> [snip]
> >> > | latest release date = {{release date|mf=yes|2009|02|20}}
> >> > | programming language = [[C (programming language)|C]]
> >> > | operating system = [[Cross-platform]]
> >> > | platform = [[GNU]]
> >> > | language = English, multilingual ([[gettext]])
> >> > | status = Active
> [snip some more]
> >> > }} //upto this line
> >> >
> >> > I need to extract all data between {{ Infobox ...to }}
>
> [snip still more]
>
> >> You end up with 'infoboxes' containing a list of all the infoboxes
> >> on the page, each held as a list of the lines of their content.
> >> For safety's sake you really should be using regular expressions
> >> rather than 'startswith', but I leave that as an exercise for the
> >> reader :-)
> >>
> >
> > I agree that startswith isn't the right option, but for matching two
> > constant characters, I don't think re is necessary. I'd just do:
> >
> > if '}}' in line:
> >     pass
> >
> > Then, as the saying goes, you only have one problem.
>
> That would be the problem of matching lines like:
>
> | latest release date = {{release date|mf=yes|2009|02|20}}
>
> would it? :-)
>
That's the one.

> A quick bit of timing suggests that:
>
> if line.lstrip().startswith("}}"):
>     pass
>
> is what we actually want.
>

Indeed. Thanks.

-- 
http://mail.python.org/mailman/listinfo/python-list
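
For anyone reading this in the archive, here is a minimal sketch of why the two tests differ. The sample lines are modelled on the infobox quoted above (with a slightly indented closing line added for illustration), and the helper names `closes_naive` and `closes_stripped` are made up for this example; they are not from anyone's posted code.

```python
# Sample lines modelled on the infobox quoted in the thread.
# The indented closing line is an assumption, added to show why lstrip() helps.
lines = [
    "{{ Infobox Software",
    "| name = Bash",
    "| latest release date = {{release date|mf=yes|2009|02|20}}",
    "| status = Active",
    "  }}",
]


def closes_naive(line):
    # Substring test: also fires on a nested template inside a field value.
    return '}}' in line


def closes_stripped(line):
    # Fires only when the line begins with }} once leading whitespace is dropped.
    return line.lstrip().startswith("}}")


naive_hits = [i for i, line in enumerate(lines) if closes_naive(line)]
stripped_hits = [i for i, line in enumerate(lines) if closes_stripped(line)]

print(naive_hits)     # [2, 4]: the release-date field is a false positive
print(stripped_hits)  # [4]: only the real closing line
```

Note that the `lstrip().startswith` test still assumes the closing `}}` sits at the start of its own line. An infobox whose last field ends with `}}` on the same line would need something smarter, such as counting brace depth.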