Re: [backstage] Plain text or easy-to-parse news articles
Liam S Docherty wrote:
> The current format of the news articles does not parse well at all, and the articles are rather difficult to extract from the surrounding mark-up. The simplified version suffers from the same problem, so I was wondering: are there any nice HTML versions, or even plain-text versions, of the news articles? Or does anyone know of a way around this problem?

Have you looked at the low graphics version? eg http://news.bbc.co.uk/1/low/world/americas/6918490.stm

S

-
Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html. Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
Re: [backstage] Plain text or easy-to-parse news articles
Liam,

I'm having a similar issue in that I wish to parse to SVG, and this could be easier... in fact, in large part it's a problem with the specifications...

cheers
Jonathan Chetwynd

On 27 Jul 2007, at 09:48, Liam S Docherty wrote:
> The low graphics version suffers from the same problem, in that the HTML is not considered well formed by the standard Java parsers. I suppose I could try to tidy up the HTML before parsing =)
> Thanks
> Liam
Re: [backstage] Plain text or easy-to-parse news articles
The low graphics version suffers from the same problem, in that the HTML is not considered well formed by the standard Java parsers. I suppose I could try to tidy up the HTML before parsing =)

Thanks
Liam

> Have you looked at the low graphics version? eg http://news.bbc.co.uk/1/low/world/americas/6918490.stm
> S

--
The University of Stirling is a university established in Scotland by charter at Stirling, FK9 4LA. Privileged/Confidential Information may be contained in this message. If you are not the addressee indicated in this message (or responsible for delivery of the message to such person), you may not disclose, copy or deliver this message to anyone and any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. In such case, you should destroy this message and kindly notify the sender by reply email. Please advise immediately if you or your employer do not consent to Internet email for messages of this kind.
Re: [backstage] Plain text or easy-to-parse news articles
Liam S Docherty wrote:
> The low graphics version suffers from the same problem, in that the HTML is not considered well formed by the standard Java parsers. I suppose I could try to tidy up the HTML before parsing =)

In this situation, I'd always suggest BeautifulSoup, but I'm afraid that's Python, not Java. But in case it's useful anyway, here's a simple script that extracts the heading and body of a BBC news story with it:

--8<--
#!/usr/local/bin/python
from BeautifulSoup import BeautifulSoup
import urllib

# Fetch the raw HTML of the story
f = urllib.urlopen('http://news.bbc.co.uk/1/hi/uk_politics/6918266.stm')
html = f.read()
f.close()

soup = BeautifulSoup(html)
# The story sits in the second 629-pixel-wide table on the page
table = soup.findAll('table', width=629)[1]
heading = table.find('div', {'class':'sh'}) # Perhaps mxb?
body = table.findAll('tr')[1].find('td')
# Strip the embedded boxes (class "mvtb") out of the body
crufts = body.findAll('div', {'class':'mvtb'})
for cruft in crufts:
    cruft.extract()
print "%s\n\n%s" % (heading.renderContents(), body.renderContents())
--8<--

ATB,
Matthew

--
http://www.dracos.co.uk/
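For anyone without BeautifulSoup to hand, roughly the same idea can be sketched with only the Python 3 standard library, whose html.parser module copes with markup that is not well formed. This is a minimal sketch, not the script above: the class name DivTextExtractor, the helper extract_text, the sample markup and the "body" class are all made up for illustration, and only the "sh" and "mvtb" class names come from the message. It runs on an inline sample rather than fetching a live page.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside every <div> carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__(convert_charrefs=True)
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a wanted <div>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested divs too, so the matching </div> is found correctly.
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, wanted_class):
    """Return the whitespace-joined text of all divs with wanted_class."""
    parser = DivTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical, deliberately sloppy markup: unclosed <p> and <b>,
# unquoted attribute values.
SAMPLE = """
<table width=629><tr><td>
<div class="sh">Example headline</div>
<div class=mvtb>Related links</div>
<div class="body"><p>First paragraph.<p>Second <b>paragraph.
</div>
</td></tr></table>
"""

headline = extract_text(SAMPLE, "sh")
body = extract_text(SAMPLE, "body")
print(headline)
print(body)
```

The parser never raises on the unclosed tags; it simply reports start tags, end tags and text as it meets them, which is why the extractor only needs to track how deep it is inside the wanted div.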
Re: [backstage] Plain text or easy-to-parse news articles
I understand that the BBC tracks external links in order to provide stats to respond to the Graf report's requirement for the BBC to link externally more often, and become part of the web. That's done with something in the footer, which automagically rewrites external links to have go tracking on, iirc, i.e. it puts that funny http://bbc.co.uk/go/tag/anothertag/tag3/-/externalsite.com/blah/page.html URL in.
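If you are scraping links and want the real destination rather than the tracking URL, something like the following might do. It is only a sketch based on the single example URL above: it assumes the external address always follows a "/-/" separator and carries no scheme of its own, neither of which is confirmed anywhere in this thread. The function name external_target is made up.

```python
from urllib.parse import urlsplit

GO_MARKER = "/-/"  # assumed separator, taken from the example URL in the message


def external_target(go_url):
    """Return the external URL wrapped in a /go/ tracking link, or None."""
    parts = urlsplit(go_url)
    if not parts.path.startswith("/go/") or GO_MARKER not in parts.path:
        return None
    # Everything after the first "/-/" is treated as the wrapped address.
    target = parts.path.split(GO_MARKER, 1)[1]
    if parts.query:
        target += "?" + parts.query
    # The example carries no scheme on the wrapped address, so http is assumed.
    if not target.startswith(("http://", "https://")):
        target = "http://" + target
    return target


example = "http://bbc.co.uk/go/tag/anothertag/tag3/-/externalsite.com/blah/page.html"
print(external_target(example))
```

Links that are not go-tracking URLs come back as None, so the caller can fall back to the original address.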
Re: [backstage] Plain text or easy-to-parse news articles
I'd imagine stats on which stories are clicked are quite valuable, particularly when Moreover are ranking the stories.

I understand that the BBC tracks external links in order to provide stats to respond to the Graf report's requirement for the BBC to link externally more often, and become part of the web.

This is probably a goal of the Web 2.0 stuff going on around the BBC as well.

J

On 27/7/07 10:19, Sean Dillon [EMAIL PROTECTED] wrote:
> http://news.bbc.co.uk/1/low/world/americas/6918490.stm
> I know this is totally off topic, but I notice that the links to external stories are actually being redirected through moreover.com rather than linking directly to the site in question (even if it does go through the internal Beeb redirect tracker).
> Is anyone aware of any reason why they do not link directly to the story on the relevant site instead?
> Cheers
> Seán
RE: [backstage] Plain text or easy-to-parse news articles
> Is anyone aware of any reason why they do not link directly to the story on the relevant site instead?

The journalists working on the relevant news story pick the related links to publish alongside their piece. However, they use a tool to help them with this task, which suggests stories to them (from thousands of sources). This tool is based on a feed of worldwide news sources (online newspapers etc) supplied to the BBC by Moreover. BBC News have been using it for a number of years.
Re: [backstage] Plain text or easy-to-parse news articles
> http://news.bbc.co.uk/1/low/world/americas/6918490.stm

I know this is totally off topic, but I notice that the links to external stories are actually being redirected through moreover.com rather than linking directly to the site in question (even if it does go through the internal Beeb redirect tracker).

Is anyone aware of any reason why they do not link directly to the story on the relevant site instead?

Cheers
Seán
RE: [backstage] Plain text or easy-to-parse news articles
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Steve Jolly
Sent: 27 July 2007 08:42
To: backstage@lists.bbc.co.uk
Subject: Re: [backstage] Plain text or easy-to-parse news articles

> Liam S Docherty wrote:
>> The current format of the news articles does not parse well at all, and the articles are rather difficult to extract from the surrounding mark-up. The simplified version suffers from the same problem, so I was wondering: are there any nice HTML versions, or even plain-text versions, of the news articles? Or does anyone know of a way around this problem?
> Have you looked at the low graphics version? eg http://news.bbc.co.uk/1/low/world/americas/6918490.stm

You could also try the mobile XHTML version, which should be more compliant. E.g. http://news.bbc.co.uk/mobile/bbc_news/top_stories/691/69184/story6918490.shtml?

For those who are interested, there are also RSS 2.0 feeds which point to these versions: http://news.bbc.co.uk/mobile/bbc_news/top_stories/rss20xhtml.xml

There is a feed for each directory on the mobile XHTML site, all following the same URL format. E.g. http://news.bbc.co.uk/mobile/bbc_news/england/nwyl/north/cumbria/rss20xhtml.xml

The site itself is at http://www.bbc.co.uk/mobile/index.shtml

Cheers,
- Chris
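As a rough illustration of how the RSS 2.0 feeds could be used to find the XHTML story pages, here is a small sketch using Python's standard xml.etree.ElementTree. The sample feed below is invented for the example (only the story URL is taken from the message, with its trailing "?" left off), the function name story_links is made up, and the real feeds may carry more elements than shown. It parses an inline string so it runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example story</title>
      <link>http://news.bbc.co.uk/mobile/bbc_news/top_stories/691/69184/story6918490.shtml</link>
    </item>
  </channel>
</rss>
"""


def story_links(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    pairs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip items that carry no link at all
            pairs.append((title, link))
    return pairs


for title, link in story_links(SAMPLE_FEED):
    print(title, link)
```

Each link returned this way could then be fetched and parsed as XHTML, which is the point of using the mobile version in the first place.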