Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Steve Jolly

Liam S Docherty wrote:

the current format of news articles does not
parse well at all, not to mention that the articles are rather difficult to
extract from the surrounding mark-up code.  The simplified version suffers
from the same problem, so I was wondering: are there any nice html versions
or even plain text versions of the news articles?  Or does anyone know of a
way around this problem?


Have you looked at the low graphics version? eg

http://news.bbc.co.uk/1/low/world/americas/6918490.stm

S
-
Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, please 
visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/


Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread ~:'' ありがとうございました 。

Liam,

I'm having a similar issue in that I wish to parse to SVG, and this
could be easier...

In fact, in large part it's a problem with the specifications...

cheers

Jonathan Chetwynd



On 27 Jul 2007, at 09:48, Liam S Docherty wrote:

The low graphics version suffers from the same problem, in that the html
is not considered well formed by the standard Java parsers.  I suppose I
could try tidying up the html before parsing =)

Thanks

Liam




Have you looked at the low graphics version? eg

http://news.bbc.co.uk/1/low/world/americas/6918490.stm

S




--
The University of Stirling is a university established in Scotland by
charter at Stirling, FK9 4LA.  Privileged/Confidential Information may
be contained in this message.  If you are not the addressee indicated
in this message (or responsible for delivery of the message to such
person), you may not disclose, copy or deliver this message to anyone
and any action taken or omitted to be taken in reliance on it, is
prohibited and may be unlawful.  In such case, you should destroy this
message and kindly notify the sender by reply email.  Please advise
immediately if you or your employer do not consent to Internet email
for messages of this kind.



Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Liam S Docherty
The low graphics version suffers from the same problem, in that the html
is not considered well formed by the standard Java parsers.  I suppose I
could try tidying up the html before parsing =)

Thanks

Liam



 Have you looked at the low graphics version? eg

 http://news.bbc.co.uk/1/low/world/americas/6918490.stm

 S





Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Matthew Somerville

Liam S Docherty wrote:

The low graphics version suffers from the same problem, in that the html
is not considered well formed by the standard Java parsers.  I suppose I
could try tidying up the html before parsing =)


In this situation, I'd always suggest BeautifulSoup, but I'm afraid that's 
Python, not Java. But in case it's useful anyway, here's a simple script 
that extracts the heading and body of a BBC news story with it:


--8<--
#!/usr/local/bin/python
# Python 2 script using BeautifulSoup 3.

from BeautifulSoup import BeautifulSoup
import urllib

# Fetch the story page.
f = urllib.urlopen('http://news.bbc.co.uk/1/hi/uk_politics/6918266.stm')
html = f.read()
f.close()

soup = BeautifulSoup(html)
# Take the second table with width=629, then its heading div and body cell.
table = soup.findAll('table', width=629)[1]
heading = table.find('div', {'class':'sh'}) # Perhaps mxb?
body = table.findAll('tr')[1].find('td')
# Strip the div.mvtb elements (the "cruft") out of the body.
crufts = body.findAll('div', {'class':'mvtb'})
for cruft in crufts:
    cruft.extract()
print "%s\n\n%s" % (heading.renderContents(), body.renderContents())
--8<--
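
The key idea is a lenient parser that does not insist on well-formed markup. As a rough sketch of the same approach without any third-party library, assuming Python 3 and its standard-library html.parser module, something like the following collects the visible text from a page even when tags are left unclosed. The TextExtractor class, the extract_text helper and the sample markup are all hypothetical names and data made up for illustration; they are not taken from a real BBC page and do no BBC-specific filtering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, tolerating unclosed or mismatched tags."""

    SKIP = {"script", "style"}  # elements whose contents are not article text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text runs found in html, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical, deliberately malformed markup (unclosed div, p, tr and table).
sample = "<table><tr><td><div class=sh>Headline<p>First paragraph<br>Second line</td>"
print(extract_text(sample))
```

This only shows that unclosed tags need not stop a parser; picking out the heading and body of a real story would still need page-specific rules like the ones in the script above.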

ATB,
Matthew
--
http://www.dracos.co.uk/



Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Kim Plowright
 I understand that the BBC tracks external links in order to provide stats to
 respond to the Graf report's requirement for the BBC to link externally more
 often, and become part of the web.

That's done with something in the footer, which automagically rewrites
external links to have go tracking on, iirc.

ie, puts that funny
http://bbc.co.uk/go/tag/anothertag/tag3/-/externalsite.com/blah/page.html
url in.
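
Going by the shape of that example URL alone, the external address appears to follow a "/-/" separator. A minimal sketch of pulling it back out, assuming that separator is always present and that the target carries no scheme of its own (both are assumptions drawn from the single example above, and external_target is a hypothetical helper name):

```python
def external_target(go_url, marker="/-/"):
    """Return the external address embedded in a go-tracking URL, or None."""
    head, sep, tail = go_url.partition(marker)
    # Only treat it as a tracking URL if it has a /go/ prefix and a target.
    if not sep or not tail or "/go/" not in head:
        return None
    # The example target has no scheme, so assume plain http when none is given.
    return tail if "://" in tail else "http://" + tail


example = "http://bbc.co.uk/go/tag/anothertag/tag3/-/externalsite.com/blah/page.html"
print(external_target(example))
```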


Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Jason Cartwright
I'd imagine stats on which story is clicked are quite valuable, particularly
when Moreover are ranking the stories.

I understand that the BBC tracks external links in order to provide stats to
respond to the Graf report's requirement for the BBC to link externally more
often, and become part of the web. This is probably a goal of the Web 2.0
stuff going on around the BBC as well.

J


On 27/7/07 10:19, Sean Dillon [EMAIL PROTECTED] wrote:

 http://news.bbc.co.uk/1/low/world/americas/6918490.stm
 
 I know this is totally off topic but I notice that the links to external
 stories are actually being redirected through moreover.com rather than
 linking directly to the site in question (even if they do go through the
 internal Beeb redirect tracker)
 
 Is anyone aware of any reason why they do not link directly to the story
 on the relevant site instead?
 
 Cheers
 
 Seán
 


RE: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Jeremy Stone



Is anyone aware of any reason why they do not link directly to the story 
on the relevant site instead?

The journalists working on the relevant news story pick the related links to
publish alongside their piece. However, they use a tool to help them in this
task, which suggests stories to them (from thousands of sources). This tool
is based on a feed of worldwide news sources (online newspapers etc) supplied
to the BBC by Moreover. BBC News have been using it for a number of years.





Re: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Sean Dillon

http://news.bbc.co.uk/1/low/world/americas/6918490.stm


I know this is totally off topic but I notice that the links to external
stories are actually being redirected through moreover.com rather than
linking directly to the site in question (even if they do go through the
internal Beeb redirect tracker)


Is anyone aware of any reason why they do not link directly to the story 
on the relevant site instead?


Cheers

Seán



RE: [backstage] Plain text or easy-to-parse news articles

2007-07-27 Thread Chris Yanda
 

-----Original Message-----
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Steve Jolly
Sent: 27 July 2007 08:42
To: backstage@lists.bbc.co.uk
Subject: Re: [backstage] Plain text or easy-to-parse news articles

Liam S Docherty wrote:
 the current format of news articles does not parse well at all, not to
 mention that the articles are rather difficult to extract from the
 surrounding mark-up code.  The simplified version suffers from the same
 problem, so I was wondering: are there any nice html versions or even
 plain text versions of the news articles?  Or does anyone know of a way
 around this problem?

Have you looked at the low graphics version? eg

http://news.bbc.co.uk/1/low/world/americas/6918490.stm


You could also try the mobile xhtml version which should be more
compliant. E.g.

http://news.bbc.co.uk/mobile/bbc_news/top_stories/691/69184/story6918490.shtml?

For those who are interested, there are also RSS 2.0 feeds which point
to these versions:
http://news.bbc.co.uk/mobile/bbc_news/top_stories/rss20xhtml.xml
There is a feed for each directory on the mobile xhtml site all
following the same URL format. E.g.
http://news.bbc.co.uk/mobile/bbc_news/england/nwyl/north/cumbria/rss20xhtml.xml
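
As a small sketch of how that URL format could be used, assuming Python 3 and only the standard library: feed_url builds a feed address from a directory path following the two examples above, and item_links pulls the title and link out of each item in an RSS 2.0 document. Both helper names and the sample feed text are hypothetical, made up for illustration; the real feeds may carry more fields than this.

```python
import xml.etree.ElementTree as ET

BASE = "http://news.bbc.co.uk/mobile/bbc_news"


def feed_url(directory):
    """Build the RSS 2.0 feed URL for a directory on the mobile xhtml site."""
    return "%s/%s/rss20xhtml.xml" % (BASE, directory.strip("/"))


def item_links(rss_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


# Hypothetical feed content, for illustration only.
sample = """<rss version="2.0"><channel><title>Example</title>
<item><title>Story one</title><link>http://example.org/1</link></item>
<item><title>Story two</title><link>http://example.org/2</link></item>
</channel></rss>"""

print(feed_url("top_stories"))
print(feed_url("england/nwyl/north/cumbria"))
print(item_links(sample))
```

Because the mobile pages are meant to be xhtml, the linked stories may also load in a strict XML parser, though that is not something this sketch checks.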

The site itself is at http://www.bbc.co.uk/mobile/index.shtml


Cheers,

- Chris




