Re: How can I create a list of programmes from BBC Sounds

Jeremy Nicoll - ml gip Wed, 24 May 2023 13:42:23 -0700

On 2023-05-20 18:37, Budge wrote:

Using the pid for the entire archive I can get a list of the entire
archive but I have not found out how to sort the list by genre or
obtain a list of a single genre.


It looks to me as if "genre" is an arbitrary tag that the BBC do not
expose on any of the html pages - it must exist in their database of
programmes but nowhere else.

Can anybody please suggest how I might obtain the list.


Use curl or wget to request all the pages that one could fetch manually
for a specific genre, eg for science:

 https://www.bbc.co.uk/programmes/p01gyd7j?page=1
 https://www.bbc.co.uk/programmes/p01gyd7j?page=2
 https://www.bbc.co.uk/programmes/p01gyd7j?page=3
 ...

until you get the "page not found" page. Then examine the html and
extract the episode name, pid (and maybe episode description) for each
one.
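
As a rough sketch of that fetch loop in Python (standard library only):
it assumes the "page not found" page arrives with an HTTP 404 status,
which should be checked against the real site. The names fetch_page,
fetch_all_pages and max_pages are made up for this illustration.

```python
import urllib.error
import urllib.request

# The science listing URL quoted above; another genre would need its own pid.
BASE_URL = "https://www.bbc.co.uk/programmes/p01gyd7j"


def fetch_page(url):
    """Return the page's html as text, or None if the server answers 404.

    Assumption: a missing page is signalled by HTTP status 404.
    """
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is unexpected, so let it surface


def fetch_all_pages(base_url, fetch=fetch_page, max_pages=500):
    """Fetch ?page=1, ?page=2, ... and stop at the first missing page.

    max_pages is a hypothetical safety limit so that a change on the site
    cannot make the loop run forever.
    """
    pages = []
    for number in range(1, max_pages + 1):
        html = fetch(f"{base_url}?page={number}")
        if html is None:
            break
        pages.append(html)
    return pages
```

Calling fetch_all_pages(BASE_URL) would return one string of html per
page. The fetch parameter is only there so the loop can be tried out
with a stand-in function instead of hitting the web server.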

It looks easy(ish) for anyone who can write a computer program; the
drawback of this approach is that you have to study the html quite
carefully to find how its internal structure replicates for each entry
on a page.

But broadly speaking it'll be a whole load of html for all the stuff
that's at the start of a page, then stuff for the start of that page's
list, then the entries (though that may also be more complex than needed
because of the way that they get laid out in rows), then the end of that
page's list, then the stuff at the end of every page.

How you then identify those areas on an html page depends greatly on the
programming language you use, and how complex you want your code to be.

There's also a problem that the (no doubt) machine-generated html on
these pages will quite likely change its layout quite often.

So ... if you need to run your indexer fairly often (to pick up new
entries or just to check that it still produces the same answers as last
time) you need ideally to have your scanning program able to test if its
assumptions about the layout of the entries are still reasonable.
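
One possible shape for such a self-check, as a sketch only: it takes
the (pid, episode name) pairs that a scan of one page produced and
reports anything that looks wrong. The limit of 24 entries per page and
the 8-character pid shape are assumptions drawn from the examples in
this post, and layout_warnings is a made-up name.

```python
import re

# Assumed pid shape: a lowercase letter followed by seven lowercase
# letters or digits, as in m001l291 or p01gyd7j.
PID_SHAPE = re.compile(r"^[a-z][a-z0-9]{7}$")


def layout_warnings(entries, max_per_page=24):
    """Return a list of warnings about one page's (pid, name) pairs."""
    warnings = []
    if not entries:
        warnings.append("no entries found")
    if len(entries) > max_per_page:
        warnings.append(
            f"{len(entries)} entries found, expected at most {max_per_page}"
        )
    for pid, name in entries:
        if not PID_SHAPE.match(pid):
            warnings.append(f"odd-looking pid: {pid!r}")
        if not name:
            warnings.append(f"no episode name found for {pid}")
    return warnings
```

An empty list back means the assumptions still hold for that page;
anything else is a hint to look at the html again before trusting the
results.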

Adding to the problem, perhaps, is that html doesn't need to be arranged
in a file in the sort of line-by-line layout that one might see if - in
a browser - one does a "view source" for a page. Things that one day
seem to be on two consecutive lines might on another day be more or less
spread out.

In some cases when I've extracted stuff from html pages I've started off
by eg replacing long runs of repeated spaces by single spaces, and
removing completely some parts of the html because - for what I wanted -
it just muddied the water. In some cases it made more sense to introduce
more line breaks so the file I then scanned had many more, but shorter,
lines than the html that I got back from the web server.
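
A minimal sketch of that sort of clean-up in Python, assuming it is
good enough to treat every "<" as the start of a tag (not true inside
scripts or some attribute values, so this is rough). normalise_html is
a made-up name.

```python
import re


def normalise_html(html):
    """Collapse whitespace runs and start a new line before every tag."""
    # Any run of spaces, tabs or newlines becomes a single space.
    text = re.sub(r"\s+", " ", html)
    # Put each "<" at the start of its own line, dropping the space before it.
    text = re.sub(r"\s*<", "\n<", text)
    return text.strip()
```

After this, the number of lines between two tags no longer depends on
how the server happened to wrap the file, only on how many tags sit
between them.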

But for example, on page 1 of the Science lists, assuming that none of
the relevant links span a line-break, there are 25 occurrences of

 href="https://www.bbc.co.uk/programmes/

nearly all of them occurring at 45 line intervals. The first one is in
the "<head>" section of the page, so that leaves 24 of them in the
<body> part, corresponding to the 24 programmes described.

The place where two consecutive occurrences are not 45 lines apart is in
the last row, where one episode is a repeat and doesn't have a "play"
button. You'd need to see how its html doesn't follow the structure of
the entries that do have such a button, and take that into account in
any code you write (AND look out for other unexpected differences).
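
For anyone without an editor that shows this, a few lines of Python can
report the same thing. This is a sketch; marker_lines and gaps are
made-up names, and it assumes (as above) that no link spans a
line-break.

```python
MARKER = 'href="https://www.bbc.co.uk/programmes/'


def marker_lines(html, marker=MARKER):
    """Return the 1-based numbers of the lines that contain the marker."""
    return [
        number
        for number, line in enumerate(html.splitlines(), start=1)
        if marker in line
    ]


def gaps(line_numbers):
    """Return the distance between each consecutive pair of line numbers."""
    return [later - earlier for earlier, later in zip(line_numbers, line_numbers[1:])]
```

Printing gaps(marker_lines(html)) for a saved page shows at a glance
which entries break the usual interval.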


Those 24 literals all have a pid in them so really look like eg

       href="https://www.bbc.co.uk/programmes/m001l291"
       href="https://www.bbc.co.uk/programmes/m001jc68"
       href="https://www.bbc.co.uk/programmes/m001hnlf"
       ...

I'd /guess/ that one could eg take such a file, skip past its first 300
or so lines (the first meaningful pid line is around line 330 at the
moment), then repeatedly scan forwards looking for

 href="https://www.bbc.co.uk/programmes/    and

read the pid that immediately follows that, then scan forwards for what
marks the start of the episode name (but not scan more than - say - ten
lines if - at the moment - you'd expect to find episode name in the next
(say) 5 lines), then similarly scan for the episode description's start.

Repeat until you fall off the end of that page's list.
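
The scan described above might look something like this in Python. It
is only a sketch: TITLE_RE is a hypothetical pattern standing in for
whatever really marks the episode name on these pages, the pid pattern
is a guess based on the examples above, and extract_entries is a
made-up name. Scanning for the description would follow the same
pattern.

```python
import re

# Assumed pid form: lowercase letters and digits, ended by the closing quote.
PID_RE = re.compile(r'href="https://www\.bbc\.co\.uk/programmes/([a-z0-9]+)"')
# Hypothetical markup for the episode name; check it against the real html.
TITLE_RE = re.compile(r'class="title">([^<]+)<')


def extract_entries(lines, skip=300, lookahead=10, title_re=TITLE_RE):
    """Return (pid, name) pairs; name is None if none is found nearby."""
    entries = []
    for i in range(skip, len(lines)):
        pid_match = PID_RE.search(lines[i])
        if pid_match is None:
            continue
        name = None
        # Look on the pid line and at most `lookahead` lines after it.
        for j in range(i, min(i + 1 + lookahead, len(lines))):
            if j > i and PID_RE.search(lines[j]):
                break  # reached the next entry without finding a name
            title_match = title_re.search(lines[j])
            if title_match:
                name = title_match.group(1).strip()
                break
        entries.append((pid_match.group(1), name))
    return entries
```

It would be called as extract_entries(html.splitlines()). An entry that
comes back with None for its name is exactly the sort of case (like the
repeat without a "play" button) that needs looking at by hand.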

If you can't program, eg in any version of BASIC, or python or perl or
... anything, maybe this would be a good time to learn how to?  Your
code would not need to be elegant or sophisticated ... just work.


For all I know, there may be utilities into which one could drag an
html page, and then manipulate it reasonably easily to extract the data
you want.  The trouble is, I don't know my way around tools that I don't
use.


I /do/ use a programmers' text editor; that's what showed me at a
glance that the instances of

 href="https://www.bbc.co.uk/programmes/

are at a specific repeating interval (though I expected that they
would be, more or less).  I'm sure that some other editors would show
the same thing but in different ways - many would probably let you
find successive such lines, but not simultaneously show how far apart
they all are.

Any sophisticated text editor has a steep learning curve and - if eg
you only use a very basic one - it's hard to know whether you'd benefit
from acquiring another one, and impossible to recommend one.  Typically,
when I've periodically looked at others to see what they offer, some
feel "right" but not versatile enough, and some feel "alien" in some
way and I never explore what they can do, and some are more versatile
than what I use now, but also far too complex.

Some - including the one I use - are scriptable.  Often that just means
that you can tell it to save a temporary copy of the file you're
editing, run an external program against it, then display the results
(eg in another tab) - which is better than nothing, but limiting.

Mine though has a programming language built into it and that can
examine the data that the editor is displaying (so it avoids the
overhead of making an external copy and later reading results), and
manipulate it.  But to do that, you need to know not just the
programming language, but also the internal form of commands that
otherwise one might just normally issue via menus in the editor's
interface, and how to ask the editor what those commands did.  I've used
this editor (& a similar-looking one with a different command-set) for
over 30 years.

Bear in mind that if you fetch the html for such a page you don't get
all the javascript, images, CSS, etc that make what a browser sees and
does so complex.  That can make this sort of thing really hard to do (if
how a page behaves depends on both CSS and Javascript) but in fact all
the info you need on one of these pages IS in just the html.

--
Jeremy Nicoll - my opinions are my own

_______________________________________________
get_iplayer mailing list
get_iplayer@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/get_iplayer
