Note: there are 2 forwarded emails here. I sent the 2nd one to Sam only,
instead of the list, by mistake.

X

-------- Original Message --------
Subject: Lecture List source data
Date: Tue, 13 Oct 2009 10:16:16 +0100
From: Ximin Luo <xl...@cam.ac.uk>
To: reporter.edi...@admin.cam.ac.uk

Hi,

I'm with a group of a few students who are planning to build a web-based
timetable application for arranging supervisions and other university-related
events. The idea is for students to be able to select which tripos / courses
they are doing and have the matching lectures added to their timetable
automatically.

Unfortunately, the Reporter's Lecture-List isn't easily computer-readable;
there are various quirks and inconsistencies which make it very complicated to
process directly from the PDFs. Do you have this data in a simpler format?

Thanks,

Ximin

-------- Original Message --------
Subject: Re: [Pidge-dev] good work
Date: Mon, 12 Oct 2009 19:23:36 +0100
From: Ximin Luo <xl...@cam.ac.uk>
To: Sam Davyson <samdavy...@gmail.com>
References: <20091007170651.18384.97...@slice.fergusrossferrier.co.uk>  
<d896124c0910080813w1a66c34dsf04e8729a7849...@mail.gmail.com>   
<4ad0ac72.1000...@cam.ac.uk>
<354cb040910111627p567e9745v209c45c2b79f0...@mail.gmail.com>

there are several major issues with "parsing the lecture-lists":

- some subjects (e.g. SPS) don't publish tables; instead they link to their own
website. we'll need to write custom parsers for these.

- lots of courses make cross references to each other in their scheduling, such
as "first 8 lectures" / "last 16 lectures". this can be detected manually and
accounted for, but it does mean we can't just parse every course individually;
we need to keep track of its context too.

- there are many typographical mistakes that make it a bitch to parse strings.
for example, we have things like "TRIPO S", etc. it's possible to account for
these, but tedious to code.

- the document structure and titles are inconsistent; we get things like
"ENGLISH TRIPOS, PART I" which is fine, but we also have things like "C
COURSES", etc. The layout is such that it's impossible *in general* (even for a
human) to work out which ones are sub-parts and sub-sub-parts. One option would
be to remove these groupings entirely and only have per-course data, but that
would massively inconvenience the end user.
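
A rough sketch of how the typo and cross-reference issues might be handled,
in Python. Everything here is an assumption for illustration: the word list,
the function names and the regular expression are hypothetical and are not
taken from the Reporter or from any existing Pidge code. It only covers the
two example patterns quoted above ("TRIPO S" and "first 8 lectures" / "last
16 lectures"); real data would need a larger vocabulary and more patterns.

```python
import re

# Hypothetical vocabulary of words we expect in headings (an assumption,
# not a list published by the Reporter).
KNOWN_WORDS = {"ENGLISH", "TRIPOS", "PART", "COURSES"}

# Matches phrases such as "first 8 lectures" or "last 16 lectures".
CROSS_REF = re.compile(r"\b(first|last)\s+(\d+)\s+lectures\b", re.IGNORECASE)

PUNCTUATION = ",.;:"


def repair_split_words(heading, vocabulary=KNOWN_WORDS):
    """Rejoin two adjacent tokens when their concatenation is a known word."""
    tokens = heading.split()
    repaired = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if i + 1 < len(tokens) and current.strip(PUNCTUATION) not in vocabulary:
            joined = current + tokens[i + 1]
            if joined.strip(PUNCTUATION) in vocabulary:
                # e.g. "TRIPO" + "S," becomes "TRIPOS,"
                repaired.append(joined)
                i += 2
                continue
        repaired.append(current)
        i += 1
    return " ".join(repaired)


def find_cross_refs(text):
    """Return (position, count) pairs, e.g. ("first", 8), found in text."""
    return [(m.group(1).lower(), int(m.group(2)))
            for m in CROSS_REF.finditer(text)]
```

For instance, repair_split_words("ENGLISH TRIPO S, PART I") would rejoin the
split word, while a heading like "C COURSES" is left alone because "CCOURSES"
is not in the vocabulary. Detecting a cross reference this way only flags it;
working out which course it refers to still needs the surrounding context.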

All in all, the situation with the global lecture lists is a total mess.
Ideally we would persuade the faculties to use a more consistent timetabling
system, but it's unlikely that any of them will listen to us.

I think the best thing to do for now is to ask whoever writes the Reporter to
provide us with the source of the data. Hopefully it will be slightly cleaner.
I'll go and do some more research in this direction.

X

_______________________________________________
Mailing list: https://launchpad.net/~pidge-dev
Post to     : pidge-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~pidge-dev
More help   : https://help.launchpad.net/ListHelp
