Note: there are 2 forwarded emails here. I sent the 2nd one to Sam only, instead of the list, by mistake.
X

-------- Original Message --------
Subject: Lecture List source data
Date: Tue, 13 Oct 2009 10:16:16 +0100
From: Ximin Luo <xl...@cam.ac.uk>
To: reporter.edi...@admin.cam.ac.uk

Hi,

I'm with a group of a few students who are planning to build a web-based timetable application for arranging supervisions and other university-related stuff. The idea is for students to be able to select which tripos / courses they are doing and have this data automatically be added to their timetable.

Unfortunately, the Reporter's Lecture-List isn't easily computer-readable; there are various quirks and inconsistencies which make it very complicated to process directly from the PDFs.

Do you have this data in a simpler format?

Thanks,
Ximin

-------- Original Message --------
Subject: Re: [Pidge-dev] good work
Date: Mon, 12 Oct 2009 19:23:36 +0100
From: Ximin Luo <xl...@cam.ac.uk>
To: Sam Davyson <samdavy...@gmail.com>
References: <20091007170651.18384.97...@slice.fergusrossferrier.co.uk> <d896124c0910080813w1a66c34dsf04e8729a7849...@mail.gmail.com> <4ad0ac72.1000...@cam.ac.uk> <354cb040910111627p567e9745v209c45c2b79f0...@mail.gmail.com>

several major issues on "parsing the lecture-lists".

- some subjects (eg. SPS) don't publish tables; instead they link to their own website. we'll need to write custom parsers for these.
- lots of courses make cross references to each other in their scheduling, such as "first 8 lectures" / "last 16 lectures". this can be detected manually and accounted for, but it does mean we can't just parse every course individually; we need to keep track of its context too.
- there are many typographical mistakes that make it a bitch to parse strings. for example, we have things like "TRIPO S", etc. this is possible to account for but would be a bitch to code.
- the document structure and titles are inconsistent; we get things like "ENGLISH TRIPOS, PART I" which is fine, but we also have things like "C COURSES", etc.
The layout is such that it's impossible *in general* (even for a human) to work out which ones are sub-parts and sub-sub-parts. One option would be to remove these groupings entirely and only have per-course data, but that would massively inconvenience the end user.

All-in-all, the situation to do with global lecture lists is a total mess. Ideally we would persuade the faculties to use a more consistent timetabling system, but it's unlikely that any of them will listen to us.

I think the best thing to do for now is to ask whoever writes the Reporter to provide us with the source of the data. Hopefully it will be slightly cleaner. I'll go do some more research in this direction.

X
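
A rough sketch in Python of how two of the parsing issues above might be handled: rejoining words split by a stray space (like "TRIPO S") and spotting cross-references such as "first 8 lectures". The word list `KNOWN_WORDS` and both function names are made up for illustration; a real parser would need a much larger vocabulary and more patterns, and this has not been run against the actual lecture lists.

```python
import re

# Hypothetical vocabulary of words expected in headings (an assumption,
# not taken from the Reporter); extend as needed.
KNOWN_WORDS = {"TRIPOS", "PART", "ENGLISH", "COURSES"}


def repair_split_words(heading, vocab=KNOWN_WORDS):
    """Rejoin two adjacent tokens when their concatenation is a known word."""
    tokens = heading.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            # Ignore trailing punctuation when checking the vocabulary.
            if joined.strip(",.") in vocab:
                out.append(joined)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)


# Matches phrases like "first 8 lectures" or "last 16 lectures".
LECTURE_RANGE = re.compile(r"\b(first|last)\s+(\d+)\s+lectures\b", re.IGNORECASE)


def find_lecture_ranges(text):
    """Return (which_end, count) pairs for each cross-reference found."""
    return [(m.group(1).lower(), int(m.group(2)))
            for m in LECTURE_RANGE.finditer(text)]
```

For example, `repair_split_words("ENGLISH TRIPO S, PART I")` gives back the heading with "TRIPOS," rejoined, and `find_lecture_ranges` only flags the cross-references; working out which course they point to still needs the surrounding context.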