Philip Semanchuk wrote:
On Mar 5, 2009, at 12:31 PM, bruce wrote:

hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.

Not if we're to understand the situation you're trying to describe. From what I can tell, you're saying that the target site displays different results each time your crawler visits it. It's as if e.g. the site knows about 100 courses but only displays 80 randomly chosen ones to each visitor. If that's the case, then it is truly bizarre.

    Agreed.  The course list isn't changing that rapidly.

    I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess.  Is that right?

    I've had to deal with that in JavaScript.  My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers.   There, I have to watch for page-change events
and update the annotations I'm adding to ads.

    But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format.  See

        http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
2.  Examine all the department pages under that page.
3.  On each page, look for the value of "coursesrc", like this:
        var coursesrc = '/ws/soc/api/classes/aest/20091'
4.  For each "coursesrc" value found, construct a URL like this:
        http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5.  Read that URL.  This will return the department's course list in
    JSON format.
6.  From the JSON tree, pull out CourseData items, which look like this:

"CourseData":
{"prefix":"AEST",
"number":"220",
"sequence":"B",
"suffix":{},
"title":"Advanced Leadership Laboratory II",
"description":"Additional exposure to the military experience for continuing AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the environment of an Air Force officer. Credit\/No Credit.",
"units":"1",
"restriction_by_major":{},
"restriction_by_class":{},
"restriction_by_school":{},
"CourseNotes":{},
"CourseTermNotes":{},
"prereq_text":"AEST-220A",
"coreq_text":{},
"SectionData":{"id":"41799",
"session":"790",
"dclass_code":"D",
"title":"Advanced Leadership Laboratory II",
"section_title":{},
"description":{},
"notes":{},
"type":"Lec",
"units":"1",
"spaces_available":"30",
"number_registered":"2",
"wait_qty":"0",
"canceled":"N",
"blackboard":"Y",
"comment":{},
"day":{},"start_time":"TBA",
"end_time":"TBA",
"location":"OFFICE",
"instructor":{"last_name":"Hampton","first_name":"Daniel"},
"syllabus":{"format":{},"filesize":{}},
"IsDistanceLearning":"N"}}},
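
Steps 3 and 4 above could be sketched roughly as follows. This is only an illustration, not the site's documented interface: the regular expression assumes "coursesrc" is always assigned as a quoted string literal, as in the one example shown, and the names BASE_URL, COURSESRC_RE and find_coursesrc_urls are made up for this sketch.

```python
import re
from urllib.parse import urljoin

# Host taken from the example URL in step 4.
BASE_URL = "http://web-app.usc.edu/"

# Assumption: the page assigns coursesrc as a quoted string literal, e.g.
#   var coursesrc = '/ws/soc/api/classes/aest/20091'
COURSESRC_RE = re.compile(r"""var\s+coursesrc\s*=\s*['"]([^'"]+)['"]""")


def find_coursesrc_urls(page_html):
    """Return an absolute URL for every coursesrc value found in a page."""
    return [urljoin(BASE_URL, path) for path in COURSESRC_RE.findall(page_html)]


# Stand-in for the text of one department page.
sample_page = "<script>var coursesrc = '/ws/soc/api/classes/aest/20091'</script>"
print(find_coursesrc_urls(sample_page))
```

Each URL returned this way is what step 5 reads to get the department's JSON.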

Parsing the JSON is left as an exercise for the student.  (There's
a Python module for that: "json", in the standard library since 2.6.)
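
For anyone who wants a head start on that exercise, here is a minimal sketch using the standard "json" module. It makes assumptions the sample above does not settle: that a "CourseData" value may be either a single object or a list of them, and that an empty field is always encoded as {} the way "suffix" is. The "wrapper" key in the sample text and the helper names iter_course_data and text_or_none are hypothetical.

```python
import json


def iter_course_data(node):
    """Yield every dict stored under a "CourseData" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "CourseData":
                # Assumption: the value is one course or a list of courses.
                items = value if isinstance(value, list) else [value]
                for item in items:
                    if isinstance(item, dict):
                        yield item
            else:
                yield from iter_course_data(value)
    elif isinstance(node, list):
        for child in node:
            yield from iter_course_data(child)


def text_or_none(value):
    """The sample encodes empty fields as {}; map anything non-string to None."""
    return value if isinstance(value, str) else None


# Trimmed copy of the sample record; the "wrapper" key is hypothetical.
sample_text = """
{"wrapper": {"CourseData": {"prefix": "AEST", "number": "220",
 "sequence": "B", "suffix": {},
 "title": "Advanced Leadership Laboratory II", "units": "1"}}}
"""

courses = list(iter_course_data(json.loads(sample_text)))
for course in courses:
    sequence = text_or_none(course["sequence"]) or ""
    print(course["prefix"], course["number"] + sequence, course["title"])
```

Against the live service, the text would come from reading one of the step 5 URLs (for example with urllib.request.urlopen) instead of from sample_text.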

And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.

                                        John Nagle