simple: screen-scraping with wget and lxml and PyQuery in Python, for ranking top universities

Kragen Javier Sitaker Thu, 05 Jul 2012 09:52:54 -0700

We were having a discussion about what, if anything, the USA is still #1 in the
world at — aside from deeply unfortunate things like the number of rapes, the
number of car thefts, the number of prison inmates, military spending, and so
on.


It occurred to me that the US is pretty indisputably #1 in hosting top
universities.  But the rules of the discussion required that you present
credible evidence.  So I screen-scraped the US News and World Report university
rankings web pages, with this one-line shell script:

    for page in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do wget "http://www.usnews.com/education/worlds-best-universities-rankings/top-400-universities-in-the-world?page=$page"; done

I wanted to write a program "top" that would extract just the names of the
countries, so that I could do this to rank the countries:

    ./top top-400-universities-in-the-world\?page\=* | sort | uniq -c | sort -nr
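In case the `sort | uniq -c | sort -nr` idiom is unfamiliar, here's a rough Python sketch of what it computes: count how often each line occurs, then list the most frequent first.  The names `rank` and `format_ranking` are made up for illustration, the tie-breaking is only an approximation of what `sort -nr` does, and none of this is part of the script I actually ran.

```python
from collections import Counter

def rank(lines):
    """Count identical non-blank lines and return (name, count) pairs,
    most frequent first; ties are broken in reverse alphabetical order."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

def format_ranking(ranking):
    """Render the pairs in a layout resembling `uniq -c` output."""
    return ['%7d %s' % (count, name) for name, count in ranking]
```

Feeding it the program's output would look something like `print('\n'.join(format_ranking(rank(sys.stdin))))`, after an `import sys`.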

But how to extract the country names?

The pages contain chunks of HTML that look more or less like this:

    <td class="row-odd row-first column-odd   university-name">
        <div>
            <a target="blank" href="http://www.topuniversities.com/institution/university-cambridge">University of Cambridge</a>
            <p>United Kingdom</p>
        </div>
    </td>

This makes it a little bit difficult to extract the country name with just
a regular expression, since the countries are among many things that are on a
line by themselves in a `<p>`.  With jQuery, it's easy, but jQuery is not easy
to use from the command line.  In the past, I would have used BeautifulSoup,
but Leonard has stopped maintaining BeautifulSoup, so I figured it was probably
time to switch.  And I remembered that there's a library called "PyQuery",
which is an implementation of much of jQuery, but for Python, so it's easy to
invoke from the command line.
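To make the regular-expression problem concrete, here's a small sketch that uses only the Python standard library as a stand-in for the selector approach; it is not PyQuery and not the script I ended up with.  The sample HTML is hypothetical: it is the cell shown above plus an invented unrelated `<p>`, and the names `CountryParser`, `naive_paragraphs`, and `scoped_countries` are made up for illustration.  It also assumes the class sits on a `<td>` and that tables aren't nested.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one ranking cell plus an unrelated paragraph.
SAMPLE = """
<p>Rankings methodology</p>
<table><tr>
<td class="row-odd row-first column-odd   university-name">
    <div>
        <a target="blank" href="http://www.topuniversities.com/institution/university-cambridge">University of Cambridge</a>
        <p>United Kingdom</p>
    </div>
</td>
</tr></table>
"""

class CountryParser(HTMLParser):
    """Collect the text of <p> elements inside <td class="... university-name">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # inside a university-name cell?
        self.in_p = False      # inside a <p> within such a cell?
        self.countries = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'td':
            self.in_cell = 'university-name' in classes
        elif tag == 'p' and self.in_cell:
            self.in_p = True
            self.countries.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.countries[-1] += data

def naive_paragraphs(html):
    """The regex approach: grabs every one-line <p>, wanted or not."""
    return re.findall(r'<p>(.*?)</p>', html)

def scoped_countries(html):
    """Only the <p> text that sits inside a university-name cell."""
    parser = CountryParser()
    parser.feed(html)
    return [c.strip() for c in parser.countries]
```

On this sample, `naive_paragraphs(SAMPLE)` picks up the unrelated paragraph as well as the country, while `scoped_countries(SAMPLE)` returns just the country.  A CSS selector like `.university-name p` expresses the same scoping in one short string, which is the appeal of the jQuery style.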

It was a bit tricky to get PyQuery working properly, but after some
experimentation, I ended up with a very simple script:

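    #!/usr/bin/python
    """Find the countries of the top universities from pages such as
    <http://www.usnews.com/education/worlds-best-universities-rankings/top-400-universities-in-the-world?page=1>
    """
    import pyquery, lxml.html, sys

    def query(filename):
        # Read the saved page as bytes so lxml can work out the encoding,
        # then wrap the parsed document for jQuery-style queries.
        with open(filename, 'rb') as f:
            return pyquery.PyQuery(lxml.html.document_fromstring(f.read()))

    def countries(filename):
        d = query(filename)
        # Each country is the text of a <p> inside a .university-name cell.
        return (p.text for p in d('.university-name p'))

    if __name__ == '__main__':
        # Skip sys.argv[0], which is this script rather than a saved page.
        for filename in sys.argv[1:]:
            for country in countries(filename):
                print(country)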

Oh, and the results:

     85 United States
     43 United Kingdom
     36 Germany
     21 Australia
     19 France
     17 Canada
     16 Japan
     12 Netherlands
     10 South Korea
      9 Italy
      9 China
      8 Switzerland
      8 Spain
      7 Sweden
      7 Finland
      7 Belgium
      6 Taiwan
      6 New Zealand
      6 India
      6 Hong Kong
      5 Russia
      5 Ireland
      5 Denmark
      4 Norway
      4 Malaysia
      4 Israel
      4 Austria
      3 Saudi Arabia
      3 Brazil
      3 Argentina
      2 Thailand
      2 South Africa
      2 Singapore
      2 Philippines
      2 Mexico
      2 Indonesia
      2 Chile
      1 United Arab Emirates
      1 Portugal
      1 Poland
      1 Oman
      1 Lebanon
      1 Hungary
      1 Greece
      1 Czech Republic
