We were having a discussion about what, if anything, the USA is still #1 in the
world at — aside from deeply unfortunate things like the number of rapes, the
number of car thefts, the number of prison inmates, military spending, and so
on.
It occurred to me that the US is pretty indisputably #1 in hosting top
universities. But the rules of the discussion required that you present
credible evidence. So I screen-scraped the US News and World Report university
rankings web pages, with this one-line shell script:
for page in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do wget "http://www.usnews.com/education/worlds-best-universities-rankings/top-400-universities-in-the-world?page=$page"; done
I wanted to write a program "top" that would extract just the names of the
countries, so that I could do this to rank the countries:
./top top-400-universities-in-the-world\?page\=* | sort | uniq -c | sort -nr
But how to extract the country names?
The pages contain chunks of HTML that look more or less like this:
<td class="row-odd row-first column-odd university-name">
  <div>
    <a target="blank" href="http://www.topuniversities.com/institution/university-cambridge">University of Cambridge</a>
    <p>United Kingdom</p>
  </div>
</td>
This makes it a little difficult to extract the country name with just a
regular expression, since the country is only one of many things on the page
that sit on a line by themselves in a `<p>`. With jQuery it's easy, but jQuery
is not easy to use from the command line. In the past, I would have used
BeautifulSoup, but Leonard has stopped maintaining BeautifulSoup, so I figured
it was probably time to switch. And I remembered that there's a library called
"PyQuery", which implements much of jQuery's API for Python, so it's easy to
invoke from the command line.
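
For comparison, here is a rough sketch of the same extraction using only the
Python standard library instead of PyQuery or BeautifulSoup. It is not what I
actually ran; it assumes Python 3, the names CountryParser and
countries_from_html are made up for illustration, and it assumes that
university-name cells are never nested inside one another. It shows the state
you have to track by hand to say "a `<p>` inside a university-name cell",
which a CSS selector expresses in one short string.

```python
from html.parser import HTMLParser


class CountryParser(HTMLParser):
    """Collect the text of <p> elements inside <td class="... university-name">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # currently inside a university-name <td>
        self.in_p = False      # currently inside a <p> within that cell
        self.countries = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "university-name" in classes:
            self.in_cell = True
        elif tag == "p" and self.in_cell:
            self.in_p = True
            self.countries.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Only text inside a matching <p> is kept.
        if self.in_p:
            self.countries[-1] += data


def countries_from_html(html):
    """Return the country names found in one page of HTML (hypothetical helper)."""
    parser = CountryParser()
    parser.feed(html)
    parser.close()
    return [c.strip() for c in parser.countries]


SAMPLE = """
<td class="row-odd row-first column-odd university-name">
<div>
<a target="blank"
href="http://www.topuniversities.com/institution/university-cambridge">University
of Cambridge</a>
<p>United Kingdom</p>
</div>
</td>
<p>Not a country</p>
"""

print(countries_from_html(SAMPLE))
```

The trailing `<p>` outside the table cell is ignored, which is exactly the
distinction a plain regular expression for `<p>...</p>` would miss.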
It was a bit tricky to get PyQuery working properly, but after some
experimentation, I ended up with a very simple script:
#!/usr/bin/python
"""Find the countries of the top universities from pages such as
<http://www.usnews.com/education/worlds-best-universities-rankings/top-400-universities-in-the-world?page=1>
"""
import sys

import lxml.html
import pyquery


def query(filename):
    """Parse an HTML file and return a PyQuery object for querying it."""
    return pyquery.PyQuery(lxml.html.document_fromstring(open(filename).read()))


def countries(filename):
    """Yield the text of each <p> inside a .university-name cell."""
    d = query(filename)
    return (p.text for p in d('.university-name p'))


if __name__ == '__main__':
    # Skip sys.argv[0], which is the name of this script, not an input page.
    for filename in sys.argv[1:]:
        for country in countries(filename):
            print(country)
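
The `sort | uniq -c | sort -nr` pipeline could also be folded into Python. The
following is only a sketch of that idea, not part of the script I ran; the
function name rank is made up for illustration, and the sample list stands in
for the real output of the extraction step. Ties are ordered in reverse
alphabetical order, which is how they appear in the results listing.

```python
from collections import Counter


def rank(countries):
    """Return (count, country) pairs, highest count first (hypothetical helper)."""
    counts = Counter(countries)
    # Sorting (count, name) tuples in reverse puts the largest counts first
    # and breaks ties by country name in reverse alphabetical order.
    return sorted(((n, name) for name, n in counts.items()), reverse=True)


# Stand-in data; the real input would be the country names from the pages.
sample = ["United States", "United Kingdom", "United States", "Germany"]

for n, name in rank(sample):
    print("%7d %s" % (n, name))
```

The shell pipeline is still less typing, which is why I used it.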
Oh, and the results:
85 United States
43 United Kingdom
36 Germany
21 Australia
19 France
17 Canada
16 Japan
12 Netherlands
10 South Korea
9 Italy
9 China
8 Switzerland
8 Spain
7 Sweden
7 Finland
7 Belgium
6 Taiwan
6 New Zealand
6 India
6 Hong Kong
5 Russia
5 Ireland
5 Denmark
4 Norway
4 Malaysia
4 Israel
4 Austria
3 Saudi Arabia
3 Brazil
3 Argentina
2 Thailand
2 South Africa
2 Singapore
2 Philippines
2 Mexico
2 Indonesia
2 Chile
1 United Arab Emirates
1 Portugal
1 Poland
1 Oman
1 Lebanon
1 Hungary
1 Greece
1 Czech Republic