This little Perl script outputs a TSV file containing the language
codes and names of the various Wikipedias. It depends on
HTML::TableExtract and libwww-perl, both of which are free software
available from CPAN.

I used Perl for this rather than Python because I didn't know of an
equivalent to HTML::TableExtract in Python. It turns out that back in
2003 someone wrote TableParse, intended to be a Python equivalent of
HTML::TableExtract, but I haven't tried it.

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;
use LWP::Simple qw(get);

# Prefer a locally saved copy of the page; fall back to fetching it.
my $html;
my $fn = '/home/kragen/docs/List_of_Wikipedias';
if (open my $file, '<', $fn) {
    $html = do { local $/; <$file> };   # slurp the whole file
    close $file;
} else {
    my $url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias';
    warn "Couldn't open $fn: $! --- trying $url";
    $html = get $url;
    die "Couldn't fetch $url" unless defined $html;   # get returns undef on failure
}

# Select only tables whose header row has "Language" and "Wiki" columns.
my $te = HTML::TableExtract->new( headers => [qw(Language Wiki)] );

# the rest of this is more or less from `perldoc HTML::TableExtract`
$te->parse($html);
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print "\t", join("\t", @$row), "\n";
    }
}
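
One thing the inner loop above does not guard against: a cell that
contains a tab or a newline would break the TSV layout, and a cell that
comes back undefined would trigger a warning under -w. Here is a
minimal sketch of a cleanup step, assuming that collapsing whitespace
inside a cell is acceptable. The name tsv_line and the sample row are
made up for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper, not part of the original script: turn one row of
# cells into a single TSV line.
sub tsv_line {
    my @cells = map {
        my $cell = defined $_ ? $_ : '';   # treat undefined cells as empty
        $cell =~ s/\s+/ /g;                # tabs and newlines become one space
        $cell =~ s/^ //;                   # trim leading space
        $cell =~ s/ $//;                   # trim trailing space
        $cell;
    } @_;
    return join("\t", @cells) . "\n";
}

# Made-up sample row standing in for @$row in the loop above.
print tsv_line("en", " English\n", undef);
```

In the script, the inner print statement could then become something
like print "\t", tsv_line(@$row); if that normalization is wanted.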