This little Perl script outputs a TSV file containing the language
codes and names of the various Wikipedias.  It depends on
HTML::TableExtract and libwww-perl, both of which are free software
available from CPAN.

I used Perl for this rather than Python because I didn't know of an
equivalent to HTML::TableExtract in Python.  It turns out that back in
2003 someone wrote TableParse, which is intended to be a Python
equivalent of HTML::TableExtract, but I haven't tried it.

Like everything else posted to kragen-hacks without any notice to the
contrary, this program is in the public domain; I abandon any
copyright in it.

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;
use LWP::Simple qw(get);

# Use a local copy of the page if there is one; otherwise fetch it.
my $html;
my $fn = '/home/kragen/docs/List_of_Wikipedias';
if (open my $file, '<', $fn) {
  $html = do { local $/; <$file> };
} else {
  my $url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias';
  warn "Couldn't open $fn: $! --- trying $url";
  $html = get $url;
  # LWP::Simple's get returns undef on failure
  die "Couldn't fetch $url\n" unless defined $html;
}

my $te = HTML::TableExtract->new( headers => [qw(Language Wiki)] );
# the rest of this is more or less from `perldoc HTML::TableExtract`
$te->parse($html);

foreach my $ts ($te->tables) {
  print "Table (", join(',', $ts->coords), "):\n";
  foreach my $row ($ts->rows) {
    print "\t", join("\t", @$row), "\n";
  }
}
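
The loop above prints each cell exactly as it was extracted.  Cell
text pulled out of HTML may contain embedded newlines or tabs, and a
cell may come back undefined, either of which would spoil the TSV.
Here is a minimal sketch of a cleanup helper, assuming that collapsing
whitespace is acceptable for this data; the name tsv_line is made up
for illustration and is not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above: turn one
# extracted row into a single TSV line.
sub tsv_line {
  my @cells = @_;                     # copy, so the caller's row is untouched
  for my $cell (@cells) {
    $cell = '' unless defined $cell;  # treat missing cells as empty
    $cell =~ s/\s+/ /g;               # tabs and newlines become one space
    $cell =~ s/^ //;                  # trim leading space
    $cell =~ s/ $//;                  # trim trailing space
  }
  return join("\t", @cells) . "\n";
}

print tsv_line("en", " English\n", undef);
```

In the script, the inner loop could then call print tsv_line(@$row)
in place of the print with join, if cleaner output is wanted.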
