TODO: Organize these somehow, add one-line blurbs
Organize by usage? (classification, recommendation etc.)
Collections of Collections
Categorization Data
Recommendation Data
Multilingual Data
- http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php - 308,000 subtitle files covering about 18,900 movies in 59 languages (July 2006 numbers). This is a curated collection of subtitles from an aggregation site, http://www.openSubTitles.org.
The original site, OpenSubtitles.org, has since grown to 1.6 million subtitle files.
- Statistical Machine Translation - devoted to all things machine translation. Includes multilingual corpora of European and Canadian legal texts.
Geospatial
- Natural Earth Data
- OpenStreetMap
And other crowd-sourced mapping data sites.
Airline
- OpenFlights - Crowd-sourced database of airlines, flights, airports, times, etc.
- Airline on-time information - 1987-2008 - 120 million CSV records, 12 GB uncompressed
General Resources
- theinfo
- WordNet
- Common Crawl - freely available web crawl on EC2
Stuff