Hi all, I'm happy to hear that there is some interest in our Wikipedia Extraction project!
To clarify, the Freebase WEX (Wikipedia EXtraction) dumps are not harvested data, but a processed form of Wikipedia that makes it much easier for anyone to do data mining of Wikipedia. We've built a lot of information extraction and data mining tools over the WEX corpus, and we thought that it might be useful to other folks who are working on similar problems.

The big wins that we've seen in using WEX for data mining and information extraction are:

- XML formatting instead of MediaWiki markup of articles -- this makes writing scraper scripts easy, as a lot of the MediaWiki markup is gnarly, especially tables and templates. Regular and valid XML formatting means that you don't have to write a complicated parser or ugly regular expressions.
- RDBMS format means that you can plug it into Postgres and start cranking out queries. Also, Postgres 8.3 has built-in XPath support, so you can query the XML articles using just the database now. In practice, this is a very fast way to start mining Wikipedia.
- Bi-monthly releases -- Currently, you will have a hard time getting more frequent releases of Wikipedia without scraping the website. Metaweb pays for a live update feed from Wikipedia.
- Reconciliation with Freebase.com -- Want to know which Wikipedia articles are about US Presidents or which articles are about Oscar-winning movies? Look up the GUIDs on Freebase, and then join them against the article table. It is that easy.

As for open sourcing other projects at Metaweb, we're interested in giving away as much as we can. We're serious about our commitment to open source and open data. Right now, our extraction frameworks are fairly ad hoc and run comfortably in our internal infrastructure, so they wouldn't make much sense or be very useful to the outside community. We will be able to package up and give away more of our work as we move forward, though.
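
To make the XML, RDBMS, and reconciliation points above a bit more concrete, here is a rough sketch of the "join GUIDs against the article table, then read the XML" workflow. It is only an illustration under stated assumptions: the table names, column names, element names (`link`, `target`), the type label, and the sample rows are all hypothetical and are not the actual WEX schema, and SQLite plus Python's ElementTree stand in for Postgres and its XPath support so that the example runs on its own.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article XML. The element and attribute names are
# illustrative only; they are not taken from the real WEX dumps.
SAMPLE_XML = (
    "<article>"
    "<paragraph>Born in <link target='Example City'>the city</link>, later "
    "elected to <link target='Example Office'>office</link>.</paragraph>"
    "</article>"
)


def build_demo_db():
    """Create an in-memory database with made-up tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, name TEXT, xml TEXT)"
    )
    conn.execute(
        "CREATE TABLE freebase_topics (guid TEXT, article_id INTEGER, type TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?, ?)",
        [
            (1, "Example President", SAMPLE_XML),
            (2, "Example City", "<article><paragraph>A city.</paragraph></article>"),
        ],
    )
    # Placeholder GUID and type label; real values would come from Freebase.
    conn.execute(
        "INSERT INTO freebase_topics VALUES (?, ?, ?)",
        ("guid-placeholder-1", 1, "us_president"),
    )
    return conn


def link_targets(article_xml):
    """Return link targets in document order using a real XML parser."""
    root = ET.fromstring(article_xml)
    return [link.get("target") for link in root.iter("link")]


def links_for_type(conn, topic_type):
    """Join the topic table against articles and extract links per article."""
    rows = conn.execute(
        "SELECT a.name, a.xml FROM articles a "
        "JOIN freebase_topics t ON t.article_id = a.id "
        "WHERE t.type = ? ORDER BY a.name",
        (topic_type,),
    )
    return {name: link_targets(xml) for name, xml in rows}


if __name__ == "__main__":
    demo = build_demo_db()
    print(links_for_type(demo, "us_president"))
```

Because the articles are well-formed XML, the extraction step is a single parser call rather than a hand-written MediaWiki parser. In Postgres 8.3 and later, the extraction could presumably be pushed into the query itself with the built-in `xpath()` function instead of being done client-side.
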
If you've got more questions about WEX or other projects at Freebase.com, please drop by our developer list and ask away: http://lists.freebase.com/mailman/listinfo/developers

Thanks!
Colin Evans

Georgi Kobilarov wrote:
> Hi all,
>
> Freebase now provides dumps of their data extracted from Wikipedia. See
> [1] [2]. Interesting stuff. It is nice to see that Metaweb follows the
> ideas of DBpedia ;)
>
> @Metaweb: it's time to open source your extraction framework as well. (I
> know you read this :)
>
>
> Cheers,
> Georgi
>
>
> [1] http://blog.freebase.com/?p=108
> [2] http://download.freebase.com/wex/
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
