Hi Colin,

This seems quite good. However, a quick question crossed my mind while reading your list of wins below: is anything done to aggregate nearly identical infobox properties under a common reference name?

For example, different properties that play the same role can appear in an infobox: "birthday", "date_of_birth", and so on. So, do you create a distinct property for each string, or do you do some processing to normalize them?

Thanks,

Take care,

Fred

> but a processed form of Wikipedia that makes it much
> easier for anyone to do data mining of Wikipedia. We've built a lot of
> information extraction and data mining tools over the WEX corpus, and we
> thought that it might be useful to other folks who are working on
> similar problems.
>
> The big wins that we've seen in using WEX for data mining and
> information extraction are:
>
> - XML formatting instead of MediaWiki markup of articles -- this makes
> writing scraper scripts easy, as a lot of the MediaWiki markup is
> gnarly, especially tables and templates. Regular and valid XML
> formatting means that you don't have to write a complicated parser or
> ugly regular expressions.
>
> - RDBMS format means that you can plug it into Postgres and start
> cranking out queries. Also, Postgres 8.3 has built-in XPath support, so
> you can query the XML articles using just the database now. In practice,
> this is a very fast way to start mining Wikipedia.
>
> - Bi-monthly releases -- Currently, you will have a hard time getting
> more frequent releases of Wikipedia without scraping the website.
> Metaweb pays for a live update feed from Wikipedia.
>
> - Reconciliation with Freebase.com -- Want to know which Wikipedia
> articles are about US Presidents or which articles are about
> Oscar-winning movies? Look up the guids on freebase, and then join them
> against the article table. It is that easy.
>
> As for open sourcing other projects at Metaweb, we're interested in
> giving away as much as we can.
> We're serious about our commitment to
> open source and open data.
>
> Right now, our extraction frameworks are fairly ad-hoc and run
> comfortably in our internal infrastructure, so they wouldn't make much
> sense or be very useful to the outside community. We will be able to
> package up and give away more of our work as we move forward, though.
>
> If you've got more questions about WEX or other projects at
> Freebase.com, please drop by our developer list and ask away:
> http://lists.freebase.com/mailman/listinfo/developers
>
> Thanks!
> Colin Evans
>
> Georgi Kobilarov wrote:
>
>> Hi all,
>>
>> Freebase now provides dumps of their data extracted from Wikipedia. See
>> [1] [2]. Interesting stuff. It is nice to see that Metaweb follows the
>> ideas of DBpedia ;)
>>
>> @Metaweb: it's time to open source your extraction framework as well. (I
>> know you read this :)
>>
>> Cheers,
>> Georgi
>>
>> [1] http://blog.freebase.com/?p=108
>> [2] http://download.freebase.com/wex/
>>
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by: Microsoft
>> Defy all challenges. Microsoft(R) Visual Studio 2008.
>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
