Hi all,
I'm happy to hear that there is some interest in our Wikipedia 
Extraction project!

To clarify, the Freebase WEX (Wikipedia EXtraction) dumps are not 
harvested data, but a processed form of Wikipedia that makes it much 
easier for anyone to do data mining of Wikipedia.  We've built a lot of 
information extraction and data mining tools over the WEX corpus, and we 
thought that it might be useful to other folks who are working on 
similar problems. 

The big wins that we've seen in using WEX for data mining and 
information extraction are:

- XML formatting instead of MediaWiki markup of articles -- this makes 
writing scraper scripts easy, as a lot of the MediaWiki markup is 
gnarly, especially tables and templates. Regular and valid XML 
formatting means that you don't have to write a complicated parser or 
ugly regular expressions.

- RDBMS format -- you can load it into Postgres and start cranking out 
queries. Also, Postgres 8.3 has built-in XPath support, so you can query 
the XML articles using just the database. In practice, this is a very 
fast way to start mining Wikipedia.

- Bi-monthly releases -- Currently, you will have a hard time getting 
more frequent releases of Wikipedia without scraping the website.  
Metaweb pays for a live update feed from Wikipedia.

- Reconciliation with Freebase.com -- Want to know which Wikipedia 
articles are about US Presidents or which articles are about 
Oscar-winning movies? Look up the guids on Freebase, and then join them 
against the article table. It is that easy.
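
To make the XML and join points above concrete, here is a rough sketch in 
Python. It is not actual WEX code: the element names (article, table, row, 
cell), the table names (articles, freebase_types), the column names and the 
sample guids are all made up for illustration, and an in-memory SQLite 
database stands in for Postgres so the example runs on its own.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article XML; the real WEX schema may use different element names.
ARTICLE_XML = """<article>
  <title>Abraham Lincoln</title>
  <table>
    <row><cell>Born</cell><cell>1809</cell></row>
    <row><cell>Died</cell><cell>1865</cell></row>
  </table>
</article>"""


def table_rows(xml_text):
    """Return every table row in the article as a list of cell strings."""
    root = ET.fromstring(xml_text)
    # A simple path expression replaces hand-written wiki-markup parsing.
    return [[cell.text for cell in row.findall("cell")]
            for row in root.findall(".//table/row")]


def build_demo_db():
    """Create an in-memory stand-in for the article table plus a guid lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (guid TEXT PRIMARY KEY, title TEXT, xml TEXT)")
    conn.execute("CREATE TABLE freebase_types (guid TEXT, type_name TEXT)")
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
        ("g1", "Abraham Lincoln", ARTICLE_XML),
        ("g2", "Casablanca (film)",
         "<article><title>Casablanca (film)</title></article>"),
    ])
    conn.executemany("INSERT INTO freebase_types VALUES (?, ?)", [
        ("g1", "us_president"),
        ("g2", "oscar_winning_film"),
    ])
    return conn


def articles_of_type(conn, type_name):
    """Join the (hypothetical) guid-to-type table against the article table."""
    cur = conn.execute(
        "SELECT a.title FROM articles AS a "
        "JOIN freebase_types AS t ON t.guid = a.guid "
        "WHERE t.type_name = ? ORDER BY a.title",
        (type_name,),
    )
    return [title for (title,) in cur]


if __name__ == "__main__":
    conn = build_demo_db()
    print(table_rows(ARTICLE_XML))
    print(articles_of_type(conn, "us_president"))
```

With a real Postgres load you would run the same kind of join there, and 
could push the path expression into the query itself using the built-in 
XPath support instead of parsing in Python.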

As for open sourcing other projects at Metaweb, we're interested in 
giving away as much as we can.  We're serious about our commitment to 
open source and open data. 

Right now, our extraction frameworks are fairly ad hoc and run 
comfortably only inside our internal infrastructure, so they wouldn't 
make much sense or be very useful to the outside community.  We will be 
able to package up and give away more of our work as we move forward, though.

If you've got more questions about WEX or other projects at 
Freebase.com, please drop by our developer list and ask away: 
http://lists.freebase.com/mailman/listinfo/developers

Thanks!
Colin Evans



Georgi Kobilarov wrote:
> Hi all,
>
> Freebase now provides dumps of their data extracted from Wikipedia. See
> [1] [2]. Interesting stuff. It is nice to see that Metaweb follows the
> ideas of DBpedia ;)
>
> @Metaweb: it's time to open source your extraction framework as well. (I
> know you read this :) 
>
>
> Cheers,
> Georgi
>
>
> [1] http://blog.freebase.com/?p=108
> [2] http://download.freebase.com/wex/
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>   

