Re: [Dbpedia-discussion] Freebase provides data dumps

Colin Evans Wed, 20 Feb 2008 13:03:56 -0800

Hi Fred,
The WEX corpus tries to make available the raw original data in 
relational form -- normalizing properties is a reconciliation step that 
we assume folks will do using the WEX corpus for their own information 
extraction projects.  The goal of WEX is to make it easier to get an 
information extraction project up and running.
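
To illustrate the kind of reconciliation step meant here, the following is a minimal Python sketch; it is not part of WEX. The alias table, the canonical name date_of_birth, and the function normalize_property are all hypothetical. The two variant spellings are the ones quoted from Fred's message below.

```python
# Hypothetical alias table: maps raw infobox property names to a canonical
# name chosen by whoever runs the extraction. Not part of WEX.
PROPERTY_ALIASES = {
    "birthday": "date_of_birth",
    "data_of_birth": "date_of_birth",  # spelling as quoted in the thread
}


def normalize_property(raw_name: str) -> str:
    """Return the canonical name for a raw infobox property name.

    The name is stripped, lower-cased, and has spaces turned into
    underscores before lookup; unknown names come back in that cleaned form.
    """
    cleaned = raw_name.strip().lower().replace(" ", "_")
    return PROPERTY_ALIASES.get(cleaned, cleaned)
```

A real project would likely need a much larger table, or fuzzy matching, since infobox authors are free to invent property names.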


Here's an example SQL query that traverses the Wikipedia template 
structure to find Abraham Lincoln's spouse:

SELECT template_values.xml FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Officeholder'
  AND template_values.name = 'spouse'
  AND articles.name = 'Abraham Lincoln';

http://download.freebase.com/wex/doc/#example_usage
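
For anyone who wants to try the shape of this query without loading the full corpus, here is a small Python sketch using the standard-library sqlite3 module rather than Postgres. The column layout is an assumption inferred only from the names used in the query above, not taken from the WEX schema, and the single sample row (including the XML value) is made up for illustration.

```python
import sqlite3

# Assumed minimal schema, inferred from the column names in the query above;
# the real WEX tables may have different or additional columns.
SCHEMA = """
CREATE TABLE articles (wpid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE template_calls (
    id INTEGER PRIMARY KEY,
    article_wpid INTEGER,
    template_article_name TEXT
);
CREATE TABLE template_values (call_id INTEGER, name TEXT, xml TEXT);
"""

# Same joins as the query above, with the three literals parameterized.
QUERY = """
SELECT template_values.xml FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = ?
  AND template_values.name = ?
  AND articles.name = ?
"""


def find_template_value(conn, template, prop, article):
    """Return the xml column of every matching template value."""
    rows = conn.execute(QUERY, (template, prop, article)).fetchall()
    return [row[0] for row in rows]


def build_toy_db():
    """Create an in-memory database holding one made-up illustrative row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO articles VALUES (1, 'Abraham Lincoln')")
    conn.execute(
        "INSERT INTO template_calls VALUES "
        "(10, 1, 'Template:Infobox Officeholder')"
    )
    conn.execute(
        "INSERT INTO template_values VALUES "
        "(10, 'spouse', '<param>Mary Todd Lincoln</param>')"
    )
    conn.commit()
    return conn
```

Calling find_template_value(build_toy_db(), 'Template:Infobox Officeholder', 'spouse', 'Abraham Lincoln') returns the one toy XML value; against the real corpus the same joins would run in Postgres instead.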

WEX isn't a knowledge base; it is a relational and queryable rendering 
of Wikipedia's native structure, suitable for creating a knowledge base 
given some good algorithms.

Thanks
Colin



Frederick Giasson wrote:
> Hi Colin,
>
>
> This seems quite good.
>
> However, a quick question crossed my mind while reading your list of 
> wins below:
>
>
> Is something done to aggregate similar infobox properties under a 
> common reference name?
>
> Examples of different properties that could appear in an infobox but 
> have the same role:
>
> "birthday"
> "data_of_birth"
>
>
> and so on.
>
> So, do you create distinct properties for each string, or do you do some 
> processing to normalize them?
>
>
>
> Thanks,
>
>
> Take care,
>
>
> Fred
>>  but a processed form of Wikipedia that makes it much easier for 
>> anyone to do data mining of Wikipedia.  We've built a lot of 
>> information extraction and data mining tools over the WEX corpus, and 
>> we thought that it might be useful to other folks who are working on 
>> similar problems.
>> The big wins that we've seen in using WEX for data mining and 
>> information extraction are:
>>
>> - XML formatting instead of MediaWiki markup of articles -- this 
>> makes writing scraper scripts easy, as a lot of the MediaWiki markup 
>> is gnarly, especially tables and templates. Regular and valid XML 
>> formatting means that you don't have to write a complicated parser or 
>> ugly regular expressions.
>>
>> - RDBMS format means that you can plug it into Postgres and start 
>> cranking out queries. Also, Postgres 8.3 has built-in XPath support, 
>> so you can query the XML articles using just the database now. In 
>> practice, this is a very fast way to start mining Wikipedia.
>>
>> - Bi-monthly releases -- Currently, you will have a hard time getting 
>> more frequent releases of Wikipedia without scraping the website.  
>> Metaweb pays for a live update feed from Wikipedia.
>>
>> - Reconciliation with Freebase.com -- Want to know which Wikipedia 
>> articles are about US Presidents or which articles are about 
>> Oscar-winning movies? Look up the guids on freebase, and then join 
>> them against the article table. It is that easy.
>>
>> As for open sourcing other projects at Metaweb, we're interested in 
>> giving away as much as we can.  We're serious about our commitment to 
>> open source and open data.
>> Right now, our extraction frameworks are fairly ad-hoc and run 
>> comfortably in our internal infrastructure, so they wouldn't make 
>> much sense or be very useful to the outside community.  We will be 
>> able to package up and give away more of our work as we move forward, 
>> though.
>>
>> If you've got more questions about WEX or other projects at 
>> Freebase.com, please drop by our developer list and ask away: 
>> http://lists.freebase.com/mailman/listinfo/developers
>>
>> Thanks!
>> Colin Evans
>>
>>
>>
>> Georgi Kobilarov wrote:
>>  
>>> Hi all,
>>>
>>> Freebase now provides dumps of their data extracted from Wikipedia. See
>>> [1] [2]. Interesting stuff. It is nice to see that Metaweb follows the
>>> ideas of DBpedia ;)
>>>
>>> @Metaweb: it's time to open source your extraction framework as 
>>> well. (I
>>> know you read this :)
>>>
>>> Cheers,
>>> Georgi
>>>
>>>
>>> [1] http://blog.freebase.com/?p=108
>>> [2] http://download.freebase.com/wex/
>>>
>>> ------------------------------------------------------------------------- 
>>>
>>> This SF.net email is sponsored by: Microsoft
>>> Defy all challenges. Microsoft(R) Visual Studio 2008.
>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>>> _______________________________________________
>>> Dbpedia-discussion mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>>       


