Hi Colin,

This seems quite good.

However, a quick question crossed my mind while reading your list of 
big wins below:


Is anything done to aggregate nearly identical infobox properties under 
a common reference name?

Examples of different properties that could appear in an infobox and 
have the same role:

"birthday"
"date_of_birth"


and so on.

So, do you create a distinct property for each string, or do you do some 
processing to normalize them?
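
To make the question concrete, here is a rough Python sketch of the kind 
of normalization I have in mind. The alias table, the canonical name 
"birth_date" and both function names are made up for illustration; I do 
not know how WEX or Freebase actually handle this.

```python
import re

# Hypothetical alias table: cleaned infobox keys mapped to one canonical name.
ALIASES = {
    "birthday": "birth_date",
    "date_of_birth": "birth_date",
    "birthdate": "birth_date",
}


def canonical_property(raw_key: str) -> str:
    """Return a canonical property name for a raw infobox key."""
    # Lowercase, then collapse runs of whitespace, hyphens and underscores
    # into a single underscore so that spelling variants compare equal.
    key = re.sub(r"[\s_\-]+", "_", raw_key.strip().lower())
    # Fall back to the cleaned key when no alias is known.
    return ALIASES.get(key, key)


def normalize_infobox(infobox: dict) -> dict:
    """Group values whose keys map to the same canonical property."""
    merged = {}
    for raw_key, value in infobox.items():
        merged.setdefault(canonical_property(raw_key), []).append(value)
    return merged
```

With something like this, "birthday" and "date_of_birth" would end up 
under one property, while keys without a known alias stay distinct.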



Thanks,


Take care,


Fred
>  but a processed form of Wikipedia that makes it much 
> easier for anyone to do data mining of Wikipedia.  We've built a lot of 
> information extraction and data mining tools over the WEX corpus, and we 
> thought that it might be useful to other folks who are working on 
> similar problems. 
>
> The big wins that we've seen in using WEX for data mining and 
> information extraction are:
>
> - XML formatting instead of MediaWiki markup of articles -- this makes 
> writing scraper scripts easy, as a lot of the MediaWiki markup is 
> gnarly, especially tables and templates. Regular and valid XML 
> formatting means that you don't have to write a complicated parser or 
> ugly regular expressions.
>
> - RDBMS format means that you can plug it into Postgres and start 
> cranking out queries. Also, Postgres 8.3 has built-in XPath support, so 
> you can query the XML articles using just the database now. In practice, 
> this is a very fast way to start mining Wikipedia.
>
> - Bi-monthly releases -- Currently, you will have a hard time getting 
> more frequent releases of Wikipedia without scraping the website.  
> Metaweb pays for a live update feed from Wikipedia.
>
> - Reconciliation with Freebase.com -- Want to know which Wikipedia 
> articles are about US Presidents or which articles are about 
> Oscar-winning movies? Look up the GUIDs on Freebase, and then join them 
> against the article table. It is that easy.
>
> As for open sourcing other projects at Metaweb, we're interested in 
> giving away as much as we can.  We're serious about our commitment to 
> open source and open data. 
>
> Right now, our extraction frameworks are fairly ad-hoc and run 
> comfortably in our internal infrastructure, so they wouldn't make much 
> sense or be very useful to the outside community.  We will be able to 
> package up and give away more of our work as we move forward, though.
>
> If you've got more questions about WEX or other projects at 
> Freebase.com, please drop by our developer list and ask away: 
> http://lists.freebase.com/mailman/listinfo/developers
>
> Thanks!
> Colin Evans
>
>
>
> Georgi Kobilarov wrote:
>   
>> Hi all,
>>
>> Freebase now provides dumps of their data extracted from Wikipedia. See
>> [1] [2]. Interesting stuff. It is nice to see that Metaweb follows the
>> ideas of DBpedia ;)
>>
>> @Metaweb: it's time to open source your extraction framework as well. (I
>> know you read this :) 
>>
>>
>> Cheers,
>> Georgi
>>
>>
>> [1] http://blog.freebase.com/?p=108
>> [2] http://download.freebase.com/wex/
>>
>> -------------------------------------------------------------------------
>> This SF.net email is sponsored by: Microsoft
>> Defy all challenges. Microsoft(R) Visual Studio 2008.
>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>   
>>     
>
>

