Re: [Dbpedia-discussion] ANN: DBpedia 3.5 released

Anja Jentzsch Wed, 14 Apr 2010 08:07:54 -0700

Hi Nicolas,

the infobox dataset represents infobox properties, which do not cover all
templates occurring in Wikipedia.
Until DBpedia 3.5, the infobox extraction also covered templates like
'cite book', which occur frequently but do not contain information about
the resource that the wiki page describes. In 3.5 we reduced the
extracted information by excluding such non-relevant templates.
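
To see which properties account for the difference, one could count the
triples per predicate in each dump. This is only a minimal sketch: the
function names are hypothetical, the dumps are assumed to have been
downloaded locally, and the line splitting relies on N-Triples subjects and
predicates containing no whitespace.

```python
import bz2
from collections import Counter


def count_predicates(lines):
    """Count triples per predicate in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Subject and predicate never contain whitespace, so splitting
        # twice leaves the object (and the final '.') in the third part.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


def count_predicates_in_dump(path):
    """Stream a bz2-compressed N-Triples file without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_predicates(handle)
```

Subtracting the 3.5 counter from the 3.4 counter and calling
most_common(20) on the result would list the predicates that shrank the
most, since Counter subtraction keeps only positive differences.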


Cheers,
Anja

On Tue, 13 Apr 2010 14:24:00 -0700, Nicolas Torzec <[email protected]>
wrote:
> Dear DBpedia workers,
> First of all, many thanks for this new release :)
> 
> Then, I have a quick question regarding the difference between the
> dbpedia 3.4 raw infobox data set and the dbpedia 3.5 raw infobox data set.
> - http://downloads.dbpedia.org/3.5/en/infobox_properties_en.nt.bz2
> - http://downloads.dbpedia.org/3.4/en/infobox_en.nt.bz2
> 
> Comparing the two, it appears that the dbpedia 3.5 infobox data set (4.7G)
> is actually much smaller than the dbpedia 3.4 infobox data set (5.7G).
>  
> Do you know why the trend is not size increase, but size reduction?
> Did you change anything in the way that raw infobox data sets are
> extracted?
>  
>  
> Cheers,
> Nicolas.
> 
> --
> Nicolas Torzec
> Yahoo! Labs.
> 
> 
>  
> 
>> 
>> On 4/12/10 2:06 AM, "Chris Bizer" <[email protected]> wrote:
>> 
>>> Hi all,
>>> 
>>> we are happy to announce the release of DBpedia 3.5.
>>> 
>>> The new release is based on Wikipedia dumps dating from March 2010.
>>> Compared to the 3.4 release, we were able to increase the quality of the
>>> DBpedia knowledge base by employing a new data extraction framework which
>>> applies various data cleansing heuristics as well as by extending the
>>> infobox-to-ontology mappings that guide the data extraction process.
>>> 
>>> The new DBpedia knowledge base describes more than 3.4 million things,
>>> out of which 1.47 million are classified in a consistent ontology,
>>> including 312,000 persons, 413,000 places, 94,000 music albums, 49,000
>>> films, 15,000 video games, 140,000 organizations, 146,000 species and
>>> 4,600 diseases. The DBpedia data set features labels and abstracts for
>>> these 3.2 million things in up to 92 different languages; 1,460,000 links
>>> to images and 5,543,000 links to external web pages; 4,887,000 external
>>> links into other RDF datasets, 565,000 Wikipedia categories, and 75,000
>>> YAGO categories. The DBpedia knowledge base altogether consists of over
>>> 1 billion pieces of information (RDF triples) out of which 257 million
>>> were extracted from the English edition of Wikipedia and 766 million were
>>> extracted from other language editions.
>>> 
>>> The new release provides the following improvements and changes compared
>>> to the DBpedia 3.4 release:
>>> 
>>> 1. The DBpedia extraction framework has been completely rewritten in
>>> Scala. The new framework dramatically reduces the extraction time of a
>>> single Wikipedia article from over 200 to about 13 milliseconds. All
>>> features of the previous PHP framework have been ported. In addition, the
>>> new framework can extract data from Wikipedia tables based on
>>> table-to-ontology mappings and is able to extract multiple infoboxes out
>>> of a single Wikipedia article. The data from each infobox is represented
>>> as a separate RDF resource. All resources that are extracted from a
>>> single page can be connected using custom RDF properties which are also
>>> defined in the mappings. A lot of work also went into the value parsers
>>> and the DBpedia 3.5 dataset should therefore be much cleaner than its
>>> predecessors. In addition, units of measurement are normalized to their
>>> respective SI unit, which makes querying DBpedia easier.
>>> 
>>> 2. The mapping language that is used to map Wikipedia infoboxes to the
>>> DBpedia Ontology has been redesigned. The documentation of the new
>>> mapping language is found at
>>> http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/trunk/extraction/core/doc/mapping%20language/
>>> 
>>> 3. In order to enable the DBpedia user community to extend and refine
>>> the infobox-to-ontology mappings, the mappings can be edited on the newly
>>> created wiki hosted on http://mappings.dbpedia.org. At the moment, 303
>>> template mappings are defined, which cover (including redirects) 1055
>>> templates. On the wiki, the DBpedia Ontology can be edited by the
>>> community as well. At the moment, the ontology consists of 259 classes
>>> and about 1,200 properties.
>>>  
>>> 4. The ontology properties extracted from infoboxes are now split into
>>> two data sets: 1. The Ontology Infobox Properties dataset contains the
>>> properties as they are defined in the ontology (e.g. length). The range
>>> of a property is either an xsd schema type or a dimension of measurement,
>>> in which case the value is normalized to the respective SI unit. 2. The
>>> Ontology Infobox Properties (Specific) dataset contains properties which
>>> have been specialized for a specific class using a specific unit. For
>>> example, the property height is specialized on the class Person using
>>> the unit centimeters instead of meters. For further details please refer
>>> to http://wiki.dbpedia.org/Datasets#h18-11.
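
The relationship between the two datasets in item 4 can be illustrated with
a small sketch. The conversion table and function names are hypothetical
and are not the framework's API; only the example of a Person's height
being given in centimeters instead of meters comes from the announcement.

```python
# Conversion factors from a length unit to the SI base unit (metre).
# The set of units listed here is illustrative only.
TO_METRES = {"m": 1.0, "cm": 0.01, "km": 1000.0, "ft": 0.3048}


def to_si_length(value, unit):
    """Normalize a length to metres, as in the generic dataset."""
    return value * TO_METRES[unit]


def to_specific_unit(si_value, unit):
    """Express a length given in metres in a class-specific unit."""
    return si_value / TO_METRES[unit]
```

Under these assumptions, a height written as 180 cm in an infobox would be
stored as about 1.8 (meters) in the generic dataset and as about 180
(centimeters) in the class-specific one.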
>>>  
>>> 5. The framework now resolves template redirects, making it possible to
>>> cover all redirects to an infobox on Wikipedia with a single mapping.
>>> 
>>> 6. Three new extractors have been implemented: 1. PageIdExtractor,
>>> extracting the Wikipedia page ID of each page. 2. RevisionExtractor,
>>> extracting the latest revision of a page. 3. PNDExtractor, extracting
>>> PND (Personennamendatei) identifiers.
>>> 
>>> 7. The data set now provides labels, abstracts, page links and infobox
>>> data in 92 different languages, which have been extracted from recent
>>> Wikipedia dumps as of March 2010.
>>> 
>>> 8. In addition to the N-Triples datasets, N-Quads datasets are provided
>>> which include a provenance URI for each statement. The provenance URI
>>> denotes the origin of the extracted triple in Wikipedia (for details see
>>> http://wiki.dbpedia.org/Datasets#h18-18).
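
Reading the provenance URI from such a file could look roughly like the
sketch below. It is not a full N-Quads parser: the function name is
hypothetical, and it assumes every statement carries a fourth term that is
an IRI in angle brackets. Lines without one are rejected.

```python
def split_quad(line):
    """Split one N-Quads line into (subject, predicate, object, provenance)."""
    body = line.strip()
    if not body.endswith("."):
        raise ValueError("statement must end with '.'")
    body = body[:-1].rstrip()
    # The fourth term is the last whitespace-separated token; literals may
    # contain spaces, so split from the right first.
    statement, provenance = body.rsplit(None, 1)
    if not (provenance.startswith("<") and provenance.endswith(">")):
        raise ValueError("expected an IRI as the fourth term")
    # Subject and predicate contain no whitespace; the rest is the object.
    subject, predicate, obj = statement.split(None, 2)
    return subject, predicate, obj, provenance[1:-1]
```

Splitting from the right keeps literals that contain spaces intact, at the
cost of refusing plain triples that have no provenance term.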
>>> 
>>> You can download the new DBpedia dataset from
>>> http://wiki.dbpedia.org/Downloads35. As usual, the data set is also
>>> available as Linked Data and via the DBpedia SPARQL endpoint.
>>> 
>>> Lots of thanks to:
>>> 
>>> * Robert Isele, Anja Jentzsch, Christopher Sahnwaldt, and Paul Kreis
>>> (all Freie Universität Berlin) for reimplementing the DBpedia extraction
>>> framework in Scala, for extending the infobox-to-ontology mappings and
>>> for extracting the new DBpedia 3.5 knowledge base.
>>> 
>>> * Jens Lehmann and Sören Auer (both Universität Leipzig) for providing
>>> the knowledge base via the DBpedia download server at Universität
>>> Leipzig.
>>> 
>>> * Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading
>>> the knowledge base into the Virtuoso instance that serves the Linked
>>> Data view and SPARQL endpoint.
>>> 
>>> The whole DBpedia team is very thankful to three companies which enabled
>>> us to do all this by supporting and sponsoring the DBpedia project:
>>> 
>>> * Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based
>>> company offering leading technologies in the area of Web search, social
>>> media and mobile applications.
>>> 
>>> * Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan
>>> Inc. creates and advances a variety of world-class endeavors and high
>>> impact initiatives that change and improve the way we live, learn, do
>>> business (http://www.vulcan.com/).
>>> 
>>> * OpenLink Software (http://www.openlinksw.com/). OpenLink Software
>>> develops the Virtuoso Universal Server, an innovative enterprise grade
>>> server that cost-effectively delivers an unrivaled platform for Data
>>> Access, Integration and Management.
>>> 
>>> More information about DBpedia is found at http://dbpedia.org/About
>>> 
>>> Have fun with the new DBpedia knowledge base!
>>> 
>>> Cheers, 
>>> 
>>> Chris Bizer
>>> 
>>> 
>>> --
>>> Prof. Dr. Christian Bizer
>>> Web-based Systems Group
>>> Freie Universität Berlin
>>> +49 30 838 55509
>>> http://www.bizer.de
>>> [email protected]
>>> 
>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> Download Intel® Parallel Studio Eval
>>> Try the new software tools for yourself. Speed compiling, find bugs
>>> proactively, and fine-tune applications for parallel performance.
>>> See why Intel Parallel Studio got high marks during beta.
>>> http://p.sf.net/sfu/intel-sw-dev
>>> _______________________________________________
>>> Dbpedia-discussion mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion

