In DBpedia we have worked hard to improve overall data quality, and we are
very interested in taking a closer look at the invalid triples.

Please note that the dump-based framework, which was heavily refactored and
improved for the 3.8 release, is not in full sync with the English DBpedia Live.
We already have an improved version of the live framework that is available
for Dutch only (live.nl.ddbpedia.org) and will be deployed for English once
we have tested it a little more.

In that version we already take better care of IRIs because the Dutch
language has many non-ASCII characters.
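
To illustrate the kind of IRI check involved, here is a minimal sketch. It is
not the DBpedia or Infovore code; the helper names are made up for this
example, and the set of forbidden characters is taken from the N-Triples
IRIREF grammar. Non-ASCII letters are accepted, while spaces, quotes, and
similar characters are flagged.

```python
import re

# Characters the N-Triples grammar forbids inside <...> (IRIREF):
# control characters and space (U+0000-U+0020) plus < > " { } | ^ ` \
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

# A leading scheme such as "http:" (assumption: we only accept absolute IRIs).
_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:')


def iri_problems(iri):
    """Return the forbidden characters found in an IRI string (hypothetical helper)."""
    return _FORBIDDEN.findall(iri)


def is_plausible_iri(iri):
    """Cheap syntactic check: has a scheme and contains no forbidden characters.

    Non-ASCII letters such as "ë" pass, because IRIs allow them.
    """
    return bool(_SCHEME.match(iri)) and not iri_problems(iri)
```

For example, `is_plausible_iri("http://nl.dbpedia.org/resource/België")` passes,
while an IRI containing a raw space is rejected.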

Best,
Dimitris



On Wed, Apr 10, 2013 at 3:01 AM, Paul A. Houle <[email protected]> wrote:

>   PRESS RELEASE
>
> Paul Houle,  Ontology2 founder,  stated that "we updated Infovore to accept
> data from DBpedia,  and ran a head to head test,  in terms of RDF validity,
> between Freebase and DBpedia Live."
>
> "Unlike most scientific results",  he said,  "these results are repeatable,
> because you can reproduce them yourself with Infovore 1.1.  I encourage you
> to use this tool to put other RDF data sets,  large and small,  to the
> test."
>
> The tool parallelSuperEyeball was run against both the 2013-03-31 Freebase
> RDF dump and the 2012-04-30 edition of DBpedia Live.
>
> Although Freebase asserts roughly 1.2 billion facts,  Infovore rejects
> roughly 200 million useless facts in pre-filtering.  Downstream of that we
> found 944,909,025 valid facts and 66,781,906 invalid facts,  in addition to
> 5 especially malformed facts.
>
> This is a serious regression compared to the 2013-01-27 RDF dump,  in which
> only about 13 million invalid triples were discovered.  The main cause of
> the increase is the introduction of 40 million or so "triples" lacking an
> object connected with the predicate ns:common.topic.notable_for.
> Previously,  the bulk of the invalid triples were incorrectly formatted
> dates.
>
> The rate of invalid triples in DBpedia Live was found to be orders of
> magnitude lower than in Freebase.
>
> Only 8,664 invalid facts were found in DBpedia Live,  compared to
> 247,557,030 valid facts.  The predominant problem in DBpedia Live turned
> out to be nonconformant IRIs that came in from Wikipedia.  This is
> comparable in magnitude to the number of facts found invalid in the old
> Freebase quad dump in the process of creating :BaseKB Pro.
>
> Just one of the tools included with Infovore,  parallelSuperEyeball is an
> industrial strength RDF validator that uses streaming processing and the
> Map/Reduce paradigm to attain nearly perfect parallel speedup at many tasks
> on common four core computers.  Infovore 1.1 brings many improvements,
> including a threefold speedup of parallelSuperEyeball and the new Infovore
> shell.
>
> Please take a look at our github project at
>
> https://github.com/paulhoule/infovore/wiki
>
> and feel free to fork or star it.  Note that many infovore data products
> are also available at
>
> http://basekb.com/
>
> Because infovore is memory efficient,  it is possible to use it to handle
> much larger data sets than can be kept in a triple store on any given
> hardware.  The main limitation in handling large RDF data sets is running
> out of disk space,  which it can do quickly by avoiding random access I/O.
>
> "We challenge RDF data providers to put their data to the test",  said Paul
> Houle,  "Today it's an expectation that people and organizations publish
> only valid XML files,  and the publication of parallelSuperEyeball is a
> step to a world that speaks valid RDF and that can clean and repair invalid
> files."
>
> Ontology2 is a privately held company that develops web sites and data
> products based on Freebase,  DBpedia,  and other sources.  Contact
> [email protected] with questions about Ontology2 products and services.
>
> ------------------------------------------------------------------------------
> Precog is a next-generation analytics platform capable of advanced
> analytics on semi-structured data. The platform includes APIs for building
> apps and a phenomenal toolset for data science. Developers can use
> our toolset for easy data analysis & visualization. Get a free account!
> http://www2.precog.com/precogplatform/slashdotnewsletter
> _______________________________________________
> Dbpedia-discussion mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>
>


-- 
Dimitris Kontokostas
Department of Computer Science, University of Leipzig
Research Group: http://aksw.org
Homepage:http://aksw.org/DimitrisKontokostas