PRESS RELEASE

Paul Houle,  Ontology2 founder,  stated that "we updated Infovore to accept
data from DBpedia,  and ran a head to head test,  in terms of RDF validity,
between Freebase and DBpedia Live."

"Unlike most scientific results",  he said,  "these results are repeatable,
because you can reproduce them yourself with Infovore 1.1.  I encourage you
to use this tool to put other RDF data sets,  large and small,  to the test."

The tool parallelSuperEyeball was run against both the 2013-03-31 Freebase
RDF dump and the 2012-04-30 edition of DBpedia Live.

Although Freebase asserts roughly 1.2 billion facts,  Infovore rejects roughly
200 million useless facts in pre-filtering.  Downstream of that we found 
944,909,025 valid facts and 66,781,906 invalid facts,  in addition to
5 especially malformed facts.

This is a serious regression compared to the 2013-01-27 RDF dump,  in which
only about 13 million invalid triples were discovered.  The main cause of the
increase is the introduction of 40 million or so "triples" lacking an object
connected with the predicate ns:common.topic.notable_for.  Previously,
the bulk of the invalid triples were incorrectly formatted dates.

The rate of invalid triples in DBpedia Live was found to be orders of magnitude
lower than in Freebase.

Only 8,664 invalid facts were found in DBpedia Live,  compared to 247,557,030
valid facts.  The predominant problem in DBpedia Live turned out to be
nonconformant IRIs that came in from Wikipedia.  This is comparable in
magnitude to the number of facts found invalid in the old Freebase quad dump
in the process of creating :BaseKB Pro.
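
To make the "orders of magnitude" comparison concrete,  the invalid rates can
be worked out from the counts reported above.  This is only a sketch:  it
assumes the rate is defined as invalid / (valid + invalid) over the facts that
survive pre-filtering,  ignores the 5 especially malformed Freebase facts,  and
uses a helper name (invalid_rate) that is ours,  not part of Infovore.

```python
import math


def invalid_rate(valid, invalid):
    """Fraction of post-filter facts that failed validation (assumed definition)."""
    return invalid / (valid + invalid)


# Counts as reported in this release
freebase_rate = invalid_rate(944_909_025, 66_781_906)
dbpedia_rate = invalid_rate(247_557_030, 8_664)

# How many times higher the Freebase rate is than the DBpedia Live rate
ratio = freebase_rate / dbpedia_rate

print(f"Freebase invalid rate:     {freebase_rate:.2%}")
print(f"DBpedia Live invalid rate: {dbpedia_rate:.4%}")
print(f"Ratio: about {ratio:,.0f}x ({math.log10(ratio):.1f} orders of magnitude)")
```

Under that definition Freebase comes out at roughly 6.6 percent invalid and
DBpedia Live at roughly 0.0035 percent,  a gap of a bit over three orders of
magnitude.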

Just one of the tools included with Infovore,  parallelSuperEyeball is an
industrial strength RDF validator that uses streaming processing and the
Map/Reduce paradigm to attain nearly perfect parallel speedup at many tasks
on common four core computers.  Infovore 1.1 brings many improvements,
including a threefold speedup of parallelSuperEyeball and the new Infovore
shell.
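
For readers unfamiliar with the approach,  the general shape of a streaming,
Map/Reduce style validator can be sketched in a few lines.  This is not
Infovore's code (Infovore is a separate project with its own implementation);
every name below is hypothetical,  and the well-formedness check is a
deliberately crude stand-in for a real RDF parser.  It only tests that a line
has a subject,  a predicate and an object followed by a terminating period.

```python
from collections import Counter
from functools import reduce
from itertools import islice
from operator import add


def is_well_formed(line):
    """Crude check: three whitespace-separated terms, then a final '.'.

    Illustrative only; a real validator would parse IRIs, literals and dates.
    """
    line = line.strip()
    if not line.endswith("."):
        return False
    # Drop the final '.', then split into at most subject, predicate, object
    parts = line[:-1].split(None, 2)
    return len(parts) == 3


def chunks(lines, size):
    """Yield successive lists of up to `size` lines without loading all input."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def count_chunk(block):
    """Map step: count valid and invalid lines in one chunk."""
    counts = Counter()
    for line in block:
        if not line.strip():
            continue  # skip blank lines
        counts["valid" if is_well_formed(line) else "invalid"] += 1
    return counts


def validate(lines, chunk_size=10_000, map_fn=map):
    """Reduce step: sum the per-chunk counts produced by the map step."""
    partials = map_fn(count_chunk, chunks(lines, chunk_size))
    return reduce(add, partials, Counter())


sample = [
    "<s1> <p1> <o1> .",
    "<s2> <p2> .",            # no object, like the notable_for "triples"
    '<s3> <p3> "a literal" .',
]
result = validate(sample, chunk_size=2)
print(result["valid"], "valid,", result["invalid"], "invalid")
```

Because each chunk is counted independently,  the map step can be handed to
worker processes by passing a pool's map method as map_fn,  and memory use
stays bounded by the chunk size rather than by the size of the data set.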

Please take a look at our github project at

https://github.com/paulhoule/infovore/wiki

and feel free to fork or star it.  Note that many Infovore data products are
also available at

http://basekb.com/

Because Infovore is memory efficient,  it is possible to use it to handle much
larger data sets than can be kept in a triple store on any given hardware.  The
main limitation in handling large RDF data sets is running out of disk space,
which Infovore can do quickly because it avoids random access I/O.

"We challenge RDF data providers to put their data to the test",  said Paul
Houle,  "Today it's an expectation that people and organizations publish only
valid XML files,  and the publication of parallelSuperEyeball is a step to
a world that speaks valid RDF and that can clean and repair invalid files."

Ontology2 is a privately held company that develops web sites and data products
based on Freebase,  DBpedia,  and other sources.  Contact [email protected]
with questions about Ontology2 products and services.






_______________________________________________
Dbpedia-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
