Status of Freebase in RDF in Q4 2012
I'd like to share the status of my :BaseKB effort to convert Freebase to RDF,
its future, and how it relates to other efforts. (See http://basekb.com/)
This is a long letter, but the takeaway is that I’m looking to put together
some sort of a group to advance the development of :BaseKB, a product that
converts Freebase data to logically sound RDF. If this interests you, keep
reading.
:BaseKB is the first and only complete and correct conversion of Freebase data
to RDF. :BaseKB is possible because there is no fundamental philosophical
difference between the Freebase data model and the RDF data model. Freebase
spent $20 million or so developing graphd and the proprietary MQL language
because they got started early, when SPARQL didn’t exist and when there wasn’t
the vibrant competition between triple and quad store implementations that
exists today. As a result of this early start, Freebase remains the world’s
leading data wiki by a large margin.
For a long time, Freebase has made it possible to retrieve a limited fraction
of its information in RDF format at rdf.freebase.com; for the most part, this
service lets you retrieve triples sharing a specific subject, as well as a
number of “RDF molecules” involving CVT and mediator types.
The official RDF version of Freebase is of limited use for two main reasons:
first, it has never been possible to query it with SPARQL; second, the
practical limitations of running a public API mean that Freebase can only
publish a limited number of triples for any given subject. Although only a
small fraction of concepts in Freebase are affected by this limit, these highly
connected nodes play a critical role in the graph. Any ‘typical’ graph
algorithm will traverse these nodes and thus give unsound results. (To be
fair, this is a general problem with the ‘dereferencing’ idea in Linked Data
and not the fault of Freebase.)
Last April I released the early access version of :BaseKB, the first complete
and correct conversion of Freebase to RDF. :BaseKB was supplied as a set of
n-Triples files that could be loaded into a triple store and queried with
SPARQL 1.1.
This early access release contained a subset of information from Freebase,
including all schema objects, all concepts that exist in Wikipedia, and all
CVTs that interconnect these concepts. :BaseKB was made available for free
under a CC-BY license, just like Freebase.
:BaseKB was designed to be both competitive with and complementary to
DBpedia. Anyone trying to do projects with DBpedia will discover that
intensive data cleaning is usually necessary to get correct query answers.
Freebase’s mode of operation promises better curation than DBpedia and this
translates into more correct answers with less data cleaning.
Soon after, I announced the first release of :BaseKB Pro, a commercial product
comprising all facts from Freebase, updated weekly on a subscription basis.
:BaseKB Pro was not a commercial success. I didn’t sell a single license.
Around the time this project was launched, I accepted a really great job
offer, so I’ve had limited time to work on :BaseKB and related projects.
Not long after :BaseKB was announced, Google announced plans to publish an
official RDF dump for Freebase in Summer 2012. This was a welcome development;
however, it is also one reason why :BaseKB development was on the back burner
this summer.
During this time I’ve been happy to hear about certain work on reification at
Freebase, and I’ve also seen that DBpedia and Wikidata are both evolving in
the right direction.
As of October, no RDF dump has been published by Freebase, and the information
I have leads me to believe that we can’t count on Google to provide a workable
RDF dump of Freebase. Thus, I’m beginning to reassess the competitive
landscape and to reposition :BaseKB.
I haven’t followed the discussion list closely in the past few months, but I
did see a report that a Google engineer had great difficulty loading a Freebase
dump into a triple store and concluded this wasn’t practical to do without an
exotic and expensive computer with more than 64 GB of RAM.
I don’t know what data set was used, or what tools. I do know that I can
easily load both :BaseKB and :BaseKB Pro on a Lenovo W520 laptop with 32 GB of
RAM (with memory bought from Crucial for much less than the OEM price). I
demonstrated
queries against :BaseKB Pro to individuals I met at the Semantic Technology
Conference in San Francisco this July.
Maybe it’s just hard to find good help these days.
It takes me 1 hour to load :BaseKB into OpenLink Virtuoso on an older computer
with 24 GB of RAM. I know the Franz people have loaded :BaseKB into
AllegroGraph and that others have loaded it into Bigdata. Unlike many popular
RDF data sets, :BaseKB passes a test suite that includes a streaming version
of Jena ‘eyeball’ for a high degree of compatibility with triple stores and
other RDF tools.
:BaseKB achieves considerable compression relative to the quad dump and the
(unusable) Linked Data representation. Nearly 8% of the quad dump
consists of statements to the effect that all but a handful of objects are
world writable. The Linked Data representation often uses two triples to
represent what :BaseKB does in one, which results in hundreds of millions of
triples of harmful overhead. Performance and compatibility were baked
into :BaseKB in the earliest stages of development.
I’d like to say that I had some special insight into the problem of converting
Freebase data, but no, like a certain Winston Churchill quote, I discovered
the correct way to do it only after exhausting all of the other alternatives.
I’m quite fortunate that I had some time where I could avoid the usual
distractions involved with software projects and work out the math.
One problem that bothered me early on was that I needed information from the
schema to assign types to triples; if the schema told me that the object of a
certain predicate was always an integer, I could use that to generate a triple
with an integer-typed object (one that would sort, for instance, like an
integer).
This process involved joining the predicate field of a data quad with the
subject field of a quad in the schema. However, a predicate like
“/people/person/date_of_birth”
shows up as “/m/04m1” in the subject field. This seemed to be a “chicken and
egg” problem until I realized that, very simply, queries against Freebase
work correctly when mid identifiers are used as primary keys.
This has the disadvantage that the predicates are no longer human readable;
you have to write something like
?person fbase:m.04m1 ?date .
in your SPARQL queries. Once I recognized that converting the names in the
dump to mids was the real problem, everything else was downhill. When you
treat mids as primary keys, SPARQL queries give logically sound results
against Freebase.
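To make that concrete, here is a minimal sketch of such a query; the prefix
binding is an assumption about how your local copy of :BaseKB is loaded, but
the mid predicate is the same one shown above:

    # Minimal sketch; the fbase: binding is an assumption about your local
    # load of :BaseKB, and fbase:m.04m1 stands in for
    # /people/person/date_of_birth as in the example above.
    PREFIX fbase: <http://rdf.freebase.com/ns/>
    SELECT ?person ?date
    WHERE { ?person fbase:m.04m1 ?date . }
    LIMIT 10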
The disadvantage of this, of course, is that queries are harder to write, but
this is overcome by the basekb tools
http://code.google.com/p/basekb-tools/
which rewrite names that appear in queries in the same way that the MQL query
engine does. You can join other data sets to :BaseKB by grounding (smooshing)
them through the tools. Alternatively, you can write OWL and RDFS statements
that map Freebase predicates to well-known vocabularies like foaf.
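As a rough illustration of that second option (the target property and the
prefix bindings are my assumptions, nothing shipped with :BaseKB), a single
SPARQL 1.1 Update statement is enough to declare such a mapping:

    # Hedged sketch: declare the mid predicate for
    # /people/person/date_of_birth to be a sub-property of a well-known
    # property, so queries written against that vocabulary also match
    # :BaseKB data under RDFS inference. The choice of dbo:birthDate is
    # only an example.
    PREFIX fbase: <http://rdf.freebase.com/ns/>
    PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo:   <http://dbpedia.org/ontology/>
    INSERT DATA { fbase:m.04m1 rdfs:subPropertyOf dbo:birthDate . }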
I think people have found this answer distasteful, so they’ve often tried to
substitute “human readable” identifiers for the mids. Perhaps somebody really
can make that work, but it’s a harder problem than it seems at first glance.
For instance, important predicates have more than one name and you have to
support all of them. You might think owl:sameAs would help here but it
doesn’t, not with the standard interpretation.
It’s probably not hard to make something that’s almost right but the QA work in
making something “half-baked but good enough” is often vastly greater than that
of making something perfect. I had to get the job done with a one-man army
corps, so I used the simplest possible correct answer.
I don’t know it for a fact, but I think it is quite possible that the
development of a Freebase RDF dump inside of Google may have taken a wrong turn
somewhere.
I put development of :BaseKB on hold in July for several reasons, one of which
is that I failed to sell any subscriptions for the :BaseKB Pro product. Around
this time I also received a great job offer, which I subsequently accepted, so
I have had less time to work on projects of this sort.
:BaseKB was clearly ahead of the market last July, but I think it’s time to
develop partnerships that will keep it relevant.
Planned monthly releases of the :BaseKB product did not materialize because I
haven’t had time to diagnose problems in the conversion process. For instance,
one quad dump that I downloaded has a single quad in shard 13 that throws an
exception in one processing stage, which causes the system to abort.
Fixing this is a matter of downloading a recent quad dump and running it in the
debugger; almost certainly it’s a very small problem. Somebody who wasn’t so
obsessive about data quality would probably be happy to just eat the exception
and lose the quad.
Similarly, I understand changes have been made in how descriptions are
implemented in Freebase, and this may allow the system to be simplified; it
previously required several processing stages to merge in descriptions from
the simple topic dump, as well as a web crawler to fetch descriptions for the
(occasionally) documented schema objects. Now things should be simpler.
I’m not able to maintain :BaseKB on a week by week basis, so I’m looking to
the community for help. I’m working on a plan to put :BaseKB in the hands of
people who can use it and I’m considering options such as licensing the
technology behind it or donating it to an Open Source project. To do either
I’ll want to have a credible plan to make :BaseKB sustainable. Please write me
at [email protected] if you are interested.
I’ll talk a bit about the software that creates :BaseKB.
It all revolves around a framework called “Infovore”, an RDF-centric
Map/Reduce framework that runs in multiple threads on a single computer. The
framework, at the moment, is designed for high efficiency at processing
Freebase-scale data. Unlike triple-store based systems, Infovore’s streaming
processing has minimal memory requirements. In fact, the most economical
environment for running Infovore in AWS is a c1.medium instance with just 1.7
GB of RAM. It completes roughly 18 stages of processing in about 12 hours on a
c1.medium to convert the contents of Freebase to correct RDF.
I get better performance on my personal workstation, but operation and
development of Infovore are very practical on the underpowered laptops that
software developers so often seem to be stuck with.
Infovore is written in Java and uses the Jena framework; reducers collect
groups of statements together into models, upon which data transformations can
be specified using SPARQL 1.1.
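To give a flavor of what such a transformation looks like (this is a sketch of
the idea, not actual Infovore code), a reducer holding all the statements
about one subject can run a CONSTRUCT query like this over its model:

    # Sketch only: attach the datatype the schema calls for to the raw
    # date-of-birth values in the current model. STRDT builds a typed
    # literal from the plain lexical form.
    PREFIX fbase: <http://rdf.freebase.com/ns/>
    PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
    CONSTRUCT { ?s fbase:m.04m1 ?typed }
    WHERE {
      ?s fbase:m.04m1 ?raw .
      BIND(STRDT(STR(?raw), xsd:date) AS ?typed)
    }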
Lately I’ve been working with much larger data sets in Hadoop and studying the
Map/Reduce model used there, and it seems very likely that the system could be
made more scalable and faster in wall-clock time (at some increase in hardware
cost per quad) by porting it to Hadoop. I’ve also considered adding support
for Sesame and OWLIM-Lite, which may give better performance and inference
abilities.
Infovore also contains a system for high-speed mapping of “human readable”
Freebase identifiers to mid identifiers and a system for applying
space-efficient in-memory graph algorithms to do tasks such as correct pruning
of the complete :BaseKB Pro into the much more usable :BaseKB.
(One of the many contradictions in my business plan was that :BaseKB is a more
commercially usable product than :BaseKB Pro. :BaseKB takes advantage of the
policies of Wikipedia that lead to a much closer mapping between concepts in
the system’s point of view and concepts in the minds of end users than exists
in Freebase as a whole.)
The pruning algorithm is capable of creating other kinds of consistent subsets;
if you want a database of concepts connected with professional wrestling or
things that Shakespeare might have known about, this is not science fiction;
it’s just what Infovore can do, and you can tell it exactly what to do by
writing SPARQL queries.
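For example (a sketch only; the type name below is illustrative, and since
:BaseKB uses mids as primary keys, the human-readable identifiers would first
be rewritten to mids by the basekb tools), a subset could be selected with
something like:

    # Sketch: pull out every statement about topics of an illustrative type.
    # fbase:type.object.type is the Freebase "instance of" property; the
    # wrestling type here is a made-up stand-in for whatever type you want.
    PREFIX fbase: <http://rdf.freebase.com/ns/>
    CONSTRUCT { ?topic ?p ?o }
    WHERE {
      ?topic fbase:type.object.type fbase:sports.pro_wrestler .
      ?topic ?p ?o .
    }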
I know many more people use SQL databases than use SPARQL databases today, and
Infovore has a good answer for them. SPARQL queries give answers in a tabular
format, exactly like a SQL table, so it’s quite easy to define
Freebase-to-relational mappings with SPARQL queries.
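For instance, the result set of a query like the following (again a sketch,
with human-readable identifiers that the basekb tools would rewrite to mids)
maps directly onto a three-column relational table:

    # Sketch: a tabular view of people with columns (person, name, born).
    PREFIX fbase: <http://rdf.freebase.com/ns/>
    SELECT ?person ?name ?born
    WHERE {
      ?person fbase:type.object.name ?name .
      ?person fbase:people.person.date_of_birth ?born .
    }
    LIMIT 100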
In all, Infovore is a good answer to the high memory consumption of triple
stores; by preprocessing RDF data to contain exactly what you need before
loading it into a triple store, you can handle large data sets and still enjoy
the
flexibility of SPARQL and RDF.
So, if you’d like to see a correct and up-to-date conversion of Freebase to RDF
now, rather than (possibly) never, send me an email. ([email protected])