Status of Freebase in RDF in Q4 2012

I'd like to share the status of my :BaseKB effort to convert Freebase to RDF,  
its future,  and how it relates to other efforts.  (See http://basekb.com/)

This is a long letter,  but the takeaway is that I’m looking to put together 
some sort of a group to advance the development of :BaseKB,  a product that 
converts Freebase data to logically sound RDF.  If this interests you, keep 
reading.

:BaseKB is the first and only complete and correct conversion of Freebase data 
to RDF.  :BaseKB is possible because there is no fundamental philosophical 
difference between the Freebase data model and the RDF data model.  Freebase 
spent $20 million or so developing graphd and the proprietary MQL language 
because they got started early, when SPARQL didn’t exist and when there wasn’t 
the vibrant competition between triple and quad store implementations that 
exists today.  As a result of this early start,  Freebase remains the world’s 
leading data wiki by a large margin.

For a long time,  Freebase has made it possible to retrieve a limited fraction 
of information in RDF format at rdf.freebase.com;  this service,  for the most 
part,  made it possible to retrieve triples sharing a specific subject,  as 
well as a number of “RDF molecules” involving CVT and mediator types.

The official RDF version of Freebase is of limited use for two main reasons:
most significantly, it has never been possible to query it with SPARQL;
second, the practical limitations of running a public API mean that Freebase
can only publish a limited number of triples for any given subject.  Although only a
small fraction of concepts in Freebase are affected by this limit, these highly 
connected nodes play a critical role in the graph.  Any ‘typical’ graph 
algorithm will traverse these nodes and thus give unsound results.  (To be 
fair, this is a general problem with the ‘dereferencing’ idea in Linked Data 
and not the fault of Freebase.)

Last April I released the early access version of :BaseKB,  the first complete 
and correct conversion of Freebase to RDF.  :BaseKB was supplied as a set of 
n-Triples files that could be loaded into a triple store and queried with 
SPARQL 1.1.

This early access release contained a subset of information from Freebase, 
including all schema objects,  all concepts that exist in Wikipedia,  and all 
CVTs that interconnect these concepts.  :BaseKB was made available for free 
under a CC-BY license, just like Freebase.

:BaseKB was designed to be both competitive with and complementary to
DBpedia.  Anyone trying to do projects with DBpedia will discover that
intensive data cleaning is usually necessary to get correct query answers.
Freebase's mode of operation promises better curation than DBpedia, and this
translates into more correct answers with less data cleaning.

Soon after, I announced the first release of :BaseKB Pro,  a commercial product 
comprising all facts from Freebase,  updated weekly on a subscription basis.

:BaseKB Pro was not a commercial success.  I didn’t sell a single license.  
Around the time this project was launched, I accepted a really great job
offer, so I've had limited time to work on :BaseKB and related projects.

Not long after :BaseKB was announced,  Google announced plans to publish an 
official RDF dump for Freebase in Summer 2012.  This was a welcome
development; however, it is one reason why :BaseKB development was put on
the back burner this summer.

In the meantime I've been happy to hear about certain work on reification at
Freebase, and I've also seen that DBpedia and Wikidata are both evolving in
the right directions.

As of October, no RDF dump has been published by Freebase, and the information 
I have leads me to believe that we can’t count on Google to provide a workable 
RDF dump of Freebase.  Thus, I’m beginning  to reassess the competitive 
landscape and to reposition :BaseKB.

I haven’t followed the discussion list closely in the past few months, but I 
did see a report that a Google engineer had great difficulty loading a Freebase 
dump into a triple store and concluded this wasn’t practical to do without an 
exotic and expensive computer with more than 64 GB of RAM.

I don’t know what data set was used, or what tools.   I do know that I can 
easily load both :BaseKB and :BaseKB Pro on a Lenovo W520 laptop with 32 GB of 
RAM (bought from Crucial at a price much lower than OEM.)  I demonstrated 
queries against :BaseKB Pro to individuals I met at the Semantic Technology 
Conference in San Francisco this July.  

Maybe it’s just hard to find good help these days.

It takes me 1 hour to load :BaseKB into OpenLink Virtuoso on an older
computer with 24 GB of RAM.  I know the Franz people have loaded :BaseKB into
AllegroGraph and that others have loaded it into BigData.  Unlike many
popular RDF data sets, :BaseKB passes a test suite that includes a streaming
version of Jena's 'eyeball' checker, ensuring a high degree of compatibility
with triple stores and other RDF tools.

:BaseKB achieves considerable compression relative to the quad dump and the
(unusable) Linked Data representation.  Nearly 8% of the quad dump consists
of statements to the effect that all but a handful of objects are world
writable.  The Linked Data representation often uses two triples to represent
what :BaseKB does in one, which results in hundreds of millions of triples of
harmful overhead.  Performance and compatibility were baked into :BaseKB from
the earliest stages of development.

I’d like to say that I had some special insight into the problem of converting 
Freebase data, but no, like a certain Winston Churchill quote,  I discovered 
the correct way to do it only after exhausting all of the other alternatives.  
I’m quite fortunate that I had some time where I could avoid the usual 
distractions involved with software projects and work out the math.

One problem that bothered me early on was that I needed information from the
schema to assign types to triples: if the schema told me that the object of a
certain predicate was always an integer, I could use that to generate a
triple with an integer object (one that would sort, for instance, like an
integer).
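
To make that concrete, here is a hedged illustration in Turtle with entirely
made-up identifiers (neither mid below is real, and the fbase: namespace URI
is an assumption): the schema-driven stage emits a typed literal rather than
a plain string, so the value sorts and compares numerically.

@prefix fbase: <http://rdf.basekb.com/ns/> .             # assumed namespace
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
# Both mids are placeholders, purely for illustration:
fbase:m.0subj  fbase:m.0intprop  "42"^^xsd:integer .     # typed, not the plain string "42"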

This process involved joining the predicate field of a data quad against the
subject field of the corresponding quad in the schema.  However, a predicate
written as "/people/person/date_of_birth" in the data shows up as "/m/04m1"
in the subject field of the schema.  This seemed to be a "chicken and egg"
problem until I realized that, very simply, queries against Freebase work
correctly when mid identifiers are used as primary keys.

This has the disadvantage that the predicates are no longer human readable;
you have to write something like

?person fbase:m.04m1 ?date .

in your SPARQL queries.  Once I recognized that converting the names in the
dump to mids was the real problem, everything else was downhill.  When you
treat mids as primary keys, SPARQL queries give logically sound results
against Freebase.
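
For example, here is a minimal sketch of a complete query built around the
one mid given above; the fbase: namespace URI is an assumption, so adjust it
to match your copy of :BaseKB.

PREFIX fbase: <http://rdf.basekb.com/ns/>     # assumed namespace
SELECT ?person ?date
WHERE { ?person fbase:m.04m1 ?date }          # /people/person/date_of_birth
LIMIT 10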

The disadvantage of this, of course, is that queries are harder to write, but 
this is overcome by the basekb tools

http://code.google.com/p/basekb-tools/

which rewrite names that appear in queries in the same way that the MQL query 
engine does.  You can join other data sets to :BaseKB by grounding (smooshing) 
them through the tools.   Alternatively,  you can write OWL and RDFS statements 
that map Freebase predicates to well-known vocabularies like foaf.  
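
As a hedged sketch of the second option, here is a SPARQL 1.1 Update that
declares, in your own graph, a mid-based Freebase predicate to be a
sub-property of foaf:name, so that an RDFS-aware store exposes it under the
FOAF vocabulary.  The mid is a placeholder (standing in for the name
predicate), and the fbase: namespace URI is an assumption.

PREFIX fbase: <http://rdf.basekb.com/ns/>     # assumed namespace
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
  fbase:m.0name rdfs:subPropertyOf foaf:name .   # placeholder mid
}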

I think people have found this answer distasteful, so they’ve often tried to 
substitute “human readable” identifiers for the mids.  Perhaps somebody really 
can make that work, but it’s a harder problem than it seems at first glance.  
For instance, important predicates have more than one name and you have to
support all of them.  You might think owl:sameAs would help here, but it
doesn't, not with the standard interpretation: a plain SPARQL engine does not
treat two identifiers linked by owl:sameAs as interchangeable unless a
reasoner materializes the equivalences.

It’s probably not hard to make something that’s almost right but the QA work in 
making something “half-baked but good enough” is often vastly greater than that 
of making something perfect.  I had to get the job done with a one-man army
corps, so I used the simplest possible correct answer.

I don’t know it for a fact, but I think it is quite possible that the 
development of a Freebase RDF dump inside of Google may have taken a wrong turn 
somewhere.

I put development of :BaseKB on hold in July for several reasons, one of
which is that I failed to sell any subscriptions for the :BaseKB Pro product.
As I mentioned above, around that time I also accepted a great job offer, so
I have had less time to work on projects of this sort.

:BaseKB was clearly ahead of the market last July,  but I think it’s time to 
develop partnerships that will keep it relevant.

Planned monthly releases of the :BaseKB product did not materialize because I 
haven’t had time to diagnose problems in the conversion process.  For instance, 
one quad dump that I downloaded has a single quad in shard 13 that throws an 
exception in one processing stage, which causes the system to abort.  

Fixing this is a matter of downloading a recent quad dump and running it in the 
debugger; almost certainly it’s a very small problem.  Somebody who wasn’t so 
obsessive about data quality would probably be happy to just eat the exception 
and lose the quad.

Similarly, I understand changes have been made in how descriptions are
implemented in Freebase, and this may allow the system to be simplified: it
previously required several processing stages to merge in descriptions from
the simple topic dump, as well as a web crawler to fetch descriptions for the
(occasionally) documented schema objects.  Now things should be simpler.

I’m not able to maintain :BaseKB on a week by week basis,  so I’m looking to 
the community for help.  I’m working on a plan to put :BaseKB in the hands of 
people who can use it and I’m considering options such as licensing the 
technology behind it or donating it to an Open Source project.  To do either 
I’ll want to have a credible plan to make :BaseKB sustainable.  Please write me 
at [email protected] if you are interested.

I’ll talk a bit about the software that creates :BaseKB.

It all revolves around "Infovore", an RDF-centric Map/Reduce framework that
runs in multiple threads on a single computer.  The framework, at the moment,
is designed for high efficiency at processing Freebase-scale data.  Unlike
triple-store based systems, Infovore's streaming processing has minimal
memory requirements.  In fact, the most economical environment for running
Infovore in AWS is a c1.medium instance with just 1.7 GB of RAM.  It
completes roughly 18 stages of processing in about 12 hours on a c1.medium
to convert the contents of Freebase to correct RDF.

I get better performance on my personal workstation, but operation and
development of Infovore are very practical on the underpowered laptops that
software developers so often seem to be stuck with.

Infovore is written in Java and uses the Jena framework;  reducers collect 
groups of statements together into models,  upon which data transformations can 
be specified using SPARQL 1.1.
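
As a hedged sketch of the kind of per-group transformation a reducer might
run, here is a SPARQL 1.1 Update that drops the "world writable" permission
statements mentioned earlier.  The predicate mid is a placeholder, not a real
identifier, and the fbase: namespace URI is an assumption.

PREFIX fbase: <http://rdf.basekb.com/ns/>     # assumed namespace
DELETE { ?s fbase:m.0permission ?o }          # placeholder permission predicate
WHERE  { ?s fbase:m.0permission ?o }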

Lately I've been working with much larger data sets in Hadoop and studying
the Map/Reduce model used there, and it seems very likely that the system
could be made more scalable and faster in wall-clock time (at some increase
in hardware cost per quad) by porting it to Hadoop.  I've also considered
adding support for Sesame and OWLIM-Lite, which may give better performance
and inference abilities.

Infovore also contains a system for high-speed mapping of "human readable"
Freebase identifiers to mid identifiers, and a system for applying
space-efficient in-memory graph algorithms to tasks such as the correct
pruning of the complete :BaseKB Pro into the much more usable :BaseKB.

(One of the many contradictions in my business plan was that :BaseKB is a
more commercially usable product than :BaseKB Pro.  :BaseKB takes advantage
of the policies of Wikipedia, which lead to a much closer mapping between
concepts in the system's point of view and concepts in the minds of end
users than exists in Freebase as a whole.)

The pruning algorithm is capable of creating other kinds of consistent
subsets; if you want a database of concepts connected with professional
wrestling, or of things that Shakespeare might have known about, this is not
science fiction; it's just what Infovore can do, and you can tell it exactly
what to do by writing SPARQL queries, as in the sketch below.
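
A hedged sketch of a seed query for such a pruning run, selecting every topic
typed as a professional wrestler: both mids here are placeholders (one
standing in for /type/object/type, one for the wrestler type), so look up the
real mids in your copy of :BaseKB; the namespace URI is an assumption.

PREFIX fbase: <http://rdf.basekb.com/ns/>     # assumed namespace
SELECT ?topic
WHERE { ?topic fbase:m.0type fbase:m.0wrestler }   # placeholder mids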

I know many more people use SQL databases than use SPARQL databases today,
and Infovore has a good answer for them.  SPARQL SELECT queries give answers
in a tabular format exactly like a SQL table, so it's quite easy to define
Freebase-to-relational mappings with SPARQL queries.
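
For instance, a minimal sketch of such a mapping: this SELECT yields a table
you could load straight into a relational "person" table.  fbase:m.04m1 is
the date-of-birth mid given earlier in this letter; the name predicate and
the namespace URI are placeholders and assumptions.

PREFIX fbase: <http://rdf.basekb.com/ns/>     # assumed namespace
SELECT ?person ?name ?date_of_birth
WHERE {
  ?person fbase:m.04m1 ?date_of_birth .       # /people/person/date_of_birth
  OPTIONAL { ?person fbase:m.0name ?name }    # placeholder mid for the name predicate
}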

In all, Infovore is a good answer to the high memory consumption of triple
stores; by preprocessing RDF data to contain exactly what you need before
loading it into a triple store, you can handle large data sets and still
enjoy the flexibility of SPARQL and RDF.

So, if you'd like to see an up-to-date and correct conversion of Freebase to
RDF now, rather than (possibly) never, send me an email.  ([email protected])

