On 13.11.2012, at 14:50, Reto Bachmann-Gmür wrote:

> On Tue, Nov 13, 2012 at 1:31 PM, Sebastian Schaffert <
> [email protected]> wrote:
> [...]
> 
>> 
>> Despite the solution I described, I still do not think the scenario is
>> well suited for evaluating RDF APIs. You also do not use Hibernate to
>> evaluate whether an RDBMS is good or not.
>> 
> The use case I propose is not the only one; I just think
> that API comparison should be based on evaluating their suitability for
> different, concretely defined use cases. It has nothing to do with
> Hibernate, nor with annotation-based object-to-RDF property mapping
> (of which there have been several proposals). It is the same principle as
> any23 or Aperture, but on the Java object level rather than on the binary data level.

The Java domain object level is one level of abstraction above the data 
representation/storage level. I mentioned Hibernate as an example of a 
generic mapping between the Java object level and the data representation level 
(even though in that case it is a relational database, the same can be done for 
RDF). The Java object level does not really allow one to draw good conclusions 
about the data representation level.

> I have my infrastructure that deals with graphs and I have a Set of contacts;
> what does the missing bit look like that lets me process this set with my RDF
> infrastructure? It is a reality that people don't (yet) have all their data
> as graphs; they might have some contacts in LDAP and some mails on an IMAP
> server.


I showed you an example of annotation-based object-to-RDF mapping to fill 
exactly that missing bit. This implementation works on any RDF API (we had it 
in Sesame, in KiWi, and now in the LMF) and has been done by several other 
people as well. It does not really help much in deciding what the RDF API itself 
should look like, though. 
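
For illustration, a minimal sketch of what such an annotation-based mapping can 
look like; the annotation names and the domain class below are made up for the 
example and are not the actual KiWi/LMF API:

    // Hypothetical annotations, only to illustrate the mapping idea.
    import java.lang.annotation.*;

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface RDFType { String value(); }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface RDFProperty { String value(); }

    // A plain domain object; a generic mapper can read the annotations via
    // reflection and emit one triple per annotated field, on top of any RDF API.
    @RDFType("http://xmlns.com/foaf/0.1/Person")
    class Contact {
        @RDFProperty("http://xmlns.com/foaf/0.1/name")
        String name;

        @RDFProperty("http://xmlns.com/foaf/0.1/mbox")
        String email;
    }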

> 
> 
>>>> 
>>>> If this is really an issue, I would suggest coming up with a bigger
>>>> collection of RDF API usage scenarios that are also relevant in practice
>>>> (as proven by a software project using it), including scenarios for how to
>>>> deal with bigger amounts of data (i.e. beyond toy examples). My scenarios
>>>> typically include >= 100 million triples. ;-)
>>>> 
>>>> In addition to what Andy said about wrapper APIs, I would also like to
>>>> emphasise the incurred memory and computation overhead of wrapper APIs. Not
>>>> an issue if you have only a handful of triples, but a big issue when you
>>>> have 100 million.
>> 
> A wrapper doesn't mean you have in-memory objects for all the triples in
> your store; that would be absurd. But if your code deals with some resources at
> runtime, these resources are represented by object instances which contain at
> least a pointer to the resource located in RAM. So the overhead of a
> wrapper is linear in the amount of RAM the application would need anyway
> and independent of the size of the triple store.

So in other words: instead of a server with 8GB I might need one with 10GB RAM, 
just because I decided to use a wrapper instead of the native API. Or to put it 
differently: with the same server I can hold fewer objects in my in-memory 
cache, possibly sacrificing a lot of processing time. In my experience, it 
makes a big difference.

> Besides, I would like to
> compare possible APIs here; ideally the best API would be widely adopted,
> making wrappers superfluous. (I could also mention that the Jena Model class
> also wraps a Graph instance.)

Agreed.

> 
> 
>> 
>>> It's a common misconception to think that Java sets are limited to 2^31-1
>>> elements, but even that would be more than 100 million. In the challenge I
>>> didn't ask for time complexity; it would be fair to ask for that too if you
>>> want to analyze scenarios with such a big number of triples.
>> 
>> It is a common misconception that just because you have a 64-bit
>> architecture you also have 2^64 bytes of memory available. And it is a
>> common misconception that in-memory data representation means you do not
>> need to take into account storage structures like indexes. Even if you
>> represent this amount of data in memory, you will run into the same problem.
>> 
>> 95% of all RDF scenarios will require persistent storage. Selecting a
>> scenario that does not take this into account is useless.
>> 
> 
> I don't know where your RAM fixation comes from.

I started programming with 64 kbytes and grew up into Computer Science at a time 
when "640 kbytes ought to be enough for anyone" ;-)

Joking aside, it comes from the real-world use cases we are working on, e.g. a 
Linked Data and Semantic Search server at http://search.salzburg.com, 
representing about 1.2 million news articles as RDF, resulting in about 140 
million triples. It also comes from my experience with IkeWiki, a 
Semantic Wiki system built completely on RDF (using Jena at that time).

The server the partner has provided us with for the Semantic Search has 3GB of 
RAM and is a virtual VMware instance with not the best I/O performance. 
Importing all news articles on this server and processing them takes 2 weeks 
(after spending many days doing performance profiling with YourKit and 
identifying bottlenecks and unnecessary overheads like wrappers or proxy 
classes). If I have a wrapper implementation in between, even a lightweight one 
that adds maybe just 10% overhead, that is 1.5 extra days! The performance 
overhead clearly matters. 

In virtually all my RDF projects of the last 10-12 years, the CENTRAL issues 
were always efficient/effective/reliable/convenient storage and 
efficient/effective/reliable/convenient querying (in parallel environments). 
These are the criteria an RDF API should IMHO be evaluated against. In my 
personal experience, the data model and repository API of Sesame were the best 
choice to cover these scenarios in all the different kinds of use cases I have 
had so far (small data and big data). It was also the most flexible option, 
because of its consistent use of interfaces and modular choice of backends. 
Jena comes close, but has not yet gone through the architectural changes (i.e. 
an interface-based data model) that Sesame already did with the 2.x series. 
Clerezza so far is not a real option to achieve my goals. It is good and 
convenient when working with small in-memory representations of graphs, but (as 
we discussed before) lacks persistence and querying features that are important 
to me. If I am purely interested in Sets of triples, guess what: I create a 
Java Set and put triples in it. For example, we even have an extended set with 
(limited) query index support [1], which I created after realizing that we 
spent considerable time just iterating unnecessarily over sets. No need for a 
new API.

[1] 
http://code.google.com/p/lmf/source/browse/lmf-core/src/main/java/kiwi/core/model/table/TripleTable.java
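
To make the idea behind [1] concrete, here is a minimal sketch of an indexed 
triple set along those lines; the class and method names are simplified for 
illustration and are not copied from the actual TripleTable implementation:

    import java.util.*;

    // Simplified triple; the real implementation works on the RDF API's statement type.
    class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }

        @Override public boolean equals(Object other) {
            if (!(other instanceof Triple)) return false;
            Triple t = (Triple) other;
            return subject.equals(t.subject) && predicate.equals(t.predicate)
                    && object.equals(t.object);
        }
        @Override public int hashCode() { return Objects.hash(subject, predicate, object); }
    }

    // A plain Java set of triples with one extra index, so that lookups by
    // subject do not require iterating over the whole set.
    class IndexedTripleSet {
        private final Set<Triple> triples = new HashSet<Triple>();
        private final Map<String, Set<Triple>> bySubject = new HashMap<String, Set<Triple>>();

        public boolean add(Triple t) {
            if (!triples.add(t)) {
                return false; // already present
            }
            Set<Triple> bucket = bySubject.get(t.subject);
            if (bucket == null) {
                bucket = new HashSet<Triple>();
                bySubject.put(t.subject, bucket);
            }
            bucket.add(t);
            return true;
        }

        // "Query" support: all triples with the given subject, without a full scan.
        public Set<Triple> listBySubject(String subject) {
            Set<Triple> bucket = bySubject.get(subject);
            return bucket == null ? Collections.<Triple>emptySet() : bucket;
        }

        public int size() { return triples.size(); }
    }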
 

> My use cases don't mandate in-memory storage in any way. The 2^31-1 
> misconception comes not
> from 32-bit architectures but from the fact that Set.size() is defined to
> return an int value (i.e. a maximum of 2^31-1), while the API is clear that a
> Set can be bigger than that.  

I did not come up with any 2^31 misconception. And *of course* the 2^31-1 topic 
is originally caused by 32-bit architectures, because that is why an integer (in 
Java) is defined as 32 bits (the size you can store in a processor register, so 
simple computations require only a single processor instruction). And the fact 
that Java uses 32-bit ints for many things DOES cause problems, as Rupert can 
tell you from experience: it can e.g. happen that two completely different 
objects share the same hash code, because the hash code is an integer while the 
memory address is a long.
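
As a side note, the collision issue is trivial to reproduce in plain Java, 
independent of any RDF API (a well-known example):

    public class HashCollision {
        public static void main(String[] args) {
            // Two distinct strings with identical 32-bit hash codes.
            System.out.println("Aa".hashCode()); // 2112
            System.out.println("BB".hashCode()); // 2112
            // java.util.Set#size() is likewise capped: per its Javadoc it returns
            // Integer.MAX_VALUE if the set contains more elements than that.
        }
    }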

What I was referring to is that regardless of the amount of memory you have, 
persistence and querying are the core functionality of any RDF API. The use 
cases where you are working with RDF data and don't need persistence are rare 
(serializing and deserializing domain objects via RDF comes to mind), and for 
consistency reasons I prefer treating them in the same way as the persistent 
cases, even if it means that I have to deal with persistence concepts (e.g. 
repository connections or transactions) without direct need. On the other hand, 
persistence comes with some important requirements, which have long been known 
and are summarized in the ACID principles, and which need to be satisfied by an 
RDF API.
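
This is, by the way, exactly what I like about the Sesame repository API: the 
same connection-based code works against an in-memory store and a persistent 
store just by swapping the backend. A rough sketch from memory (Sesame 2.x; 
exact class and method names may differ slightly between versions):

    import java.io.File;
    import org.openrdf.model.URI;
    import org.openrdf.model.ValueFactory;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.memory.MemoryStore;
    import org.openrdf.sail.nativerdf.NativeStore;

    public class SameApiTwoBackends {
        public static void main(String[] args) throws Exception {
            // In-memory backend for small tests ...
            Repository repo = new SailRepository(new MemoryStore());
            // ... or a persistent backend for real data; the rest of the code is unchanged:
            // Repository repo = new SailRepository(new NativeStore(new File("/tmp/rdf-store")));
            repo.initialize();

            RepositoryConnection con = repo.getConnection();
            try {
                ValueFactory vf = repo.getValueFactory();
                URI subject = vf.createURI("http://example.org/post/1");
                URI predicate = vf.createURI("http://purl.org/dc/terms/title");
                con.add(subject, predicate, vf.createLiteral("Carving skis compared"));
                // (with the default auto-commit setting each add() is committed immediately)
            } finally {
                con.close();
            }
            repo.shutDown();
        }
    }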

> And again, other use cases are welcome; let's
> look at how they can be implemented with different APIs, how elegant the
> solutions are, what their runtime properties are, and of course how relevant
> the use cases are, in order to find the most suitable API.


Ok, my challenges (from a real project):
- I want to be able to run a crawler over skiing forums, extract the topics, 
posts, and user information from them, perform POS tagging and sentiment 
analysis, and store the results together with the post content in my RDF 
repository;
- in case one of the processes in between fails (e.g. due to a network error), I 
want to properly roll back all changes made to the repository while processing 
this particular post or topic (see the sketch after this list);
- I want to expose this dataset (with 10 million posts and 1 billion triples) 
as Linked Data, possibly taking into account a large number of parallel requests 
on that data (e.g. while Linked Data researchers are preparing their articles 
for ISWC);
- I want to run complex aggregate queries over big datasets (while the crawling 
process is still running!), e.g. "give me all forum posts, out of a set of 10 
million on skiing, that are concerned with 'carving skiing', with an average 
sentiment of >0.5 for mentions of the noun phrase 'Atomic Racer SL6', and 
display for each the number of replies in the forum topic";
- I want to store a SKOS thesaurus on skiing in a separate named graph and run 
queries over the combination of the big dataset of posts and the small 
thesaurus (e.g. to get the labels of concepts instead of the URIs);
- I want to have a configurable rule-based reasoner where I can add simple 
rules like a "broaderTransitive" rule for the SKOS broader relationship; it has 
to run on 1 billion triples;
- I want to repeat the crawling process every X days, possibly updating post 
data in case something has changed, even while another crawling process is 
running and another user is running a complex query.
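
For the rollback requirement above, a minimal sketch of what I mean, again using 
the Sesame repository API as I remember it (in Sesame 2.x before 2.7 a transaction 
is started with setAutoCommit(false), later versions use begin(); the 
crawler/extraction method is a placeholder):

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;

    public class CrawlWithRollback {

        // Placeholder for the actual extraction/tagging/sentiment pipeline.
        static void processAndStorePost(RepositoryConnection con, String postUrl) throws Exception {
            // ... add the triples for this post via con.add(...) ...
        }

        public static void crawlPost(Repository repo, String postUrl) throws Exception {
            RepositoryConnection con = repo.getConnection();
            try {
                con.setAutoCommit(false);          // start a transaction (Sesame 2.x style)
                processAndStorePost(con, postUrl); // may fail, e.g. on a network error
                con.commit();                      // all triples of this post become visible at once
            } catch (Exception e) {
                con.rollback();                    // nothing of this post ends up in the repository
                throw e;
            } finally {
                con.close();
            }
        }
    }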

With the same API model (i.e. without learning a different API), I also want to:
- import, with a few lines of code, a small RDF document into memory to run some 
small tests;
- take a bunch of triples and serialize them as RDF/XML or N3
(a sketch of both follows below).
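
Something along these lines, again sketched from memory against Sesame 2.x (the 
file names are just examples):

    import java.io.File;
    import java.io.FileOutputStream;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.rio.RDFFormat;
    import org.openrdf.rio.Rio;
    import org.openrdf.sail.memory.MemoryStore;

    public class ParseAndSerialize {
        public static void main(String[] args) throws Exception {
            Repository repo = new SailRepository(new MemoryStore());
            repo.initialize();
            RepositoryConnection con = repo.getConnection();
            try {
                // Import a small RDF/XML document into the in-memory repository.
                con.add(new File("test-data.rdf"), "http://example.org/", RDFFormat.RDFXML);

                // Serialize everything back out, this time as N3.
                con.export(Rio.createWriter(RDFFormat.N3, new FileOutputStream("out.n3")));
            } finally {
                con.close();
                repo.shutDown();
            }
        }
    }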


Cheers, ;-)


Sebastian
-- 
| Dr. Sebastian Schaffert          [email protected]
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg
