Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-30 Thread Kingsley Idehen


François-Paul Servant wrote:

Kingsley Idehen a écrit :

Chris Bizer wrote:

...
Depending on how much data integration you need, you then start to apply
some identity resolution and schema mapping techniques. We have been
talking to some pharma and media companies that have done data warehousing
for years, and they all seem to be very interested in this quick and dirty
approach.

"Quick & Dirty" is simply not how I would characterize this matter. I 
prefer to describe this as step 1 in a multi phased approach to RDF 
based data integration.


Kingsley, I agree. I find it clean to take the basic but necessary
first steps towards data integration (identification of things,
mapping, etc.). Building upon legacy systems that you just have to
adapt doesn't make the approach dirty: it makes it possible ;-)

Francois,

Yes.

In my world view, "quick & dirty" is synonymous with an inherent inability
to scale across quality vectors such as:

1. Integration use-case complexity
2. Data volume growth
3. Concurrent usage growth
4. Deployment target growth


When I encounter "quick & dirty" associated with anything in the
technology realm, the items above set off deafening alarm bells in my
head :-)


Kingsley



cheers,

fps





--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-30 Thread François-Paul Servant

Kingsley Idehen a écrit :

Chris Bizer wrote:

...
Depending on how much data integration you need, you then start to apply
some identity resolution and schema mapping techniques. We have been
talking to some pharma and media companies that have done data warehousing
for years, and they all seem to be very interested in this quick and dirty
approach.

"Quick & Dirty" is simply not how I would characterize this matter. I 
prefer to describe this as step 1 in a multi phased approach to RDF 
based data integration.


Kingsley, I agree. I find it clean to take the basic but necessary first
steps towards data integration (identification of things, mapping, etc.).
Building upon legacy systems that you just have to adapt doesn't make the
approach dirty: it makes it possible ;-)


cheers,

fps




Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-30 Thread Kingsley Idehen


Chris Bizer wrote:

Hi Orri,

  
> It is my feeling that RDF has a dual role:
> 1. interchange format:  This is like what XML does, except that RDF has
> more semantics and expressivity.
> 2: Database storage format for cases where data must be integrated and
> is too heterogeneous to easily fall into one relational schema.  This is
> for example the case in the open web conversation and social space.
> The first case is for mapping, the second for warehousing.
> Aside this, there is potential for more expressive queries through the
> query language dealing with inferencing, like subclass/subproperty/
> transitive etc.  These do not go very well with SQL views.

I cannot agree more with what you say :-)

We are seeing the first RDF use case emerge within initiatives like the
Linking Open Data effort, where besides being more expressive, RDF is also
playing to its strength of providing data links between records in
different databases.


Talking with people from industry, I get the feeling that also more and
more people understand the second use case and that RDF is increasingly
used as a technology for something like "poor man's data integration". You
don't have to spend a lot of time and money on designing a comprehensive
data warehouse. You just throw together data having different schemata
from different sources and instantly get the benefit that you can browse
and query the data and that you have proper provenance tracking (using
Named Graphs). Depending on how much data integration you need, you then
start to apply some identity resolution and schema mapping techniques. We
have been talking to some pharma and media companies that have done data
warehousing for years, and they all seem to be very interested in this
quick and dirty approach.

Chris,

"Quick & Dirty" is simply not how I would characterize this matter. I
prefer to describe this as step 1 in a multi-phase approach to RDF-based
data integration.

For both use cases, inferencing is a nice add-on but not essential. Within
the first use case, inferencing usually does not work, as data published
by various autonomous sources tends to be too dirty for reasoning engines.
Inferencing is not a nice add-on; it is essential (in varying degrees)
once you get beyond the initial stages of heterogeneous data
integration. As with all things, these matters are connected and
inherently symbiotic: you can't do inferencing without having something
you want to reason about available in palatable form, which goes back to
the phased approach I refer to above.
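
To make the point concrete, here is a minimal sketch (the ex: vocabulary
is invented for illustration, and the transitive closure uses a SPARQL 1.1
property path) of the kind of subclass-aware query that plain SQL views
handle poorly:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/schema#>

# Everything that is, directly or via rdfs:subClassOf, an ex:Document.
# A store with RDFS inference returns instances of every subclass; an
# equivalent SQL view would need one UNION branch per subclass.
SELECT ?thing ?type
WHERE {
  ?thing a ?type .
  ?type rdfs:subClassOf* ex:Document .
}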


In my eyes, and in my experience, RDF is a powerful vehicle for
implementing conceptual-level data access that sits atop heterogeneous
data sources. Its novelty comes from the platform independence that it
injects into the data integration technology realm.



Kingsley

Cheers,

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: [EMAIL PROTECTED]
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


From Henry Story:

> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:

On a similar note, by mining the query logs it would be possible to create
parameterised queries and associated plan fragments without the client
needing to notify the server of the templates.  Coupled with automatically
calculating possible materialized views or other layout optimizations, the
poor, overworked client application writer doesn't get brought into
optimizing the server.

Andy

  
 
Orri here:


With the BSBM workload, using parametrized queries at small scale saves
roughly 1/3 of the execution time.  It is possible to remember query plans
and to notice if the same query text is submitted with only changes in
literal values.  If the first query ran quickly, one may presume the query
with substitutions will also run quickly.  There are of course exceptions,
but detecting these will mean running most of the optimizer cost model and
will eliminate any benefit from caching.


The other optimizations suggested have a larger upside but are far harder.
I would say that if we have a predictable workload, then mapping
relational to RDF is a lot easier than expecting the DBMS to figure out
materialized views to do the same.  If we do not have a predictable
workload, then making too many materialized views based on transient usage
patterns is a large downside because it grows the database, meaning less
working set.  The difference between in-memory random access and a

Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-30 Thread Chris Bizer


Hi Orri,

> It is my feeling that RDF has a dual role:
> 1. interchange format:  This is like what XML does, except that RDF has
> more semantics and expressivity.
> 2: Database storage format for cases where data must be integrated and
> is too heterogeneous to easily fall into one relational schema.  This is
> for example the case in the open web conversation and social space.
> The first case is for mapping, the second for warehousing.
> Aside this, there is potential for more expressive queries through the
> query language dealing with inferencing, like subclass/subproperty/
> transitive etc.  These do not go very well with SQL views.

I cannot agree more with what you say :-)

We are seeing the first RDF use case emerge within initiatives like the
Linking Open Data effort, where besides being more expressive, RDF is also
playing to its strength of providing data links between records in
different databases.

Talking with people from industry, I get the feeling that also more and
more people understand the second use case and that RDF is increasingly
used as a technology for something like "poor man's data integration". You
don't have to spend a lot of time and money on designing a comprehensive
data warehouse. You just throw together data having different schemata
from different sources and instantly get the benefit that you can browse
and query the data and that you have proper provenance tracking (using
Named Graphs). Depending on how much data integration you need, you then
start to apply some identity resolution and schema mapping techniques. We
have been talking to some pharma and media companies that have done data
warehousing for years, and they all seem to be very interested in this
quick and dirty approach.
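
As a concrete illustration of the provenance point (a minimal sketch; the
entity URI is invented, foaf: is the standard FOAF namespace): with each
source loaded into its own named graph, the graph URI doubles as
provenance, so you can ask which source claims a given fact:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Which sources claim a homepage for this entity?
SELECT ?sourceGraph ?homepage
WHERE {
  GRAPH ?sourceGraph {
    <http://example.org/company/acme> foaf:homepage ?homepage .
  }
}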

For both use cases, inferencing is a nice add-on but not essential. Within
the first use case, inferencing usually does not work, as data published
by various autonomous sources tends to be too dirty for reasoning engines.

Cheers,

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: [EMAIL PROTECTED]
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


From Henry Story:

>
>
> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:

On a similar note, by mining the query logs it would be possible to create
parameterised queries and associated plan fragments without the client
needing to notify the server of the templates.  Coupled with automatically
calculating possible materialized views or other layout optimizations, the
poor, overworked client application writer doesn't get brought into
optimizing the server.

Andy

>
 
Orri here:

With the BSBM workload, using parametrized queries at small scale saves
roughly 1/3 of the execution time.  It is possible to remember query plans
and to notice if the same query text is submitted with only changes in
literal values.  If the first query ran quickly, one may presume the query
with substitutions will also run quickly.  There are of course exceptions,
but detecting these will mean running most of the optimizer cost model and
will eliminate any benefit from caching.
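
A minimal sketch of the caching idea (the query is BSBM-flavoured, but the
property name is an assumption here, not copied from the spec): two
submissions that differ only in a literal normalize to one template, which
serves as the plan-cache key:

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>

# Submission 1 uses 300, submission 2 uses 500; with literals replaced
# by placeholders, both hash to the same cached plan, so the optimizer
# cost model runs only once.
SELECT ?product
WHERE {
  ?product bsbm:productPropertyNumeric1 ?value .
  FILTER (?value > 300)
}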


The other optimizations suggested have a larger upside but are far harder.
I would say that if we have a predictable workload, then mapping
relational to RDF is a lot easier than expecting the DBMS to figure out
materialized views to do the same.  If we do not have a predictable
workload, then making too many materialized views based on transient usage
patterns is a large downside because it grows the database, meaning less
working set.  The difference between in-memory random access and a random
access with disk is about 5000 times.  Plus there is a high cost to making
the views, thus a high penalty for a wrong guess.  And if it is hard
enough to figure out where a query plan goes wrong with a given schema, it
is harder still to figure it out with a schema that morphs by itself.
In the RDB world, for example, Oracle recommends saving optimizer
statistics from the test environment and using these in the production
environment just so the optimizer does not get creative.  Now this is the
essence of wisdom for OLTP, but we are not talking OLTP with RDF.

If there is a history of usage and this history is steady and the DBA can
confirm it as being a representative sample, then automatic materializing
of joins is a real possibility.  Doing this spontaneously would lead to
erratic response times, though.  For anything online, the accent i


Re: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-25 Thread Kingsley Idehen


Chris Bizer wrote:

Hi Kingsley and Paul,

Yes, I completely agree with you that different storage solutions fit
different use cases and that one of the main strengths of the RDF data model
is its flexibility and the possibility to mix different schemata.

Nevertheless, I think it is useful to give application developers an
indicator about what performance they can expect when they choose a
specific architecture, which is what the benchmark is trying to do.

Chris,

Yes, but the user profile has to be a little clearer. If you separate
the results in the narrative you achieve the goal. You can use SQL
numbers as a sort of benchmark if you clearly explain the natural skew
that SQL enjoys due to the nature of the schema.

We plan to run the benchmark again in January and it would be great to also
test Tucana/Kowari/Mulgara in this run.

As the performance of RDF stores is constantly improving, let's also hope
that the picture will not look that bad for them anymore then.
  
But at the current time, there is no clear sense of what "better" means
:-) What's the goal?


What I fundamentally take from the benchmarks are the following:

1. Native RDF and RDF Views/Mapper scalability is becoming less of an
issue (of course depending on your choice of product), and we are already
at the point where this technology can be used for real-world solutions
that have enterprise-level scalability demands and expectations.


2. It's impractical to create RDF warehouses from existing SQL Data
Sources when you can put RDF Views / Wrappers in front of the SQL Data
Sources (SQL cost optimization technology has evolved significantly over
the years across RDBMS engines).



And yes, I would also like to see Mulgara and other RDF Stores in the
next round of benchmarks :-)


Kingsley

Cheers,

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kingsley Idehen
Sent: Wednesday, 24 September 2008 20:57
To: Paul Gearon
Cc: [EMAIL PROTECTED]; public-lod@w3.org
Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


Paul Gearon wrote:
  

On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:

On 09/19/08 23:12 +0200, Orri Erling wrote:

Has there been any analysis on whether there is a *fundamental*
reason for such performance difference? Or is it simply a question of
"maturity"; in other words, relational db technology has been around for a
very long time and is very mature, whereas RDF implementations are still
quite recent, so this gap will surely narrow ...?

This is a very complex subject.  I will offer some analysis below, but
this I fear will only raise further questions.  This is not the end of the
road, far from it.

As far as I understand, another issue is relevant: this benchmark is
somewhat unfair as the relational stores have one advantage compared to the
native triple stores: the relational data structure is fixed (Products,
Producers, Reviews, etc with given columns), while the triple
representation is generic (arbitrary s,p,o).


This point has an effect on several levels.

For instance, the flexibility afforded by triples means that objects
stored in this structure require processing just to piece it all
together, whereas the RDBMS has already encoded the structure into the
table. Ironically, this is exactly the reason we
(Tucana/Kowari/Mulgara) ended up building an RDF database instead of
building on top of an RDBMS: The flexibility in table structure was
less efficient than a system that just "knew" it only had to deal with
3 columns. Obviously the shape of the data (among other things)
dictates which is the better type of storage to use.

A related point is that processing RDF to create an object means you
have to move around a lot in the graph. This could mean a lot of
seeking on disk, while an RDBMS will usually find the entire object in
one place on the disk. And seeks kill performance.

This leads to the operations used to build objects from an RDF store.
A single object often requires the traversal of several statements,
where the object of one statement becomes the subject of the next.
Since the tables are typically represented as
Subject/Predicate/Object, this means that the main table will be
"joined" against itself. Even RDBMSs are notorious for not doing this
efficiently.

One of the problems with self-joins is that efficient operations like
merge-joins (when they can be identified) will still result in lots of
seeking, since simple iteration on both sides of the join means
seeking around in the same data. Of course, there ARE ways to optimize
some of this, but the various stores are only just starting to get to
these optimizations now.

Relational databases suffer simila

Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-24 Thread Eyal Oren


On 09/24/08 23:17 +0200, Story Henry wrote:


> As a matter of interest, would it be possible to develop RDF stores that
> optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage of
> the classes in order to decide how to optimize the compilation.

RDFBroker, by Sintek and Kiesel, did something like this, but analysing
the data instead of the queries, see [1]. See also the various papers on
vertical partitioning.


 -eyal

[1] http://www.springerlink.com/content/q313416g113n2257/



Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-24 Thread Story Henry


As a matter of interest, would it be possible to develop RDF stores  
that optimize the layout of the data by analyzing the queries to the  
database? A bit like a Java Just In Time compiler analyses the usage  
of the classes in order to decide how to optimize the compilation.


Henry

On 24 Sep 2008, at 20:30, Paul Gearon wrote:


A related point is that processing RDF to create an object means you
have to move around a lot in the graph. This could mean a lot of
seeking on disk, while an RDBMS will usually find the entire object in
one place on the disk. And seeks kill performance.

This leads to the operations used to build objects from an RDF store.
A single object often requires the traversal of several statements,
where the object of one statement becomes the subject of the next.
Since the tables are typically represented as
Subject/Predicate/Object, this means that the main table will be
"joined" against itself. Even RDBMSs are notorious for not doing this
efficiently.





AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-24 Thread Chris Bizer

Hi Kingsley and Paul,

Yes, I completely agree with you that different storage solutions fit
different use cases and that one of the main strengths of the RDF data model
is its flexibility and the possibility to mix different schemata.

Nevertheless, I think it is useful to give application developers an
indicator about what performance they can expect when they choose a
specific architecture, which is what the benchmark is trying to do.

We plan to run the benchmark again in January and it would be great to also
test Tucana/Kowari/Mulgara in this run.

As the performance of RDF stores is constantly improving, let's also hope
that the picture will not look that bad for them anymore then.

Cheers,

Chris


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kingsley Idehen
Sent: Wednesday, 24 September 2008 20:57
To: Paul Gearon
Cc: [EMAIL PROTECTED]; public-lod@w3.org
Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


Paul Gearon wrote:
> On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>   
>> On 09/19/08 23:12 +0200, Orri Erling wrote:
>> 
>>>> Has there been any analysis on whether there is a *fundamental*
>>>> reason for such performance difference? Or is it simply a question of
>>>> "maturity"; in other words, relational db technology has been around
>>>> for a very long time and is very mature, whereas RDF implementations
>>>> are still quite recent, so this gap will surely narrow ...?
>>>> 
>>> This is a very complex subject.  I will offer some analysis below, but
>>> this I fear will only raise further questions.  This is not the end of
>>> the road, far from it.
>>>   
>> As far as I understand, another issue is relevant: this benchmark is
>> somewhat unfair as the relational stores have one advantage compared to
>> the native triple stores: the relational data structure is fixed
>> (Products, Producers, Reviews, etc with given columns), while the
>> triple representation is generic (arbitrary s,p,o).
>> 
>
> This point has an effect on several levels.
>
> For instance, the flexibility afforded by triples means that objects
> stored in this structure require processing just to piece it all
> together, whereas the RDBMS has already encoded the structure into the
> table. Ironically, this is exactly the reason we
> (Tucana/Kowari/Mulgara) ended up building an RDF database instead of
> building on top of an RDBMS: The flexibility in table structure was
> less efficient than a system that just "knew" it only had to deal with
> 3 columns. Obviously the shape of the data (among other things)
> dictates which is the better type of storage to use.
>
> A related point is that processing RDF to create an object means you
> have to move around a lot in the graph. This could mean a lot of
> seeking on disk, while an RDBMS will usually find the entire object in
> one place on the disk. And seeks kill performance.
>
> This leads to the operations used to build objects from an RDF store.
> A single object often requires the traversal of several statements,
> where the object of one statement becomes the subject of the next.
> Since the tables are typically represented as
> Subject/Predicate/Object, this means that the main table will be
> "joined" against itself. Even RDBMSs are notorious for not doing this
> efficiently.
>
> One of the problems with self-joins is that efficient operations like
> merge-joins (when they can be identified) will still result in lots of
> seeking, since simple iteration on both sides of the join means
> seeking around in the same data. Of course, there ARE ways to optimize
> some of this, but the various stores are only just starting to get to
> these optimizations now.
>
> Relational databases suffer similar problems, but joins are usually
> only required for complex structures between different tables, which
> can be stored on different spindles. Contrast this to RDF, which needs
> to do many of these joins for all but the simplest of data.
>
>   
>> One can question whether such flexibility is relevant in practice, and
>> if so, one may try to extract such structured patterns from data
>> on-the-fly. Still, it's important to note that we're comparing somewhat
>> different things here between the relational and the triple
>> representation of the benchmark.
>> 
>
> This is why I think it is very important to consider the type of data
> being stored before choosing the type of storage to use. For some
> applications an RDBMS is going to win hands down e

Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-24 Thread Kingsley Idehen


Paul Gearon wrote:

On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
  

On 09/19/08 23:12 +0200, Orri Erling wrote:


Has there been any analysis on whether there is a *fundamental*
reason for such performance difference? Or is it simply a question of
"maturity"; in other words, relational db technology has been around for a
very long time and is very mature, whereas RDF implementations are still
quite recent, so this gap will surely narrow ...?


This is a very complex subject.  I will offer some analysis below, but
this I fear will only raise further questions.  This is not the end of the
road, far from it.
  

As far as I understand, another issue is relevant: this benchmark is
somewhat unfair as the relational stores have one advantage compared to the
native triple stores: the relational data structure is fixed (Products,
Producers, Reviews, etc with given columns), while the triple representation
is generic (arbitrary s,p,o).



This point has an effect on several levels.

For instance, the flexibility afforded by triples means that objects
stored in this structure require processing just to piece it all
together, whereas the RDBMS has already encoded the structure into the
table. Ironically, this is exactly the reason we
(Tucana/Kowari/Mulgara) ended up building an RDF database instead of
building on top of an RDBMS: The flexibility in table structure was
less efficient than a system that just "knew" it only had to deal with
3 columns. Obviously the shape of the data (among other things)
dictates which is the better type of storage to use.

A related point is that processing RDF to create an object means you
have to move around a lot in the graph. This could mean a lot of
seeking on disk, while an RDBMS will usually find the entire object in
one place on the disk. And seeks kill performance.

This leads to the operations used to build objects from an RDF store.
A single object often requires the traversal of several statements,
where the object of one statement becomes the subject of the next.
Since the tables are typically represented as
Subject/Predicate/Object, this means that the main table will be
"joined" against itself. Even RDBMSs are notorious for not doing this
efficiently.

One of the problems with self-joins is that efficient operations like
merge-joins (when they can be identified) will still result in lots of
seeking, since simple iteration on both sides of the join means
seeking around in the same data. Of course, there ARE ways to optimize
some of this, but the various stores are only just starting to get to
these optimizations now.

Relational databases suffer similar problems, but joins are usually
only required for complex structures between different tables, which
can be stored on different spindles. Contrast this to RDF, which needs
to do many of these joins for all but the simplest of data.

  

One can question whether such flexibility is relevant in practice, and if
so, one may try to extract such structured patterns from data on-the-fly.
Still, it's important to note that we're comparing somewhat different things
here between the relational and the triple representation of the benchmark.



This is why I think it is very important to consider the type of data
being stored before choosing the type of storage to use. For some
applications an RDBMS is going to win hands down every time. For other
applications, an RDF store is definitely the way to go. Understanding
the flexibility and performance constraints of each is important. This
kind of benchmarking helps with that. It also helps identify where RDF
databases need to pick up their act.

Regards,
Paul Gearon


  

Paul,

You make valid points. The problem here is that the benchmark has been
released without enough clarity about its prime purpose. To even compare
RDF Quad Stores with an RDBMS engine when the schema is Relational is in
itself kinda twisted.


The role of mappers (D2RQ & Virtuoso RDF Views), for instance, should
have been made much clearer, maybe in separate results tables. I say
this because these mappers offer different approaches to projecting
RDBMS based data in RDF Linked Data form, on the fly, and their purpose
in this benchmark is all about raw performance and scalability under the
following RDF Linked Data generation and deployment conditions:


1. Schema is Relational
2. RDF warehouse is impractical

As I am sure you know, we could invert this whole benchmark "Open World"
style, and then bring RDBMS engines to their knees by incorporating
SPARQL query patterns comprised of ?p's and subclasses.
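
For instance (a minimal sketch; the ex: vocabulary is invented), a pattern
with an unbound predicate plus subclass membership has no fixed-column SQL
translation:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/schema#>

# Every property of every instance of any subclass of ex:Offer.  The
# unbound ?p ranges over all predicates, something a relational schema
# with one table per entity type cannot answer without unioning every
# column of every table.
SELECT ?s ?p ?o
WHERE {
  ?class rdfs:subClassOf ex:Offer .
  ?s a ?class ;
     ?p ?o .
}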


To conclude, the quad store numbers should simply be a comparison of
the quad stores themselves, and not the quad stores vs the mappers or
native SQL. This clarification really needs to make its way into the
benchmark narrative.



--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President

Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-24 Thread Paul Gearon

On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>
> On 09/19/08 23:12 +0200, Orri Erling wrote:
>>>
>>> Has there been any analysis on whether there is a *fundamental*
>>> reason for such performance difference? Or is it simply a question of
>>> "maturity"; in other words, relational db technology has been around for a
>>> very long time and is very mature, whereas RDF implementations are still
>>> quite recent, so this gap will surely narrow ...?
>>
>> This is a very complex subject.  I will offer some analysis below, but
>> this I fear will only raise further questions.  This is not the end of the
>> road, far from it.
>
> As far as I understand, another issue is relevant: this benchmark is
> somewhat unfair as the relational stores have one advantage compared to the
> native triple stores: the relational data structure is fixed (Products,
> Producers, Reviews, etc with given columns), while the triple representation
> is generic (arbitrary s,p,o).

This point has an effect on several levels.

For instance, the flexibility afforded by triples means that objects
stored in this structure require processing just to piece it all
together, whereas the RDBMS has already encoded the structure into the
table. Ironically, this is exactly the reason we
(Tucana/Kowari/Mulgara) ended up building an RDF database instead of
building on top of an RDBMS: The flexibility in table structure was
less efficient than a system that just "knew" it only had to deal with
3 columns. Obviously the shape of the data (among other things)
dictates which is the better type of storage to use.

A related point is that processing RDF to create an object means you
have to move around a lot in the graph. This could mean a lot of
seeking on disk, while an RDBMS will usually find the entire object in
one place on the disk. And seeks kill performance.

This leads to the operations used to build objects from an RDF store.
A single object often requires the traversal of several statements,
where the object of one statement becomes the subject of the next.
Since the tables are typically represented as
Subject/Predicate/Object, this means that the main table will be
"joined" against itself. Even RDBMSs are notorious for not doing this
efficiently.
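
To illustrate the shape of that self-join (a minimal sketch; the ex:
property names are invented), a two-hop traversal touches the triple
table twice:

PREFIX ex: <http://example.org/schema#>

# The object of hop 1 becomes the subject of hop 2, so a store keeping
# one Subject/Predicate/Object table evaluates this roughly as:
#   SELECT t2.o FROM triples t1 JOIN triples t2 ON t1.o = t2.s
#   WHERE t1.p = ex:hasReview AND t2.p = ex:reviewer
SELECT ?reviewer
WHERE {
  ?product ex:hasReview ?review .    # hop 1: triples AS t1
  ?review  ex:reviewer  ?reviewer .  # hop 2: triples AS t2
}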

One of the problems with self-joins is that efficient operations like
merge-joins (when they can be identified) will still result in lots of
seeking, since simple iteration on both sides of the join means
seeking around in the same data. Of course, there ARE ways to optimize
some of this, but the various stores are only just starting to get to
these optimizations now.

Relational databases suffer similar problems, but joins are usually
only required for complex structures between different tables, which
can be stored on different spindles. Contrast this to RDF, which needs
to do many of these joins for all but the simplest of data.

> One can question whether such flexibility is relevant in practice, and if
> so, one may try to extract such structured patterns from data on-the-fly.
> Still, it's important to note that we're comparing somewhat different things
> here between the relational and the triple representation of the benchmark.

This is why I think it is very important to consider the type of data
being stored before choosing the type of storage to use. For some
applications an RDBMS is going to win hands down every time. For other
applications, an RDF store is definitely the way to go. Understanding
the flexibility and performance constraints of each is important. This
kind of benchmarking helps with that. It also helps identify where RDF
databases need to pick up their act.

Regards,
Paul Gearon



Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-23 Thread Ivan Mikhailov

Hello Eyal,

> this benchmark is somewhat unfair as the relational stores have one advantage 
> compared to the 
> native triple stores: the relational data structure is fixed (Products, 
> Producers, Reviews, etc with given columns), while the triple 
> representation is generic (arbitrary s,p,o).
> 
> One can question whether such flexibility is relevant in practice, and if 
> so, one may try to extract such structured patterns from data on-the-fly.

That will be our next big extension -- updateable RDF Views, as proposed
in http://esw.w3.org/topic/UpdatingRelationalDataViaSPARUL . So we will
be able to load BSBM data as RDF and query them via a SPARQL web service
endpoint; thus we will masquerade the relational storage entirely.
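
A minimal sketch of what that enables (SPARUL syntax as in Virtuoso; the
graph and ex: vocabulary are invented): an update sent to the SPARQL
endpoint that an updateable view would rewrite into a SQL INSERT on the
underlying relational table:

PREFIX ex: <http://example.org/schema#>

# The inserted triples describe a new review; behind an updateable RDF
# View this would become an INSERT into the relational Reviews table.
INSERT INTO GRAPH <http://example.org/bsbm> {
  <http://example.org/review/9001> a ex:Review ;
      ex:rating 8 .
}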

Best Regards,

Ivan Mikhailov,
OpenLink Software
http://virtuoso.openlinksw.com





Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-23 Thread Eyal Oren


On 09/19/08 23:12 +0200, Orri Erling wrote:

Has there been any analysis on whether there is a *fundamental*
reason for such performance difference? Or is it simply a question of
"maturity"; in other words, relational db technology has been around for
a very long time and is very mature, whereas RDF implementations are
still quite recent, so this gap will surely narrow ...?

This is a very complex subject.  I will offer some analysis below, but
this I fear will only raise further questions.  This is not the end of the
road, far from it.

As far as I understand, another issue is relevant: this benchmark is
somewhat unfair as the relational stores have one advantage compared to
the native triple stores: the relational data structure is fixed
(Products, Producers, Reviews, etc with given columns), while the triple
representation is generic (arbitrary s,p,o).


One can question whether such flexibility is relevant in practice, and if 
so, one may try to extract such structured patterns from data on-the-fly. 
Still, it's important to note that we're comparing somewhat different 
things here between the relational and the triple representation of the 
benchmark. 


 -eyal

PS: the benchmark is great, really, possible improvements notwithstanding.




RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-19 Thread Orri Erling

 

>> ...
>> It is interesting to see:
>>
>> ...
>> 4. that the fastest RDF store is still 7 times slower than a
>> relational database.
>
> Has there been any analysis on whether there is a *fundamental*
> reason for such performance difference?
> Or is it simply a question of "maturity" - in other words, relational db
> technology has been around for a very long time and is very mature,
> whereas RDF implementations are still quite recent, so this gap will
> surely narrow ...?
>
> Cheers
> D





Dan, All

This is a very complex subject.  I will offer some analysis below, but this
I fear will only raise further questions.  This is not the end 
of the road, far from it.



In the BSBM case, we first note that the relational representation, in
this case with Virtuoso, is about 4x more space efficient than the triples
representation. This translates to running in memory in cases where the
triples representation would go to disk.  In the time it takes to get one
page from disk, let's say 5ms, one can do 1000 random accesses in memory.

With Virtuoso 6, the ratio is somewhat more advantageous to triples.
Still, a relational row store, such as MySQL or Virtuoso or pretty much
any other RDBMS, except for the recent crop of column stores, of which we
speak later, can store a non-indexed dependent column in the space it
takes to store the data.  Not everything is a triple, and not everything
gets indexed multiple ways, from 2 to 6 indices, as with triples.

Let us further note that the BSBM report does not really define steady
state.  This is our (OpenLink) principal criticism of the process.
The TPC (Transaction Processing Performance Council) benchmarks make it a
point to eliminate nondeterminism coming from the OS disk cache.  For
OLTP, we run for half an hour first to see that the cache is filled and
then measure for another half hour.  For analytics, the benchmark set is
much larger than memory and the run starts with a full read-through of
the biggest tables, which eliminates any randomness from OS disk caching.
Plus there may be rules for switching off the power, but I would have to
check the papers to be sure of this.

Now, the BSBM data sets are primarily in memory.  However, if we start
with a cold cache at 100M scale with Virtuoso relational, the first run
is 20 times slower than the second run with the identical queries.  If we
shut down the server and redo the identical run, then the performance is
about the same as the faster run, because the OS still caches the disk
pages.

So, especially at larger scales, the BSBM test process simply must ensure
steady state for whatever rates are reported.  The easiest way to do this
is to have a warm-up that is scale factor / 10 query mixes and not a
constant of 32 query mixes, like with the reported results.
Virtuoso SPARQL to SQL mapping performs slower than Virtuoso SQL
primarily because of the increased complexity of query compilation.
However, the numbers published may underestimate this difference because
of not running with a cache in steady state.  In other words, there is
disk latency which penalizes both equally, while this disk latency would
have vanished with another 5 to 10 minutes of running.

But let us talk about triples vs. rows.  The BSBM workload typically
retrieves multiple dependent attributes of a single key.  If these
attributes are all next to each other, as in a relational row store, then
we have a constant time for the extra attribute instead of a log of the
database size.  This favors RDBMS's.  As mentioned before, there are also
columnar RDBMS's, especially for analytics workloads.

These do not have related attributes next to each other, but they can
play tricks relying on dense spacing of row numbers, locality of
reference, compression of homogeneous data, and sparse indices, which are
not as readily applicable to a more unpredictable RDF workload.

This is complex and we do not have space to go deeper into it here.  We
have considered these matters, as we naturally have contemplated making a
column store adaptation of Virtuoso.  We may yet make one.

Then there is the element of dissimilar semantics between SPARQL and SQL.
Q6, which is basically a SQL LIKE full table scan, is especially unfair
to triple stores.  Even in the SQL world, this would be done using a text
index in any halfway serious RDBMS, since they all have a full text
predicate.  This hurts RDF seriously but inconveniences SQL somewhat less
because of locality of reference and the fact that LIKE has a simpler
semantics than regexp.  Further, when the scale factor goes to infinity,
the ratio of Q6 over all queries goes to unity.  In other words, the
composition of the metric is not scale independent:  if the scale is
large, Q6 is the only thing that matters, and it is furthermore a thing
that really shows triples at their worst.
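
For reference, Q6 is roughly of this shape (paraphrased, not copied from
the spec; the search term is made up), and the comment shows the kind of
full-text predicate meant above:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# The regex FILTER forces a scan over every label.  With a text index
# the same intent could use a full-text predicate instead, e.g.
# Virtuoso's  ?label bif:contains "wolf"  .
SELECT ?product ?label
WHERE {
  ?product rdfs:label ?label .
  FILTER regex(?label, "wolf")
}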



This is recognized by the authors and addressed by dropping Q6 out of the
metric in some cases, but this is still not entirely satisfactory sinc

Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-19 Thread Daniel Schwabe


Hi,

On 17/09/2008 11:53, Chris Bizer wrote:


Hi all,

over the last weeks, we have extended the Berlin SPARQL Benchmark 
(BSBM) to a multi-client scenario, fine-tuned the benchmark dataset 
and the query mix, and implemented a SQL version of the benchmark in 
order to be able to compare SPARQL stores with classical SQL stores.


Today, we have released the results of running the BSBM Benchmark 
Version 2 against:


+ three RDF stores (Virtuoso Version 5.0.8, Sesame Version 2.2, Jena 
TDB Version 0.53) and
+ two relational database-to-RDF wrappers (D2R Server Version 0.4 and 
Virtuoso - RDF Views Version 5.0.8).


for datasets ranging from 250,000 triples to 100,000,000 triples.

In order to set the SPARQL query performance into context we also 
report the results of running the SQL version of the benchmark against 
two relational database management systems (MySQL 5.1.26 and Virtuoso 
- RDBMS Version 5.0.8).


...
It is interesting to see:

...
4. that the fastest RDF store is still 7 times slower than a 
relational database.

Has there been any analysis on whether there is a *fundamental*
reason for such performance difference?
Or is it simply a question of "maturity" - in other words, relational db
technology has been around for a very long time and is very mature,
whereas RDF implementations are still quite recent, so this gap will
surely narrow ...?


Cheers
D

--
Daniel Schwabe
Tel:+55-21-3527 1500 r. 4356
Fax: +55-21-3527 1530
http://www.inf.puc-rio.br/~dschwabe Dept. de Informatica, PUC-Rio
R. M. de S. Vicente, 225
Rio de Janeiro, RJ 22453-900, Brasil




Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

2008-09-17 Thread Chris Bizer


Hi all,

over the last weeks, we have extended the Berlin SPARQL Benchmark 
(BSBM) to a multi-client scenario, fine-tuned the benchmark dataset 
and the query mix, and implemented a SQL version of the benchmark in 
order to be able to compare SPARQL stores with classical SQL stores.


Today, we have released the results of running the BSBM Benchmark 
Version 2 against:


+ three RDF stores (Virtuoso Version 5.0.8, Sesame Version 2.2, Jena 
TDB Version 0.53) and
+ two relational database-to-RDF wrappers (D2R Server Version 0.4 and 
Virtuoso - RDF Views Version 5.0.8).


for datasets ranging from 250,000 triples to 100,000,000 triples.

In order to set the SPARQL query performance into context we also 
report the results of running the SQL version of the benchmark against 
two relational database management systems (MySQL 5.1.26 and 
Virtuoso - RDBMS Version 5.0.8).


A comparison of the performance for a single client working against 
the stores is found here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#comparison

A comparison of the performance for 1 to 16 clients simultaneously 
executing query mixes against the stores is found here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#multiResults

The complete benchmark results including the setup of the experiment 
and the configuration of the different stores is found here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html

The current specification of the Berlin SPARQL Benchmark is found 
here:


http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/20080912/

It is interesting to see:

1. that relational database to RDF wrappers generally outperform RDF 
stores for larger dataset sizes.
2. that no store outperforms the others for all queries and dataset 
sizes.
3. that the query throughput still varies widely within the 
multi-client scenario.
4. that the fastest RDF store is still 7 times slower than a 
relational database.


Thanks a lot to

+ Eli Lilly and Company and especially Susie Stephens for making this 
work possible through a research grant.
+ Orri Erling, Andy Seaborne, Arjohn Kampman, Michael Schmidt, Richard 
Cyganiak, Ivan Mikhailov, Patrick van Kleef, and Christian Becker for 
their feedback on the benchmark design and their help with configuring 
the stores and running the benchmark experiment.


Without all your help it would not have been possible to conduct this
experiment.


We highly welcome feedback on the benchmark design and the results of 
the experiment.


Cheers,

Chris Bizer and Andreas Schultz

--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de