Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
François-Paul Servant wrote:
> Kingsley Idehen wrote:
>> Chris Bizer wrote:
>>> ... Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that do data warehousing for years and they all seem to be very interested in this quick and dirty approach.
>>
>> "Quick & Dirty" is simply not how I would characterize this matter. I prefer to describe this as step 1 in a multi-phased approach to RDF-based data integration.
>
> Kingsley, I agree. I find it clean to make the basic but necessary first steps towards data integration (identification of things, mapping, etc.). Building upon legacy systems, that you just have to adapt, doesn't make the approach dirty: it makes it possible ;-)

Francois,

Yes. In my world view, "quick & dirty" is synonymous with an inherent inability to scale across quality vectors such as:

1. Integration use-case complexity
2. Data volume growth
3. Concurrent usage growth
4. Deployment target growth

When I encounter "quick & dirty" associated with anything in the technology realm, the items above set off deafening alarm bells in my head :-)

Kingsley

> cheers, fps

--
Regards,

Kingsley Idehen
Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software
Web: http://www.openlinksw.com
Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Kingsley Idehen wrote:
> Chris Bizer wrote:
>> ... Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that do data warehousing for years and they all seem to be very interested in this quick and dirty approach.
>
> "Quick & Dirty" is simply not how I would characterize this matter. I prefer to describe this as step 1 in a multi-phased approach to RDF-based data integration.

Kingsley, I agree. I find it clean to make the basic but necessary first steps towards data integration (identification of things, mapping, etc.). Building upon legacy systems, that you just have to adapt, doesn't make the approach dirty: it makes it possible ;-)

cheers, fps
Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Chris Bizer wrote:
> Hi Orri,
>
>> It is my feeling that RDF has a dual role:
>> 1. interchange format: This is like what XML does, except that RDF has more semantics and expressivity.
>> 2. database storage format for cases where data must be integrated and is too heterogeneous to easily fall into one relational schema. This is for example the case in the open web conversation and social space. The first case is for mapping, the second for warehousing.
>> Aside from this, there is potential for more expressive queries through the query language dealing with inferencing, like subclass/subproperty/transitive etc. These do not go very well with SQL views.
>
> I cannot agree more with what you say :-) We are seeing the first RDF use case emerge within initiatives like the Linking Open Data effort, where, besides being more expressive, RDF is also playing to its strength of providing data links between records in different databases.

Chris,

> Talking with people from industry, I get the feeling that more and more people also understand the second use case and that RDF is increasingly used as a technology for something like "poor man's data integration". You don't have to spend a lot of time and money on designing a comprehensive data warehouse. You just throw data having different schemata from different sources together and instantly get the benefit that you can browse and query the data and that you have proper provenance tracking (using Named Graphs). Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that do data warehousing for years and they all seem to be very interested in this quick and dirty approach.

"Quick & Dirty" is simply not how I would characterize this matter. I prefer to describe this as step 1 in a multi-phased approach to RDF-based data integration.

> For both use cases, inferencing is a nice add-on but not essential. Within the first use case, inferencing usually does not work as data published by various autonomous sources tends to be too dirty for reasoning engines.

Inferencing is not a nice add-on; it is essential (in varying degrees) once you get beyond the initial stages of heterogeneous data integration. As with all things, these matters are connected and inherently symbiotic: you can't inference without having something you want to reason about available in palatable form, which goes back to the phased approach I refer to above.

In my eyes, and experience, RDF is a powerful vehicle for implementing conceptual-level data access that sits atop heterogeneous data sources. Its novelty comes from the platform independence that it injects into the data integration technology realm.

Kingsley

> Cheers,
>
> Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: [EMAIL PROTECTED]
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

From Henry Story:
> As a matter of interest, would it be possible to develop RDF stores that optimize the layout of the data by analyzing the queries to the database? A bit like a Java Just-In-Time compiler analyses the usage of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:
> On a similar note, by mining the query logs it would be possible to create parameterised queries and associated plan fragments without the client needing to notify the server of the templates. Coupled with automatically calculating possible materialized views or other layout optimizations, the poor, overworked client application writer doesn't get drawn into optimizing the server.
>
> Andy

Orri here: With the BSBM workload, using parameterized queries at small scale saves roughly 1/3 of the execution time. It is possible to remember query plans and to notice if the same query text is submitted with only changes in literal values. If the first query ran quickly, one may presume the query with substitutions will also run quickly. There are of course exceptions, but detecting these would mean running most of the optimizer cost model and would eliminate any benefit from caching.

The other optimizations suggested have a larger upside but are far harder. I would say that if we have a predictable workload, then mapping relational to RDF is a lot easier than expecting the DBMS to figure out materialized views to do the same. If we do not have a predictable workload, then making too many materialized views based on transient usage patterns has a large downside because it grows the database, meaning less working set. The difference between an in-memory random access and a random access going to disk is about a factor of 5000.
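To make the Named Graphs point from Chris's message concrete, here is a minimal SPARQL sketch of "load first, integrate later"; the graph names and the label property are illustrative, not taken from the benchmark:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Each source is loaded into its own named graph, so the graph URI doubles
# as a provenance record; a single pattern then queries across all sources
# and reports where each answer came from.
SELECT ?product ?label ?source
WHERE {
  GRAPH ?source {
    ?product rdfs:label ?label .
  }
}
# Identity resolution can be layered on later (e.g. owl:sameAs links kept
# in a separate mapping graph) without touching the source graphs.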
Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hi Orri,

> It is my feeling that RDF has a dual role:
> 1. interchange format: This is like what XML does, except that RDF has more semantics and expressivity.
> 2. database storage format for cases where data must be integrated and is too heterogeneous to easily fall into one relational schema. This is for example the case in the open web conversation and social space. The first case is for mapping, the second for warehousing.
> Aside from this, there is potential for more expressive queries through the query language dealing with inferencing, like subclass/subproperty/transitive etc. These do not go very well with SQL views.

I cannot agree more with what you say :-)

We are seeing the first RDF use case emerge within initiatives like the Linking Open Data effort, where, besides being more expressive, RDF is also playing to its strength of providing data links between records in different databases.

Talking with people from industry, I get the feeling that more and more people also understand the second use case and that RDF is increasingly used as a technology for something like "poor man's data integration". You don't have to spend a lot of time and money on designing a comprehensive data warehouse. You just throw data having different schemata from different sources together and instantly get the benefit that you can browse and query the data and that you have proper provenance tracking (using Named Graphs). Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that do data warehousing for years and they all seem to be very interested in this quick and dirty approach.

For both use cases, inferencing is a nice add-on but not essential. Within the first use case, inferencing usually does not work as data published by various autonomous sources tends to be too dirty for reasoning engines.

Cheers,

Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: [EMAIL PROTECTED]
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

From Henry Story:
> As a matter of interest, would it be possible to develop RDF stores that optimize the layout of the data by analyzing the queries to the database? A bit like a Java Just-In-Time compiler analyses the usage of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:
> On a similar note, by mining the query logs it would be possible to create parameterised queries and associated plan fragments without the client needing to notify the server of the templates. Coupled with automatically calculating possible materialized views or other layout optimizations, the poor, overworked client application writer doesn't get drawn into optimizing the server.
>
> Andy

Orri here: With the BSBM workload, using parameterized queries at small scale saves roughly 1/3 of the execution time. It is possible to remember query plans and to notice if the same query text is submitted with only changes in literal values. If the first query ran quickly, one may presume the query with substitutions will also run quickly. There are of course exceptions, but detecting these would mean running most of the optimizer cost model and would eliminate any benefit from caching.

The other optimizations suggested have a larger upside but are far harder. I would say that if we have a predictable workload, then mapping relational to RDF is a lot easier than expecting the DBMS to figure out materialized views to do the same. If we do not have a predictable workload, then making too many materialized views based on transient usage patterns has a large downside because it grows the database, meaning less working set. The difference between an in-memory random access and a random access going to disk is about a factor of 5000. Plus there is a high cost to making the views, thus a high penalty for a wrong guess. And if it is hard enough to figure out where a query plan goes wrong with a given schema, it is harder still to figure it out with a schema that morphs by itself. In the RDB world, for example, Oracle recommends saving optimizer statistics from the test environment and using these in the production environment, just so the optimizer does not get creative. Now this is the essence of wisdom for OLTP, but we are not talking OLTP with RDF.

If there is a history of usage, and this history is steady and the DBA can confirm it as being a representative sample, then automatic materializing of joins is a real possibility. Doing this spontaneously would lead to erratic response times, though. For anything online, the accent i
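Orri's plan-caching observation is easy to picture against the BSBM query templates: successive queries differ only in literals. A hedged sketch, loosely after BSBM Q1 (treat the property name as illustrative; the exact templates are in the benchmark spec):

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>

# First execution: the store compiles a plan via the full cost model.
SELECT ?product WHERE {
  ?product bsbm:productPropertyNumeric1 ?value .
  FILTER (?value > 100)
}

# Identical text except the literal: a store that caches plans keyed on the
# literal-normalized query text can skip compilation (the ~1/3 saving noted
# above), at the risk that the new literal's selectivity makes the cached
# plan a poor fit.
SELECT ?product WHERE {
  ?product bsbm:productPropertyNumeric1 ?value .
  FILTER (?value > 500)
}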
Re: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Chris Bizer wrote:
> Hi Kingsley and Paul,
>
> Yes, I completely agree with you that different storage solutions fit different use cases and that one of the main strengths of the RDF data model is its flexibility and the possibility to mix different schemata. Nevertheless, I think it is useful to give application developers an indicator of what performance they can expect when they choose a specific architecture, which is what the benchmark is trying to do.

Chris,

Yes, but the user profile has to be a little clearer. If you separate the results in the narrative, you achieve the goal. You can use SQL numbers as a sort of benchmark if you clearly explain the nature of the skew that SQL enjoys due to the nature of the schema.

> We plan to run the benchmark again in January and it would be great to also test Tucana/Kowari/Mulgara in this run. As the performance of RDF stores is constantly improving, let's also hope that the picture will not look that bad for them anymore then.

But at the current time, there is no clear sense of what "better" means :-) What's the goal?

What I fundamentally take from the benchmarks is the following:

1. Native RDF and RDF Views/Mapper scalability is becoming less of an issue (of course depending on your choice of product), and we are already at the point where this technology can be used for real-world solutions that have enterprise-level scalability demands and expectations.

2. It's impractical to create RDF warehouses from existing SQL Data Sources when you can put RDF Views / Wrappers in front of the SQL Data Sources (SQL cost-optimization technology has evolved significantly over the years across RDBMS engines).

And yes, I would also like to see Mulgara and other RDF Stores in the next round of benchmarks :-)

Kingsley

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kingsley Idehen
Sent: Wednesday, 24 September 2008 20:57
To: Paul Gearon
Cc: [EMAIL PROTECTED]; public-lod@w3.org
Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

Paul Gearon wrote:
> On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>> On 09/19/08 23:12 +0200, Orri Erling wrote:
>>>> Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity"; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?
>>>
>>> This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it.
>>
>> As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o).
>
> This point has an effect on several levels.
>
> For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: the flexibility in table structure was less efficient than a system that just "knew" it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates what is the better type of storage to use.
>
> A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance.
>
> This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be "joined" against itself. Even RDBMSs are notorious for not doing this efficiently.
>
> One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data.
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
On 09/24/08 23:17 +0200, Story Henry wrote:
> As a matter of interest, would it be possible to develop RDF stores that optimize the layout of the data by analyzing the queries to the database? A bit like a Java Just-In-Time compiler analyses the usage of the classes in order to decide how to optimize the compilation.

RDFBroker, by Sintek and Kiesel, did something like this, but analysing the data instead of the queries; see [1]. See also the various papers on vertical partitioning.

-eyal

[1] http://www.springerlink.com/content/q313416g113n2257/
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
As a matter of interest, would it be possible to develop RDF stores that optimize the layout of the data by analyzing the queries to the database? A bit like a Java Just-In-Time compiler analyses the usage of the classes in order to decide how to optimize the compilation.

Henry

On 24 Sep 2008, at 20:30, Paul Gearon wrote:
> A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance.
>
> This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be "joined" against itself. Even RDBMSs are notorious for not doing this efficiently.
AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hi Kingsley and Paul,

Yes, I completely agree with you that different storage solutions fit different use cases and that one of the main strengths of the RDF data model is its flexibility and the possibility to mix different schemata.

Nevertheless, I think it is useful to give application developers an indicator of what performance they can expect when they choose a specific architecture, which is what the benchmark is trying to do.

We plan to run the benchmark again in January and it would be great to also test Tucana/Kowari/Mulgara in this run. As the performance of RDF stores is constantly improving, let's also hope that the picture will not look that bad for them anymore then.

Cheers,

Chris

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Kingsley Idehen
Sent: Wednesday, 24 September 2008 20:57
To: Paul Gearon
Cc: [EMAIL PROTECTED]; public-lod@w3.org
Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

Paul Gearon wrote:
> On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>> On 09/19/08 23:12 +0200, Orri Erling wrote:
>>>> Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity"; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?
>>>
>>> This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it.
>>
>> As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o).
>
> This point has an effect on several levels.
>
> For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: the flexibility in table structure was less efficient than a system that just "knew" it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates what is the better type of storage to use.
>
> A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance.
>
> This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be "joined" against itself. Even RDBMSs are notorious for not doing this efficiently.
>
> One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now.
>
> Relational databases suffer similar problems, but joins are usually only required for complex structures between different tables, which can be stored on different spindles. Contrast this to RDF, which needs to do many of these joins for all but the simplest of data.
>
>> One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark.
>
> This is why I think it is very important to consider the type of data being stored before choosing the type of storage to use. For some applications an RDBMS is going to win hands down every time. For other applications, an RDF store is definitely the way to go.
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Paul Gearon wrote:
> On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>> On 09/19/08 23:12 +0200, Orri Erling wrote:
>>>> Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity"; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?
>>>
>>> This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it.
>>
>> As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o).
>
> This point has an effect on several levels.
>
> For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: the flexibility in table structure was less efficient than a system that just "knew" it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates what is the better type of storage to use.
>
> A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance.
>
> This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be "joined" against itself. Even RDBMSs are notorious for not doing this efficiently.
>
> One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now.
>
> Relational databases suffer similar problems, but joins are usually only required for complex structures between different tables, which can be stored on different spindles. Contrast this to RDF, which needs to do many of these joins for all but the simplest of data.
>
>> One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark.
>
> This is why I think it is very important to consider the type of data being stored before choosing the type of storage to use. For some applications an RDBMS is going to win hands down every time. For other applications, an RDF store is definitely the way to go. Understanding the flexibility and performance constraints of each is important. This kind of benchmarking helps with that. It also helps identify where RDF databases need to pick up their act.
>
> Regards,
> Paul Gearon

Paul,

You make valid points. The problem here is that the benchmark has been released without enough clarity about its prime purpose. To even compare RDF Quad Stores with an RDBMS engine when the schema is relational is in itself kinda twisted. The role of the mappers (D2RQ & Virtuoso RDF Views), for instance, should have been made much clearer, maybe in separate results tables. I say this because these mappers offer different approaches to projecting RDBMS-based data in RDF Linked Data form, on the fly, and their purpose in this benchmark is all about raw performance and scalability under the following RDF Linked Data generation and deployment conditions:

1. Schema is Relational
2. RDF warehouse is impractical

As I am sure you know, we could invert this whole benchmark "Open World" style, and then bring RDBMS engines to their knees by incorporating SPARQL query patterns comprised of ?p's and subclasses (see the sketch after this message).

To conclude, the quad store numbers should simply be a comparison of the quad stores themselves, and not the quad stores vs. the mappers or native SQL. This clarification really needs to make its way into the benchmark narrative.

--
Regards,

Kingsley Idehen
Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software
Web: http://www.openlinksw.com
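For illustration, the kind of "Open World" pattern Kingsley alludes to: a variable in the predicate position plus subclass patterns is one line of SPARQL, but has no fixed-schema SQL translation, so a relational engine or mapper has to union everything it knows. The class URI here is hypothetical:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Every property (?p is unbound) of every instance of any subclass of a
# given class -- natural for a quad store, hostile to a fixed schema.
SELECT ?s ?p ?o
WHERE {
  ?class rdfs:subClassOf <http://example.com/vocab#Product> .
  ?s rdf:type ?class .
  ?s ?p ?o .
}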
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <[EMAIL PROTECTED]> wrote:
>
> On 09/19/08 23:12 +0200, Orri Erling wrote:
>>>
>>> Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity"; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?
>>
>> This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it.
>
> As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o).

This point has an effect on several levels.

For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: the flexibility in table structure was less efficient than a system that just "knew" it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates what is the better type of storage to use.

A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance.

This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be "joined" against itself. Even RDBMSs are notorious for not doing this efficiently.

One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now.

Relational databases suffer similar problems, but joins are usually only required for complex structures between different tables, which can be stored on different spindles. Contrast this to RDF, which needs to do many of these joins for all but the simplest of data.

> One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark.

This is why I think it is very important to consider the type of data being stored before choosing the type of storage to use. For some applications an RDBMS is going to win hands down every time. For other applications, an RDF store is definitely the way to go.

Understanding the flexibility and performance constraints of each is important. This kind of benchmarking helps with that. It also helps identify where RDF databases need to pick up their act.

Regards,
Paul Gearon
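Paul's traversal point in query form: each triple pattern hits the same statement table, and the object of one pattern is the subject of the next. A sketch with BSBM-like names (the exact properties should be treated as illustrative):

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Three patterns over one conceptual S/P/O table: a triple store (or a
# triples table in an RDBMS) answers this with two self-joins, where a
# relational schema would read one Review row and one Producer row.
SELECT ?review ?producerName
WHERE {
  ?review   bsbm:reviewFor ?product .      # object ?product ...
  ?product  bsbm:producer  ?producer .     # ... becomes the next subject
  ?producer rdfs:label     ?producerName . # ... and again
}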
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hello Eyal,

> this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o).
>
> One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly.

That will be our next big extension -- updatable RDF Views, as proposed in http://esw.w3.org/topic/UpdatingRelationalDataViaSPARUL . So we will be able to load BSBM data as RDF and query it via a SPARQL web service endpoint; thus the relational storage will be masked entirely.

Best Regards,

Ivan Mikhailov,
OpenLink Software
http://virtuoso.openlinksw.com
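A sketch of what such an updatable view would permit, in the SPARUL syntax of the proposal Ivan links to (the graph URI, subject, and properties are hypothetical): the mapper would translate the graph-level insert into INSERTs/UPDATEs on the underlying relational tables, so the SPARQL client never sees the SQL side.

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# SPARUL-style insert into a graph that is really an RDF View over tables.
INSERT INTO <http://example.com/bsbm> {
  <http://example.com/Product999> rdfs:label "A new product" ;
                                  bsbm:productPropertyNumeric1 42 .
}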
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
On 09/19/08 23:12 +0200, Orri Erling wrote:
>> Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity"; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?
>
> This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it.

As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o).

One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark.

-eyal

PS: the benchmark is great, really, possible improvements notwithstanding.
RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
>> ...
>> It is interesting to see:
>> ...
>> 4. that the fastest RDF store is still 7 times slower than a relational database.
>
> Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity" - in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?
>
> Cheers
> D

Dan, All

This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it.

In the BSBM case, we first note that the relational representation, in this case with Virtuoso, is about 4x more space efficient than the triples representation. This translates to running in memory in cases where the triples representation would go to disk. In the time it takes to get one page from disk, let's say 5 ms, one can do about 1000 random accesses in memory. With Virtuoso 6, the ratio is somewhat more advantageous to triples. Still, a relational row store, such as MySQL or Virtuoso or pretty much any other RDBMS (except for the recent crop of column stores, of which we speak later), can store a non-indexed dependent column in the space it takes to store the data. Not everything is a triple, and not everything gets indexed multiple ways, from 2 to 6 indices, as with triples.

Let us further note that the BSBM report does not really define steady state. This is our (OpenLink) principal criticism of the process. The TPC (Transaction Processing Performance Council) benchmarks make it a point to eliminate nondeterminism coming from the OS disk cache. For OLTP, we run for half an hour first to see that the cache is filled and then measure for another half hour. For analytics, the benchmark set is much larger than memory and the run starts with a full read-through of the biggest tables, which eliminates any randomness from OS disk caching. Plus there may be rules for switching off the power, but I would have to check the papers to be sure of this.

Now, the BSBM data sets are primarily in memory. However, if we start with a cold cache at 100M scale with Virtuoso relational, the first run is 20 times slower than the second run with the identical queries. If we shut down the server and redo the identical run, then the performance is about the same as the faster run, because the OS still caches the disk pages. So, especially at larger scales, the BSBM test process simply must ensure steady state for whatever rates are reported. The easiest way to do this is to have a warm-up of scale factor / 10 query mixes, and not a constant of 32 query mixes as with the reported results.

Virtuoso SPARQL-to-SQL mapping performs slower than Virtuoso SQL primarily because of the increased complexity of query compilation. However, the numbers published may underestimate this difference because of not running with a cache in steady state. In other words, there is disk latency which penalizes both equally, while this disk latency would have vanished with another 5 to 10 minutes of running.

But let us talk about triples vs. rows. The BSBM workload typically retrieves multiple dependent attributes of a single key. If these attributes are all next to each other, as in a relational row store, then we have a constant cost for each extra attribute instead of a log of the database size. This favors RDBMSs. As mentioned before, there are also columnar RDBMSs, especially for analytics workloads. These do not have related attributes next to each other, but they can play tricks relying on dense spacing of row numbers, locality of reference, compression of homogeneous data, and sparse indices, which are not as readily applicable to a more unpredictable RDF workload. This is complex and we do not have space to go deeper into it here. We have considered these, as we naturally have contemplated making a column-store adaptation of Virtuoso. We may yet make one.

Then there is the element of dissimilar semantics between SPARQL and SQL. Q6, which is basically a SQL LIKE full table scan, is especially unfair to triple stores. Even in the SQL world, this would be done using a text index in any halfway serious RDBMS, since they all have a full-text predicate. This hurts RDF seriously but inconveniences SQL somewhat less, because of locality of reference and the fact that LIKE has a simpler semantics than regexp. Further, when the scale factor goes to infinity, the ratio of Q6 over all queries goes to unity. In other words, the composition of the metric is not scale independent: if the scale is large, Q6 is the only thing that matters, and it is furthermore a thing that really shows triples at their worst. This is recognized by the authors and addressed by dropping Q6 out of the metric in some cases, but this is still not entirely satisfactory sinc
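For reference, the shape of Q6 described above, paraphrased from the BSBM spec (the search word is a token the data generator picks at random):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>

# A regex FILTER over every product label: with no usable index a triple
# store scans all label triples, while SQL LIKE over one table column at
# least gets locality of reference -- and a full-text index beats both.
SELECT ?product ?label
WHERE {
  ?product rdfs:label ?label .
  ?product a bsbm:Product .
  FILTER regex(?label, "WORD")   # "WORD" stands in for the generator's word
}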
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hi,

On 17/09/2008 11:53, Chris Bizer wrote:
> Hi all,
>
> over the last weeks, we have extended the Berlin SPARQL Benchmark (BSBM) to a multi-client scenario, fine-tuned the benchmark dataset and the query mix, and implemented a SQL version of the benchmark in order to be able to compare SPARQL stores with classical SQL stores.
>
> Today, we have released the results of running the BSBM Benchmark Version 2 against:
>
> + three RDF stores (Virtuoso Version 5.0.8, Sesame Version 2.2, Jena TDB Version 0.53) and
> + two relational database-to-RDF wrappers (D2R Server Version 0.4 and Virtuoso - RDF Views Version 5.0.8)
>
> for datasets ranging from 250,000 triples to 100,000,000 triples. In order to set the SPARQL query performance into context, we also report the results of running the SQL version of the benchmark against two relational database management systems (MySQL 5.1.26 and Virtuoso - RDBMS Version 5.0.8).
> ...
> It is interesting to see:
> ...
> 4. that the fastest RDF store is still 7 times slower than a relational database.

Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of "maturity" - in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...?

Cheers
D

--
Daniel Schwabe
Tel: +55-21-3527 1500 r. 4356
Fax: +55-21-3527 1530
http://www.inf.puc-rio.br/~dschwabe
Dept. de Informatica, PUC-Rio
R. M. de S. Vicente, 225
Rio de Janeiro, RJ 22453-900, Brasil
Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hi all,

over the last weeks, we have extended the Berlin SPARQL Benchmark (BSBM) to a multi-client scenario, fine-tuned the benchmark dataset and the query mix, and implemented a SQL version of the benchmark in order to be able to compare SPARQL stores with classical SQL stores.

Today, we have released the results of running the BSBM Benchmark Version 2 against:

+ three RDF stores (Virtuoso Version 5.0.8, Sesame Version 2.2, Jena TDB Version 0.53) and
+ two relational database-to-RDF wrappers (D2R Server Version 0.4 and Virtuoso - RDF Views Version 5.0.8)

for datasets ranging from 250,000 triples to 100,000,000 triples. In order to set the SPARQL query performance into context, we also report the results of running the SQL version of the benchmark against two relational database management systems (MySQL 5.1.26 and Virtuoso - RDBMS Version 5.0.8).

A comparison of the performance for a single client working against the stores is found here:
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#comparison

A comparison of the performance for 1 to 16 clients simultaneously executing query mixes against the stores is found here:
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#multiResults

The complete benchmark results, including the setup of the experiment and the configuration of the different stores, are found here:
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html

The current specification of the Berlin SPARQL Benchmark is found here:
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/20080912/

It is interesting to see:

1. that relational database-to-RDF wrappers generally outperform RDF stores for larger dataset sizes.
2. that no store outperforms the others for all queries and dataset sizes.
3. that the query throughput still varies widely within the multi-client scenario.
4. that the fastest RDF store is still 7 times slower than a relational database.

Thanks a lot to

+ Eli Lilly and Company, and especially Susie Stephens, for making this work possible through a research grant.
+ Orri Erling, Andy Seaborne, Arjohn Kampman, Michael Schmidt, Richard Cyganiak, Ivan Mikhailov, Patrick van Kleef, and Christian Becker for their feedback on the benchmark design and their help with configuring the stores and running the benchmark experiment. Without all your help it would not have been possible to conduct this experiment.

We highly welcome feedback on the benchmark design and the results of the experiment.

Cheers,

Chris Bizer and Andreas Schultz

--
Prof. Dr. Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 55509
Mail: [EMAIL PROTECTED]
Web: www.bizer.de