Re: [CODE4LIB] triple stores ???
Hey Ravi,

I actually learned about TinkerPop from a posting on this list from Brian Tingle. I started playing with it and eventually started working with it for the digital repository we're building. I began with the Sail extension, but scaled back to a non-RDF model after realizing RDF wasn't really a requirement our users were asking for... simply using it as a graph db makes things go a lot quicker. Of course, we're planning on revisiting the RDF-based model in the future.

In TinkerPop/Neo4j, there's ACID transaction support and there's a bulk loading utility. However, there is no transaction support in the bulk loader... also, I think Sail has kinda different/weird transaction support, but I think you can override that in TinkerPop somehow...

b,chris.

On Wed, May 30, 2012 at 2:29 AM, Simon Spero sesunc...@gmail.com wrote:

The latest version of Jena TDB adds atomic transactions (version 0.9.0+). See http://jena.apache.org/documentation/tdb/tdb_transactions.html for documentation. The following limitations are listed:

- Bulk loads: the TDB bulk loader is not transactional.
- Nested transactions are not supported.
- Some active transaction state is held exclusively in-memory, limiting scalability.
- Long-running transactions: read-transactions cause a build-up of pending changes.
- If a single read transaction runs for a long time when there are many updates, the system will consume a lot of temporary resources.
Re: [CODE4LIB] triple stores ???
Maybe a G search can help to find comparisons: http://www.google.com/search?sugexp=chrome,mod=4sourceid=chromeie=UTF-8q=4store+Virtuoso+Jena+SDB++Mulgara

The result includes your post... added 8 minutes ago.

Stefano

On 29 May 2012, at 09:12, Ravi Shankar wrote:

We (DLSS at Stanford Libraries) are planning to use a triple store for storing and retrieving annotations (in RDF) on digital objects. We are currently looking at open-source triple stores such as 4store, Virtuoso, Jena SDB and Mulgara.

Are you currently using a triple store or contemplating using one? How would you evaluate 'your' triple store along the lines of 1) ease of setup, 2) scalability, 3) query performance, 4) bulk load performance, 5) access api, 6) documentation and 7) community support?

Highly appreciate your thoughts, ideas and suggestions.

Thanks,
Ravi Shankar
Re: [CODE4LIB] triple stores ???
Hi Ravi,

Yeah, if you haven't seen it yet, take a look at the first link (http://www.w3.org/wiki/LargeTripleStores) in the search results that Stefano included.

A big question is whether you're going to need reasoning capabilities. If that's the case, you'll probably want to look at the first 3 in that list (Allegro from Franz, BigOWLIM from Ontotext, and Virtuoso from OpenLink). These are kinda the big 3 in terms of addressing the reasoning, large scalability, and performance issues.

If you don't really need the built-in reasoning or have huge needs in scalability/performance, I'd really recommend having a look at TinkerPop. For RDF capabilities, you'll probably need to use the TinkerPop Sail implementation (check out this blog post: http://architects.dzone.com/articles/visualizing-rdf-schema). TinkerPop has a pretty good community around it and Neo4j has an excellent community. The setup is fairly easy, although the Sail stuff does have a learning curve. Performance seems very good, and using TinkerPop/Neo4j is great because it offers a variety of querying options.

That all said, I've found performance for triple stores can be really hard to measure, since how you store your data and how you're querying can make all the difference... so often it's not the software, but the way the data model has been designed.

good luck!

b,chris.

On Tue, May 29, 2012 at 10:01 AM, Stefano Bargioni bargi...@pusc.it wrote:

Maybe a G search can help to find comparisons: http://www.google.com/search?sugexp=chrome,mod=4sourceid=chromeie=UTF-8q=4store+Virtuoso+Jena+SDB++Mulgara

The result includes your post... added 8 minutes ago.

Stefano
Re: [CODE4LIB] triple stores ???
Thanks, Stefano. The Europeana report seems to be quite comprehensive. It is funny that I searched earlier for triple store comparisons with more explicit parameters ('rdf triple store comparison'), and the Europeana report appeared on the third page of the search results. The 'triple' in the search seems to be the culprit -- a clear need for more semantics in the search engine ;)

Cheers,
Ravi

On May 29, 2012, at 1:01 AM, Stefano Bargioni wrote:

Maybe a G search can help to find comparisons: http://www.google.com/search?sugexp=chrome,mod=4sourceid=chromeie=UTF-8q=4store+Virtuoso+Jena+SDB++Mulgara

The result includes your post... added 8 minutes ago.

Stefano
Re: [CODE4LIB] triple stores ???
For those using these big triplestores, how are you putting data in? I'm looking for a triplestore which supports SPARQL Update. Any comments anyone can add on this interface would be useful.

Ethan

On May 29, 2012 4:12 PM, Ravi Shankar rshan...@stanford.edu wrote:

Thanks, Stefano. The Europeana report seems to be quite comprehensive. It is funny that I searched earlier for triple store comparisons with more explicit parameters ('rdf triple store comparison'), and the Europeana report appeared on the third page of the search results. The 'triple' in the search seems to be the culprit -- a clear need for more semantics in the search engine ;)

Cheers,
Ravi
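As a rough sketch of what loading data through SPARQL Update can look like: the SPARQL 1.1 Protocol allows an update to be sent as an HTTP POST with the request text in an `update` form parameter. The endpoint URL, the example.org IRIs and the helper names below are assumptions made up for illustration; real endpoint paths and authentication differ from store to store. The request is only built here, not sent, and IRIs are not escaped.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Hypothetical endpoint; check your store's documentation for the real path.
ENDPOINT = "http://localhost:3030/annotations/update"


def ntriple(s, p, o):
    """Format one triple whose three terms are all IRIs."""
    return f"<{s}> <{p}> <{o}> ."


def insert_data(triples):
    """Build a SPARQL Update 'INSERT DATA' request from (s, p, o) IRI tuples."""
    body = "\n".join("  " + ntriple(*t) for t in triples)
    return "INSERT DATA {\n" + body + "\n}"


def build_update_request(endpoint, update):
    """Wrap the update text in a form-encoded POST, without sending it."""
    data = urlencode({"update": update}).encode("utf-8")
    return Request(
        endpoint,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


update = insert_data([
    ("http://example.org/anno/1",
     "http://example.org/vocab/annotates",
     "http://example.org/object/42"),
])
request = build_update_request(ENDPOINT, update)
# urllib.request.urlopen(request) would send it to a live endpoint.
```

Sending one `INSERT DATA` per batch of triples keeps each HTTP request small; whether a batch is applied atomically depends on the store.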
Re: [CODE4LIB] triple stores ???
Hi Chris,

We are currently leaning towards open-source triple stores. As far as inferencing goes, I suspect we will be doing at least transitive closures on rdfs:subClassOf and rdfs:subPropertyOf properties. I will look into TinkerPop. Are you currently using it, and for what purpose? I am also curious about what types of data model changes you had to make to improve your RDF store's performance.

Thanks,
Ravi

On May 29, 2012, at 2:43 AM, Chris Fitzpatrick wrote:

Hi Ravi,

Yeah, if you haven't seen it yet, take a look at the first link (http://www.w3.org/wiki/LargeTripleStores) in the search results that Stefano included.

A big question is whether you're going to need reasoning capabilities. If that's the case, you'll probably want to look at the first 3 in that list (Allegro from Franz, BigOWLIM from Ontotext, and Virtuoso from OpenLink). These are kinda the big 3 in terms of addressing the reasoning, large scalability, and performance issues.

If you don't really need the built-in reasoning or have huge needs in scalability/performance, I'd really recommend having a look at TinkerPop. For RDF capabilities, you'll probably need to use the TinkerPop Sail implementation (check out this blog post: http://architects.dzone.com/articles/visualizing-rdf-schema). TinkerPop has a pretty good community around it and Neo4j has an excellent community. The setup is fairly easy, although the Sail stuff does have a learning curve. Performance seems very good, and using TinkerPop/Neo4j is great because it offers a variety of querying options.

That all said, I've found performance for triple stores can be really hard to measure, since how you store your data and how you're querying can make all the difference... so often it's not the software, but the way the data model has been designed.

good luck!

b,chris.
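To make the inferencing requirement concrete, here is a minimal sketch of a transitive closure over rdfs:subClassOf statements, computed in plain Python rather than by any particular triple store. The function name and the `ex:` class names are hypothetical; the same approach would apply to rdfs:subPropertyOf.

```python
from collections import defaultdict


def superclass_closure(sub_class_of):
    """Map each class to all of its direct and inferred superclasses.

    sub_class_of is an iterable of (subclass, superclass) pairs, i.e. the
    subject and object of each rdfs:subClassOf triple.
    """
    direct = defaultdict(set)
    for sub, sup in sub_class_of:
        direct[sub].add(sup)

    closure = {}
    for cls in list(direct):
        seen = set()
        stack = list(direct[cls])
        while stack:
            sup = stack.pop()
            if sup not in seen:  # the 'seen' set also guards against cycles
                seen.add(sup)
                stack.extend(direct.get(sup, ()))
        closure[cls] = seen
    return closure


# Hypothetical class hierarchy for illustration only.
hierarchy = [
    ("ex:Painting", "ex:Artwork"),
    ("ex:Artwork", "ex:DigitalObject"),
    ("ex:Photograph", "ex:Artwork"),
]
closure = superclass_closure(hierarchy)
```

Materializing this closure at load time trades storage for query speed; computing it at query time does the reverse.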
Re: [CODE4LIB] triple stores ???
On Tue, May 29, 2012 at 4:22 PM, Ravi Shankar rshan...@stanford.edu wrote:

We are currently leaning towards open-source triple stores. As far as inferencing goes, I suspect we will be doing at least transitive closures on rdfs:subClassOf and rdfs:subPropertyOf properties. I will look into TinkerPop. Are you currently using it, and for what purpose? I am also curious about what types of data model changes you had to make to improve your RDF store's performance.

There are some interesting results on implementing these entailments for Quest - http://obda.inf.unibz.it/protege-plugin/quest/quest.html

See: http://www.inf.unibz.it/~calvanese/papers/rodr-calv-KR-2012.pdf

They use pre-processing of the T-box to assign a numeric id to each class, then compute which ranges correspond to all the subclasses of a particular class.

The benchmarking is rather preliminary, and the LUBM results are mixed, but the one result that they give for an experiment on data from *Resource Index* is interesting. The RI dataset uses the hierarchical data from a large number of biomedical ontologies and uses text mining to associate classes from those ontologies with documents from a corpus. The Quest experiment used a subset of this data (clinicaltrials.gov). The ontologies had 3 million concepts and 2.5 million sub-class assertions.

The annotation process generates a very large volume of data. For the resource used in this experiment, the ClinicalTrials.gov (CT) collection, the annotation process generates 181 million ABox assertions (i.e., data triples), corresponding to 14 GB of data. Note that given the limited expressivity of the TBox used by this application, we can avoid query reformulation w.r.t. the TBox by storing data using a Semantic Index. We stored the data in a DB2 9.7 DB hosted in a Linux virtual machine with 4x2.67 GHz Intel Xeon processors (only one core was used) and 4 GB of RAM available to DB2.

We issued several queries; the one we describe here is q(x) ← DNA Repair Gene(x) ∧ Antigen Gene(x) ∧ Cancer Gene(x). The selectivity of the query is high, returning a total of 2 distinct resources. The performance of each technique is as follows:

(a) when rewriting w.r.t. T in CNF form the result is one SQL query with 467874 disjuncts; when rewriting in DNF (as UCQ-based rewriters do), the result is a union of 467874 SPJ queries; none of these queries is executable by DB2 with our system setup;

(b) when we rewrite the query using the Semantic Index technique, the result is a single SQL query involving 3 range disjunctions; the query requires 3.582s to execute (0.082s if the DB is warm, e.g., the indexes have been preloaded); the time required to compute the semantic index is 27s; the size of the semantic index is 4 GB;

(c) if the ABox is expanded and we execute the original query, the execution requires 3 s (0 s if warm).

With respect to the cost of the expansion, LePendu et al. indicate that a straightforward expansion of the CT resource requires 7 days and generates 140 GB of data and, after a careful optimization of the process (including data partitioning, parallelization, etc.), this time can be reduced to 40 minutes. Given these results, we believe that the Semantic Index is possibly a better option than data expansion, due to the drastic cost of the latter. Moreover, it scales to dimensions in which pure query reformulation may be impossible.
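The numbering idea described above can be sketched in a few lines. This is only an illustration of the general technique, not the Quest implementation: it assumes the class hierarchy is a tree with a single root, so that every class gets exactly one contiguous id range, whereas a class with several parents would need more than one range. The function names and the gene class names are hypothetical.

```python
from collections import defaultdict


def build_index(sub_class_of, root):
    """Number classes depth-first so each class's subclasses form one id range.

    Returns (ids, ranges): ids maps class -> numeric id, ranges maps
    class -> (low, high) covering the class itself and all its subclasses.
    Assumes a tree-shaped hierarchy (one parent per class).
    """
    children = defaultdict(list)
    for sub, sup in sub_class_of:
        children[sup].append(sub)

    ids, ranges = {}, {}
    counter = 0

    def visit(cls):
        nonlocal counter
        counter += 1
        ids[cls] = counter
        for child in children.get(cls, ()):
            visit(child)
        # Every id handed out since entering cls belongs to a subclass of cls.
        ranges[cls] = (ids[cls], counter)

    visit(root)
    return ids, ranges


def instances_of(cls, assertions, ids, ranges):
    """Individuals asserted to be in cls or any subclass, via one range check."""
    low, high = ranges[cls]
    return {ind for ind, c in assertions if low <= ids[c] <= high}


# Hypothetical TBox and ABox for illustration only.
tbox = [
    ("CancerGene", "Gene"),
    ("OncoGene", "CancerGene"),
    ("RepairGene", "Gene"),
]
abox = [("g1", "OncoGene"), ("g2", "RepairGene"), ("g3", "Gene")]

ids, ranges = build_index(tbox, "Gene")
cancer_genes = instances_of("CancerGene", abox, ids, ranges)
```

In a relational setting the same check becomes a `BETWEEN low AND high` condition on an indexed class-id column, which is why the rewritten query stays small instead of growing with the number of subclasses.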
Re: [CODE4LIB] triple stores ???
Hi Ravi - I'll let some of my more technical folks chime in, but we do a bunch with RDF and have found every triplestore we've tried very limited in handling transactions. Reading and writing at the same time causes a deadlock that's a mess to keep clean.

So, we went back where we started and created a triplestore using SQL with big tables of triples. We cheat a little bit with a fourth column for ID and a fifth that helps speed up blank node searching. This has helped us avoid these transactional problems we were having, and the performance is quite good for ingest.

Most of our searching is done by stuffing the triples into Solr in a JSON format, so we don't rely on the backend data store for that much. We also sync the SQL triples to AllegroGraph in case we need deeper SPARQL things, but we're thinking of shedding this from our architecture.

Declan

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ravi Shankar
Sent: Tuesday, May 29, 2012 12:12 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] triple stores ???

We (DLSS at Stanford Libraries) are planning to use a triple store for storing and retrieving annotations (in RDF) on digital objects. We are currently looking at open-source triple stores such as 4store, Virtuoso, Jena SDB and Mulgara.

Are you currently using a triple store or contemplating using one? How would you evaluate 'your' triple store along the lines of 1) ease of setup, 2) scalability, 3) query performance, 4) bulk load performance, 5) access api, 6) documentation and 7) community support?

Highly appreciate your thoughts, ideas and suggestions.

Thanks,
Ravi Shankar
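A triples-in-SQL layout of the kind described above might look roughly like the following sketch, here using SQLite from Python. The message does not give the actual schema, so everything below is an assumption: the table and column names are invented, and in particular the fifth column is guessed to hold the named resource that a blank node hangs off, so that a resource and its blank-node statements can be fetched in one query.

```python
import sqlite3

# Assumed layout: subject, predicate, object, then an id column and a fifth
# column (here called 'parent') to shortcut blank-node lookups.
SCHEMA = """
CREATE TABLE triples (
    subject   TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object    TEXT NOT NULL,
    id        INTEGER PRIMARY KEY,
    parent    TEXT
);
CREATE INDEX idx_triples_subject ON triples (subject);
CREATE INDEX idx_triples_parent  ON triples (parent);
"""


def open_store(path=":memory:"):
    """Create a fresh store; ':memory:' keeps the example self-contained."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_triples(conn, rows):
    """Insert (subject, predicate, object, parent) rows in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO triples (subject, predicate, object, parent) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


def describe(conn, resource):
    """All triples about a resource, including those of its blank nodes."""
    cur = conn.execute(
        "SELECT subject, predicate, object FROM triples "
        "WHERE subject = ? OR parent = ? ORDER BY id",
        (resource, resource),
    )
    return cur.fetchall()


conn = open_store()
add_triples(conn, [
    ("ex:obj1", "ex:creator", "_:b1", None),
    ("_:b1", "ex:name", "A. Person", "ex:obj1"),
    ("ex:obj2", "dc:title", "Other object", None),
])
about_obj1 = describe(conn, "ex:obj1")
```

Wrapping each ingest batch in a single database transaction is one way a relational backend can sidestep the concurrent read/write problems mentioned above, since the database handles the locking.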
Re: [CODE4LIB] triple stores ???
The latest version of Jena TDB adds atomic transactions (version 0.9.0+). See http://jena.apache.org/documentation/tdb/tdb_transactions.html for documentation. The following limitations are listed:

- Bulk loads: the TDB bulk loader is not transactional.
- Nested transactions are not supported.
- Some active transaction state is held exclusively in-memory, limiting scalability.
- Long-running transactions: read-transactions cause a build-up of pending changes.
- If a single read transaction runs for a long time when there are many updates, the system will consume a lot of temporary resources.

On Tue, May 29, 2012 at 7:00 PM, Fleming, Declan dflem...@ucsd.edu wrote:

Hi Ravi - I'll let some of my more technical folks chime in, but we do a bunch with RDF and have found every triplestore we've tried very limited in handling transactions. Reading and writing at the same time causes a deadlock that's a mess to keep clean.

So, we went back where we started and created a triplestore using SQL with big tables of triples. We cheat a little bit with a fourth column for ID and a fifth that helps speed up blank node searching. This has helped us avoid these transactional problems we were having, and the performance is quite good for ingest.

Most of our searching is done by stuffing the triples into Solr in a JSON format, so we don't rely on the backend data store for that much. We also sync the SQL triples to AllegroGraph in case we need deeper SPARQL things, but we're thinking of shedding this from our architecture.

Declan