Re: [CODE4LIB] triple stores ???

2012-05-30 Thread Chris Fitzpatrick
Hey Ravi,

I actually learned about TinkerPop from a posting on this list by
Brian Tingle. I started playing with it and eventually started
working with it for the digital repository we're building. I
began with the Sail extension, but scaled back to a non-RDF model
after realizing RDF wasn't really a requirement that our users
were asking for...simply using it as a graph db makes things go a lot
quicker. Of course, we're planning on revisiting the
RDF-based model in the future.

In TinkerPop/Neo4j, there's ACID transaction support and there's a bulk
loading utility.  However, there is no transaction support in the
bulk loader...also, I think Sail has kinda different/weird transaction
support, but I think you can override that in TinkerPop somehow...

b,chris.






Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Stefano Bargioni
Maybe a Google search can help to find comparisons:
http://www.google.com/search?sugexp=chrome,mod=4&sourceid=chrome&ie=UTF-8&q=4store+Virtuoso+Jena+SDB++Mulgara
The result includes your post... added 8 minutes ago.
Stefano

On 29 May 2012, at 09:12, Ravi Shankar wrote:

 We (DLSS at Stanford Libraries) are planning to use a triple store for
 storing and retrieving annotations (in RDF) on digital objects. We are
 currently looking at open-source triple stores such as 4store, Virtuoso, Jena
 SDB and Mulgara. Are you currently using a triple store or contemplating
 using one? How would you evaluate 'your' triple store along the lines of 1)
 ease of setup, 2) scalability, 3) query performance, 4) bulk load
 performance, 5) access API, 6) documentation and 7) community support?
 
 Highly appreciate your thoughts, ideas and suggestions.
 
 Thanks,
 Ravi Shankar
 


__
Your 5x1000 donation to the Patronato di San Girolamo della Carita' is a simple
gesture of great value.
Your signature will help priests be closer to the needs of us all.
Help us train priests and seminarians from all 5 continents by entering
tax code 97023980580 in your tax return.


Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Chris Fitzpatrick
Hi Ravi,

Yeah, if you haven't seen it yet, take a look at the first link
(http://www.w3.org/wiki/LargeTripleStores) in the search results that
Stefano included.

A big question is whether you're going to need reasoning capabilities. If
that's the case, you'll probably want to look at the first 3 in that
list (Allegro from Franz, BigOWLIM from Ontotext, and Virtuoso from
OpenLink). These are kinda the big 3 in terms of addressing the
reasoning, large scalability, and performance issues.

If you don't really need the built-in reasoning or have huge needs in
scalability/performance, I'd really recommend having a look at
TinkerPop.
For RDF capabilities, you'll probably need to use the TinkerPop Sail
implementation (check out this blog post:
http://architects.dzone.com/articles/visualizing-rdf-schema).
TinkerPop has a pretty good community around it and Neo4j has an
excellent community. The setup is fairly easy, although the Sail stuff
does have a learning curve. Performance seems very good, and using
TinkerPop/Neo4j is great because it offers a variety of querying
options. That all said, I've found performance for triple stores can
be really hard to measure, since how you store your data and how
you're querying can make all the difference...so often it's not the
software, but the way the data model has been designed.

good luck!
b,chris.









Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Ravi Shankar
Thanks, Stefano. The Europeana report seems to be quite comprehensive. It is
funny: I had searched earlier for triple store comparisons with a more
explicit query, 'rdf triple store comparison', and the Europeana report
appeared on the third page of the search results. The 'triple' in the search
seems to be the culprit -- a clear need for more semantics in the search engine
;)

Cheers,
Ravi



Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Ethan Gruber
For those using these big triplestores, how are you putting data in?  I'm
looking for a triplestore which supports SPARQL update.  Any comments
anyone can add on this interface will be useful.
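
For reference, a minimal sketch of what a SPARQL update call could look like
over HTTP, assuming the store accepts the SPARQL 1.1 Protocol style of update
(a POST whose body is the update string, sent with the content type
application/sparql-update). The endpoint URL and the example triple are
hypothetical, and whether a given store exposes such an endpoint has to be
checked in its own documentation.

```python
import urllib.request


def build_update_request(endpoint, update):
    """Build (but do not send) an HTTP POST carrying a SPARQL update string."""
    return urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


# Hypothetical update: adds a single title triple to the default graph.
UPDATE = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA { <http://example.org/obj1> dc:title "Sample object" . }
"""

# Hypothetical endpoint URL; replace with the update URL of the store in use.
request = build_update_request("http://localhost:3030/ds/update", UPDATE)

# Sending is left out on purpose, since it needs a running store:
# urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect or
log exactly what would go to the store before any data is changed.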

Ethan



Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Ravi Shankar
Hi Chris,
We are currently leaning towards open-source triple stores. As far as
inferencing goes, I suspect we will be computing at least transitive closures on
rdfs:subClassOf and rdfs:subPropertyOf. I will look into TinkerPop.
Are you currently using it, and for what purpose? I am also curious about what
types of data model changes you had to make to improve your RDF store's
performance.
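
To make the kind of inference concrete, here is a minimal sketch of a
transitive closure over rdfs:subClassOf, computed outside any triple store.
The class names are made up for illustration, and a real store with RDFS
support would normally do this internally; the same function would work for
rdfs:subPropertyOf pairs.

```python
from collections import defaultdict


def superclass_closure(sub_class_of):
    """Map each class to the set of all its direct and inferred superclasses.

    sub_class_of is an iterable of (child, parent) pairs, i.e. the asserted
    rdfs:subClassOf statements.
    """
    parents = defaultdict(set)
    for child, parent in sub_class_of:
        parents[child].add(parent)

    closure = {}
    for cls in list(parents):
        seen = set()
        stack = list(parents[cls])
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.add(current)
            stack.extend(parents.get(current, ()))
        closure[cls] = seen
    return closure


# Hypothetical class hierarchy for annotated digital objects.
asserted = [
    ("ex:Photo", "ex:Image"),
    ("ex:Image", "ex:Document"),
    ("ex:Map", "ex:Image"),
]
closure = superclass_closure(asserted)
```

With this input, ex:Photo ends up with both ex:Image and ex:Document as
superclasses, which is the entailment a query for all ex:Document instances
would rely on.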

Thanks,
Ravi



Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Simon Spero
On Tue, May 29, 2012 at 4:22 PM, Ravi Shankar rshan...@stanford.edu wrote:

 We are currently leaning towards open-source triple stores. As far as
 inferencing goes, I suspect we will be computing at least transitive closures
 on rdfs:subClassOf and rdfs:subPropertyOf. I will look into
 TinkerPop. Are you currently using it, and for what purpose? I am also
 curious about what types of data model changes you had to make to
 improve your RDF store's performance.


There are some interesting results on implementing these entailments in
Quest:
http://obda.inf.unibz.it/protege-plugin/quest/quest.html

See: http://www.inf.unibz.it/~calvanese/papers/rodr-calv-KR-2012.pdf

They pre-process the T-box to assign a numeric id to each class,
then compute which ranges of ids correspond to all the subclasses of a
particular class.  The benchmarking is rather preliminary, and the LUBM
results are mixed, but the one result that they give for an experiment on
data from the *Resource Index* is interesting.
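
The numbering idea can be sketched roughly as follows. This is not the Quest
implementation: it assumes the class hierarchy is a tree with a single root
(with multiple inheritance a class would need more than one range), and the
class names and assertions are made up for illustration.

```python
from collections import defaultdict


def build_index(sub_class_of, root):
    """Assign each class a (low, high) id range covering all its subclasses.

    Classes are numbered in depth-first pre-order, so every subclass of a
    class gets an id inside that class's range.
    """
    children = defaultdict(list)
    for child, parent in sub_class_of:
        children[parent].append(child)

    index = {}
    counter = 0

    def visit(cls):
        nonlocal counter
        counter += 1
        low = counter
        for child in children.get(cls, ()):
            visit(child)
        index[cls] = (low, counter)

    visit(root)
    return index


def instances_of(cls, index, assertions):
    """Find individuals typed with cls or any subclass, using a range check."""
    low, high = index[cls]
    return [ind for ind, asserted in assertions if low <= index[asserted][0] <= high]


# Hypothetical hierarchy and type assertions.
hierarchy = [
    ("CancerGene", "Gene"),
    ("DNARepairGene", "Gene"),
    ("TumorSuppressor", "CancerGene"),
]
index = build_index(hierarchy, "Gene")
assertions = [("g1", "TumorSuppressor"), ("g2", "DNARepairGene"), ("g3", "Gene")]
cancer_genes = instances_of("CancerGene", index, assertions)
```

The point of the range check is that a query over a class and all of its
subclasses becomes a single numeric comparison per class, rather than a union
over every subclass name.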

The RI dataset uses the hierarchical data from a large number of biomedical
ontologies and uses text mining to associate classes from those ontologies
with documents from a corpus. The Quest experiment used a subset of this
data (clinicaltrials.gov).  The ontologies had

3 million concepts and 2.5 million sub-class assertions. The annotation
process generates a very large volume of data. For the resource used in
this experiment, the ClinicalTrials.gov (CT) collection, the annotation
process generates 181 million ABox assertions (i.e., data triples),
corresponding to 14 GB of data.


Note that given the limited expressivity of the TBox used by this
application, we can avoid query reformulation w.r.t. the TBox by storing
data using a Semantic Index. We stored the data in a DB2 9.7 DB hosted in a
Linux virtual machine with 4x2.67 GHz Intel Xeon processors (only one core
was used) and 4 GB of RAM available to DB2. We issued several queries, the
one we describe here is q(x) <- DNA Repair Gene(x) ^ Antigen Gene(x) ^ Cancer
Gene(x).

The selectivity of the query is high, returning a total of 2 distinct
resources. The performance of each technique is as follows: (a) when
rewriting w.r.t. T in CNF form the result is one SQL query with 467874
disjuncts, when rewriting in DNF (as UCQ-based rewriters do), the result is
a union of 467874 SPJ queries; none of these queries is executable by DB2
with our system setup; (b) when we rewrite the query using the Semantic
Index technique, the result is a single SQL query involving 3 range
disjunctions; the query requires 3.582s to execute (0.082s if the DB is
warm, e.g., the indexes have been preloaded); the time required to compute
the semantic index is 27s; the size of the semantic index 4 GB; (c) if the
ABox is expanded and we execute the original query, the execution requires
3 s (0 s if warm). With respect to the cost of the expansion, LePendu et
al. indicate that a straightforward expansion of the CT resource requires
7 days, and generates 140 GB of data and, after a careful optimization of
the process (including data partitioning, parallelization, etc.) this time
can be reduced to 40 minutes. Given these results, we believe that the
Semantic Index is possibly a better option than data expansion, due to the
drastic cost of the latter. Moreover, it scales to dimensions in which pure
query reformulation may be impossible.


Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Fleming, Declan
Hi Ravi - I'll let some of my more technical folks chime in, but we do a bunch 
with RDF and have found every triplestore we've tried very limited in handling 
transactions.  Reading and writing at the same time causes a deadlock that's a 
mess to keep clean.  So, we went back where we started and created a 
triplestore using SQL with big tables of triples.  We cheat a little bit with a 
fourth column for ID and a fifth that helps speed up blank node searching.  
This has helped us avoid these transactional problems we were having, and the 
performance is quite good for ingest.
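
A rough sketch of what such a five-column triples table might look like,
using SQLite from Python. The column names, the indexes, and in particular the
meaning given to the fifth column (the subject that owns a blank node) are
assumptions made for illustration, not the actual UCSD schema.

```python
import sqlite3

# Hypothetical schema: three triple columns plus an id and a blank-node helper.
SCHEMA = """
CREATE TABLE triples (
    id        INTEGER PRIMARY KEY,  -- fourth column: row identifier
    subject   TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object    TEXT NOT NULL,
    parent    TEXT                  -- fifth column: owner of a blank-node subject
);
CREATE INDEX idx_subject ON triples (subject, predicate);
CREATE INDEX idx_parent ON triples (parent);
"""


def open_store(path=":memory:"):
    """Create a store with the sketch schema (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_triples(conn, rows):
    """Insert (subject, predicate, object, parent) rows in one transaction."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO triples (subject, predicate, object, parent) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


def describe(conn, subject):
    """Return the subject's triples plus those of its blank nodes, no joins."""
    cur = conn.execute(
        "SELECT subject, predicate, object FROM triples "
        "WHERE subject = ? OR parent = ? ORDER BY id",
        (subject, subject),
    )
    return cur.fetchall()


conn = open_store()
add_triples(conn, [
    ("ex:obj1", "dc:title", "Sample object", None),
    ("ex:obj1", "ex:annotation", "_:b1", None),
    ("_:b1", "ex:body", "A note", "ex:obj1"),
])
rows = describe(conn, "ex:obj1")
```

Here the relational database's own transaction handling covers concurrent
reads and writes, and the extra column lets one indexed lookup fetch an object
together with its blank nodes.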

Most of our searching is done by stuffing the triples into Solr in a JSON
format, so we don't rely on the backend data store for that much.  We also sync
the SQL triples to AllegroGraph in case we need deeper SPARQL things, but we're
thinking of shedding this from our architecture.
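
One way the triples-to-JSON step could work is sketched below, assuming one
search document per subject with one multi-valued field per predicate. The
field naming (replacing ':' with '_') and the sample data are hypothetical;
the actual document layout is not described in this thread.

```python
import json
from collections import defaultdict


def triples_to_docs(triples):
    """Group (subject, predicate, object) triples into one dict per subject."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        # Hypothetical naming rule: turn the prefixed name into a plain field name.
        field = predicate.replace(":", "_")
        grouped[subject][field].append(obj)
    return [{"id": subject, **fields} for subject, fields in grouped.items()]


triples = [
    ("ex:obj1", "dc:title", "Sample object"),
    ("ex:obj1", "dc:subject", "maps"),
    ("ex:obj1", "dc:subject", "California"),
]
docs = triples_to_docs(triples)

# JSON text that could then be posted to a search index.
payload = json.dumps(docs, indent=2)
```

Flattening per subject like this trades graph-shaped queries for fast keyword
and facet search, which fits the division of labour described above.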

Declan



Re: [CODE4LIB] triple stores ???

2012-05-29 Thread Simon Spero
The latest version of Jena TDB (0.9.0+) adds atomic transactions.

See http://jena.apache.org/documentation/tdb/tdb_transactions.html for
documentation:

The following limitations are listed:


   - Bulk loads: the TDB bulk loader is not transactional
   - Nested transactions are not supported.
   - Some active transaction state is held exclusively in-memory, limiting
   scalability.
   - Long-running transactions. Read-transactions cause a build-up of
   pending changes;
   - If a single read transaction runs for a long time when there are many
   updates, the system will consume a lot of temporary resources.



