Re: Jena on Cassandra - status

Andy Seaborne Thu, 08 Dec 2016 04:42:56 -0800

This

http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise

talks about using Spark for the join engine. An interesting possibility in the future.

In relational terms, joins are an AND between columns in different tables. OR is UNION. There is also "and" and "or" in restrictions on values ("relation" in CQL speak?). I was surprised by the lack of OR and general NOT. Maybe it is exposing how table scans are done across the cluster.


    Andy




On 08/12/16 07:43, Claude Warren wrote:

I will look into indexing the values.  Currently the tables are full form
indexes (i.e. all columns are in the primary key).  However, scanning for
ranges of values is an issue that needs to be resolved as you point out.

Cassandra does not seem to do joins (no "or" clause).  The expectation is
that the client will do the joins or that the data will be stored so that
joins are not needed.

I have taken to thinking about Cassandra as a massive collection of named
graphs and having the ability to extract sub graphs from that collection
that will then be queried for the solution.

So in the StageGenerator design I was going to start by breaking up the BGP
into groups based on the Subject.

Find the subject group that has the most qualified statement (i.e. the
statement with the fewest unknowns) and start resolving the subject based
on that.  Basically pull back a CONSTRUCT { ?s ?p ?o } WHERE { BIND(
<subject> AS ?s ) ?s ?p ?o } and place that into a temporary local (in
memory or small TDB or small SDB) graph.  Iterate through the groups from
the BGP performing the resolution (and adding bindings from the temporary
graph), adding the results to the temp graph as we go.

Finally perform a query against the local graph to return all the results
properly.
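
A minimal sketch of the grouping and ordering step described above, in plain Java without any Jena or Cassandra dependency. TriplePattern, SubjectGrouping and their methods are hypothetical names used only for illustration; variables are assumed to be written with a leading '?', and "most qualified" is taken to mean the pattern with the fewest variable positions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a triple pattern; a term starting with '?' is a variable.
final class TriplePattern {
    final String s, p, o;

    TriplePattern(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }

    static boolean isVar(String term) {
        return term.startsWith("?");
    }

    // Number of unknown (variable) positions in this pattern.
    int unknowns() {
        int n = 0;
        if (isVar(s)) n++;
        if (isVar(p)) n++;
        if (isVar(o)) n++;
        return n;
    }
}

final class SubjectGrouping {
    // Group the patterns of a BGP by subject term, keeping first-seen order.
    static Map<String, List<TriplePattern>> bySubject(List<TriplePattern> bgp) {
        Map<String, List<TriplePattern>> groups = new LinkedHashMap<>();
        for (TriplePattern t : bgp) {
            groups.computeIfAbsent(t.s, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    // Fewest unknowns among the patterns of one group.
    static int bestScore(List<TriplePattern> group) {
        int best = Integer.MAX_VALUE;
        for (TriplePattern t : group) {
            best = Math.min(best, t.unknowns());
        }
        return best;
    }

    // Subjects ordered so that the group holding the most qualified pattern comes first.
    // The sort is stable, so ties keep their original BGP order.
    static List<String> evaluationOrder(List<TriplePattern> bgp) {
        Map<String, List<TriplePattern>> groups = bySubject(bgp);
        List<String> subjects = new ArrayList<>(groups.keySet());
        subjects.sort(Comparator.comparingInt((String s) -> bestScore(groups.get(s))));
        return subjects;
    }
}
```

Calling SubjectGrouping.evaluationOrder on a BGP gives the order in which the subject groups would be resolved against the store; fetching each group and loading it into the temporary graph is left out of the sketch.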

Now given your statement about StageGenerator being a dead end, I will have
to rework that to fit the OpExecutor pattern.

Not having an "OR" in CQL makes the Cassandra implementation "interesting".

Claude

On Wed, Dec 7, 2016 at 9:59 PM, Andy Seaborne <[email protected]> wrote:



On 06/12/16 15:05, Claude Warren wrote:

Looking briefly at the cumulus documentation, I am doing something
similar.  I think they will handle range scanning better than this will
with the current implementation, but I am fairly certain that can be fixed
in the future.

My plans are:

   1. Get the unit tests done
   2. Get the assembler code working
   3. Expand on the execution strategy.  For example, I think that a custom
   StageGenerator would probably help a lot.  Perhaps some join optimization
   as well.


StageGenerator is a bit of a dead end for this in the long term: it handles
a single graph only.

Implementing OpExecutor for this store is the same amount of work for just
joins and has the possibility of more work being sent to Cassandra in the
future (quads, leftjoin, simple filtering).

Subclass and override
   protected QueryIterator execute(OpBGP opBGP, QueryIterator input)

Have you considered putting values into the indexes as well as the RDF
term for the object field? Then Cassandra can do some FILTERs, which would
enable large amounts of data to be scanned/filtered in parallel. Something
to learn from SDB.

The other factor to drive scale is being able to do merge joins over
streaming results from Cassandra.  Statement.setFetchSize seems to be the
way to get repeated small blocks of results in an overall large return,
which is perfect.  This will affect the choice of indexes.
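
A minimal sketch of a merge join over two streamed inputs, assuming each input is an iterator of {joinKey, value} rows already sorted by joinKey (which is why index choice matters). MergeJoin and the row layout are hypothetical; in practice the iterators would wrap paged result sets, and the output would be streamed as well rather than collected in a list as done here for brevity.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class MergeJoin {
    // Joins two inputs on row[0]. Both iterators must deliver rows sorted by row[0].
    // Output rows are {key, leftValue, rightValue}.
    static List<String[]> join(Iterator<String[]> left, Iterator<String[]> right) {
        List<String[]> out = new ArrayList<>();
        String[] l = next(left);
        String[] r = next(right);
        while (l != null && r != null) {
            int c = l[0].compareTo(r[0]);
            if (c < 0) {
                l = next(left);            // left key is behind, advance left
            } else if (c > 0) {
                r = next(right);           // right key is behind, advance right
            } else {
                // Keys match: buffer only the run of right rows sharing this key.
                String key = l[0];
                List<String[]> run = new ArrayList<>();
                while (r != null && r[0].equals(key)) {
                    run.add(r);
                    r = next(right);
                }
                // Pair every left row with this key against the buffered run.
                while (l != null && l[0].equals(key)) {
                    for (String[] rr : run) {
                        out.add(new String[] { key, l[1], rr[1] });
                    }
                    l = next(left);
                }
            }
        }
        return out;
    }

    // Returns the next row, or null when the input is exhausted.
    private static String[] next(Iterator<String[]> it) {
        return it.hasNext() ? it.next() : null;
    }
}
```

Only one run of equal keys from the right side is held in memory at a time, so the client never needs the full result of either query.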

Otherwise a parallel hash join will mean multiple CQL statements can be
active at one time but that is more demanding of client-side resources.
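
A minimal sketch of the hash join alternative, assuming two suppliers that each stand in for one CQL statement and return {joinKey, value} rows. ParallelHashJoin and the row layout are hypothetical; both fetches are started before either result is consumed, which is where the extra client-side cost comes from, since the whole build side is held in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class ParallelHashJoin {
    // Joins two inputs on row[0]. Output rows are {key, buildValue, probeValue},
    // in probe-side order.
    static List<String[]> join(Supplier<List<String[]>> buildSide,
                               Supplier<List<String[]>> probeSide) {
        // Start both fetches so that the two (stand-in) statements are active at once.
        CompletableFuture<List<String[]>> build = CompletableFuture.supplyAsync(buildSide);
        CompletableFuture<List<String[]>> probe = CompletableFuture.supplyAsync(probeSide);

        // Build phase: hash the whole build side by join key.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : build.join()) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // Probe phase: look up each probe row in the hash table.
        List<String[]> out = new ArrayList<>();
        for (String[] row : probe.join()) {
            List<String> matches = table.getOrDefault(row[0], Collections.<String>emptyList());
            for (String v : matches) {
                out.add(new String[] { row[0], v, row[1] });
            }
        }
        return out;
    }
}
```

Unlike the merge join, this places no ordering requirement on either input, at the price of buffering one side entirely on the client.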

        Andy

Claude

On Tue, Dec 6, 2016 at 2:50 PM, A. Soroka <[email protected]> wrote:

This is really neat to see, Claude! There is also:


https://github.com/cumulusrdf/cumulusrdf

which uses Cassandra to support the Sesame API. Are you using a similar
arrangement inside Cassandra?

---
A. Soroka
The University of Virginia Library

On Dec 5, 2016, at 5:42 PM, Claude Warren <[email protected]> wrote:


I have set up a quick github with the Cassandra code (such as it is).
https://github.com/Claudenw/jena-on-cassandra

I was going to work on the assembler, but I am backing off that to get the
unit test in place first.



On Mon, Dec 5, 2016 at 9:36 AM, Claude Warren <[email protected]> wrote:

Howdy,


For those who are wondering.

I did get permission to implement and contribute the Jena on Cassandra
code.

We have a graph implementation and a DatasetGraph.  I still need to
implement the contract tests but I think the code works.

Plan is to implement an assembler and begin to look at StageGenerator and
other classes to take advantage of the capabilities of the Cassandra data
store.


I need to find a place to put the code (git repository).

Claude

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren



