Re: Jena on Cassandra - status

Claude Warren Wed, 07 Dec 2016 23:44:06 -0800

I will look into indexing the values.  Currently the tables are full-form
indexes (i.e., all columns are in the primary key).  However, scanning for
ranges of values is an issue that needs to be resolved, as you point out.


Cassandra does not seem to do joins (and CQL has no "OR" clause).  The expectation is
that the client will do the joins or that the data will be stored so that
joins are not needed.
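
As an illustration of a join done on the client, here is a minimal sketch of a merge join in plain Java. It assumes each query returns its join keys in ascending order with no duplicates; the two iterators stand in for result sets streamed back from Cassandra. The class name ClientSideMergeJoin is hypothetical and is not part of the jena-on-cassandra code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: joins two key streams on the client side.
final class ClientSideMergeJoin {
    // Assumes both inputs are ascending and duplicate-free.
    // Returns the keys that appear in both inputs, in ascending order.
    static <K extends Comparable<K>> List<K> join(Iterator<K> left, Iterator<K> right) {
        List<K> out = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return out;
        }
        K l = left.next();
        K r = right.next();
        while (true) {
            int c = l.compareTo(r);
            if (c == 0) {
                // Keys match: emit and advance both sides.
                out.add(l);
                if (!left.hasNext() || !right.hasNext()) break;
                l = left.next();
                r = right.next();
            } else if (c < 0) {
                // Left key is smaller: advance the left side only.
                if (!left.hasNext()) break;
                l = left.next();
            } else {
                // Right key is smaller: advance the right side only.
                if (!right.hasNext()) break;
                r = right.next();
            }
        }
        return out;
    }
}
```

Each side is read exactly once and only the current key of each stream is held in memory, which is why this shape suits results that arrive in small blocks.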

I have taken to thinking about Cassandra as a massive collection of named
graphs and having the ability to extract sub graphs from that collection
that will then be queried for the solution.

So in the StageGenerator design I was going to start by breaking up the BGP
into groups based on the Subject.

Find the subject group that has the most qualified statement (i.e. the
statement with the fewest unknowns) and start resolving the subject based
on that.  Basically, pull back the equivalent of
CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }, with ?s bound to the subject, and
place that into a temporary local (in-memory, small TDB, or small SDB)
graph.  Iterate through the groups from the BGP, performing the resolution
(and adding bindings from the temporary graph) and adding the results to
the temp graph as we go.

Finally perform a query against the local graph to return all the results
properly.
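
The grouping and ordering step described above can be sketched in plain Java. This is an illustration only, under stated assumptions: TriplePattern and SubjectGrouping are hypothetical stand-ins (not Jena or jena-on-cassandra classes), a term beginning with '?' is treated as a variable, and a group is ranked by the fewest unknowns found in any of its patterns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one triple pattern of a BGP.
final class TriplePattern {
    final String s, p, o;

    TriplePattern(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }

    // Assumption for this sketch: variables are written with a leading '?'.
    static boolean isVar(String term) {
        return term.startsWith("?");
    }

    // Number of unbound positions (0 to 3) in this pattern.
    int unknowns() {
        int n = 0;
        if (isVar(s)) n++;
        if (isVar(p)) n++;
        if (isVar(o)) n++;
        return n;
    }
}

final class SubjectGrouping {
    // Group the patterns by subject term, keeping first-seen order.
    static Map<String, List<TriplePattern>> groupBySubject(List<TriplePattern> bgp) {
        Map<String, List<TriplePattern>> groups = new LinkedHashMap<>();
        for (TriplePattern tp : bgp) {
            groups.computeIfAbsent(tp.s, k -> new ArrayList<>()).add(tp);
        }
        return groups;
    }

    // Fewest unknowns among the patterns of one group.
    static int bestUnknowns(List<TriplePattern> group) {
        return group.stream().mapToInt(TriplePattern::unknowns).min().orElse(Integer.MAX_VALUE);
    }

    // Subjects ordered so the group with the most qualified pattern comes first.
    // The sort is stable, so ties keep their first-seen order.
    static List<String> resolutionOrder(List<TriplePattern> bgp) {
        Map<String, List<TriplePattern>> groups = groupBySubject(bgp);
        List<String> order = new ArrayList<>(groups.keySet());
        order.sort(Comparator.comparingInt(subject -> bestUnknowns(groups.get(subject))));
        return order;
    }
}
```

For a BGP such as { ?x ?p ?o . ?y rdf:type ex:Person . ?x ex:knows ?y }, the ?y group would be resolved first because it contains a pattern with a single unknown.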

Now given your statement about StageGenerator being a dead end, I will have
to rework that to fit the OpExecutor pattern.

Not having an "OR" in CQL makes the Cassandra implementation "interesting".

Claude

On Wed, Dec 7, 2016 at 9:59 PM, Andy Seaborne <[email protected]> wrote:

>
>
> On 06/12/16 15:05, Claude Warren wrote:
>
>> Looking briefly at the cumulus documentation, I am doing something
>> similar.  I think they will handle range scanning better than this will
>> with the current implementation, but I am fairly certain that can be fixed
>> in the future.
>>
>> My plans are:
>>
>>    1. Get the unit tests done
>>    2. Get the assembler code working
>>    3. Expand on the execution strategy.  For example, I think that a custom
>>    StageGenerator would probably help a lot.  Perhaps some join optimization
>>    as well.
>>
>
> StageGenerator is a bit of a dead end for this long term.  Single graph only.
>
> Implementing OpExecutor for this store is the same amount of work for just
> joins and has the possibility of more work being sent to Cassandra in the
> future (quads, leftjoin, simple filtering)
>
> Subclass and override
>    protected QueryIterator execute(OpBGP opBGP, QueryIterator input)
>
> Have you considered putting values into the indexes as well as the RDF
> term for the object field? Then Cassandra can do some FILTERs which would
> enable large amounts of data to be scanned/filtered in parallel. Something
> to learn from SDB.
>
> The other factor to drive scale is being able to do merge joins over
> streaming results from Cassandra.  Statement.setFetchSize seems to be the
> way to get repeated small blocks of results in an overall large return
> which is perfect.  This will affect the choice of indexes.
>
> Otherwise a parallel hash join will mean multiple CQL statements can be
> active at one time but that is more demanding of client-side resources.
>
>         Andy
>
>
>
>> Claude
>>
>> On Tue, Dec 6, 2016 at 2:50 PM, A. Soroka <[email protected]> wrote:
>>
>>> This is really neat to see, Claude! There is also:
>>>
>>> https://github.com/cumulusrdf/cumulusrdf
>>>
>>> which uses Cassandra to support the Sesame API. Are you using a similar
>>> arrangement inside Cassandra?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Dec 5, 2016, at 5:42 PM, Claude Warren <[email protected]> wrote:
>>>>
>>>> I have set up a quick GitHub repo with the Cassandra code (such as it is).
>>>> https://github.com/Claudenw/jena-on-cassandra
>>>>
>>>> I was going to work on the assembler, but I am backing off that to get the
>>>> unit test in place first.
>>>>
>>>>
>>>>
>>>> On Mon, Dec 5, 2016 at 9:36 AM, Claude Warren <[email protected]> wrote:
>>>>
>>>>> Howdy,
>>>>>
>>>>> For those who are wondering.
>>>>>
>>>>> I did get permission to implement and contribute the Jena on Cassandra
>>>>> code.
>>>>>
>>>>> We have a graph implementation and a DataSetGraph.  I still need to
>>>>> implement the contract tests but I think the code works.
>>>>>
>>>>> Plan is to implement an assembler and begin to look at StageGenerator and
>>>>> other classes to take advantage of the capabilities of the Cassandra data
>>>>> store.
>>>>>
>>>>> I need to find a place to put the code (git repository).
>>>>>
>>>>> Claude
>>>>>
>>>>> --
>>>>> I like: Like Like - The likeliest place on the web
>>>>> <http://like-like.xenei.com>
>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
