Re: Graph on Cassandra

Claude Warren Mon, 31 Oct 2016 13:26:28 -0700

We don't have code at the moment.  We (the team I am on at work) are
planning on implementing on Cassandra.  That would mean that we would have
a couple of developers watching and at least one working on the code until
it was stable.


I was hoping that we would be able to contribute this to the jena project
as a complete module.   I understand not wanting to put it in as part of
the project at the beginning,  but that was my goal.

I don't have a release schedule in mind as the in-house project is still
fluid.  It might make sense to put it on github to start, but I would like
to see it in a Jena based repo in order to make it more visible to the
development community.

As I keep saying, I need to get final approval from legal before
proceeding.  I expect to hear something later this week.

Claude

On Mon, Oct 31, 2016 at 5:53 PM, Andy Seaborne <[email protected]> wrote:

>
>
> On 31/10/16 13:41, Claude Warren wrote:
>
>> Andy,
>>
>> This seems like a good approach but does not appear to be in the Jena code
>> base, which I suppose is your comment about an approach to developing
>> work.
>>
>> Does it make sense to create git clones that contain the new work?  Or
>> perhaps branches?
>>
>> Do you have a suggestion or direction you would like to see this go?
>>
>
> That's the discussion to have.  The first item is "Community".  This is
> all new code? Who is involved? Just you so far?
>
> A storage layer is not trivial - this is not an "extra" thing.  It is a
> module of its own, and if the community is significantly different, maybe
> different mailing lists (e.g. Solr within the Lucene project), maybe even
> a different project; it can be "straight to TLP" or "incubated" - that
> depends on who is involved.  There is a wide set of possibilities.
>
> If it is starting off, then the Jena git repo isn't a good place to have
> the code.  The lifecycles don't line up.
>
> A branch that is completely separate is really a separate repo.  Jena can
> get another git repo.
>
> What would be the release cycle?
> The real issue is the work needed by the PMC for releases.
>
> To get all options mentioned:
>
> If this is a one-person effort for now, then starting a github repo and
> creating the initial sketch/framework is an option.  More focused. More
> freedom to try things out and change directions.
>
>         Andy
>
>
>
>> Claude
>>
>>
>>
>> On Fri, Oct 28, 2016 at 2:35 PM, Andy Seaborne <[email protected]> wrote:
>>
>>> Claude,
>>>
>>> These may help:
>>>
>>> I have been thinking about an interface that is more oriented to the
>>> storage than the full DatasetGraph.
>>>
>>> StorageRDF breaks down all the operations into those on the default graph
>>> and those on named graphs.  For just a graph, simply ignore the named
>>> graph operations.
>>>
>>> https://github.com/afs/AFS-Dev/blob/master/src/main/java/projects/dsg2/storage/StorageRDF.java
>>>
>>> There is an adapter to the DatasetGraph hierarchy (which is needed for
>>> SPARQL):
>>>
>>> https://github.com/afs/AFS-Dev/blob/master/src/main/java/projects/dsg2/DatasetGraphStorage.java
>>>
>>> If you want to only use existing classes, DatasetGraphTriplesQuads is the
>>> place to start - used by TIM and TDB - you can implement without needing
>>> quads/named graphs. Again, simply ignore (throw
>>> UnsupportedOperationException for the named graph calls).
>>>
>>> Going the graph route could lead to rework later on for any kind of
>>> performance issues because find(S,P,O) is so narrow and precludes union
>>> default graph except by brute force.  DatasetGraph works with the SPARQL
>>> execution engine.
>>>
>>> We still need to discuss how best to approach developing work - it should
>>> not get sucked up by the release cycle.
>>>
>>>         Andy
>>>
>>>
>>> On 26/10/16 19:21, Claude Warren wrote:
>>>
>>>> My plan is to start with a Graph implementation.  We expect to write 3
>>>> tables: SPO, POS, OPS (I think).  Currently we don't have an easy way to
>>>> handle find( ANY, ANY, ANY) so I suspect we will just start with
>>>> permitting a column scan on Cassandra.
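
A minimal sketch of how table selection for a find() pattern might work
with a three-table layout like the one described above.  This is
illustration only, not code from the project: the IndexChooser class and
its methods are hypothetical, the table names are taken from the message
as written, and the rule used (pick the table whose column order has the
longest run of bound leading terms) is an assumption about how such a
layout is typically queried.  A result with no bound prefix corresponds to
find(ANY, ANY, ANY), i.e. a full scan.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of Jena or of any Cassandra driver.
final class IndexChooser {

    /** The chosen table and how many of its leading columns the pattern binds. */
    public static final class Choice {
        public final String table;
        public final int boundPrefix;

        Choice(String table, int boundPrefix) {
            this.table = table;
            this.boundPrefix = boundPrefix;
        }

        /** True when no leading column is bound, so every row must be scanned. */
        public boolean isFullScan() {
            return boundPrefix == 0;
        }
    }

    /** Counts the leading columns of a table order (e.g. "POS") that are bound. */
    static int boundPrefix(String order, boolean s, boolean p, boolean o) {
        int n = 0;
        for (char c : order.toCharArray()) {
            boolean bound = (c == 'S' && s) || (c == 'P' && p) || (c == 'O' && o);
            if (!bound) {
                break;
            }
            n++;
        }
        return n;
    }

    /** Picks the table with the longest bound prefix; the first table wins ties. */
    public static Choice choose(List<String> tables, boolean s, boolean p, boolean o) {
        if (tables.isEmpty()) {
            throw new IllegalArgumentException("at least one table is required");
        }
        Choice best = null;
        for (String t : tables) {
            int n = boundPrefix(t, s, p, o);
            if (best == null || n > best.boundPrefix) {
                best = new Choice(t, n);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList("SPO", "POS", "OPS");
        // find(ANY, p, o): predicate and object bound
        Choice po = choose(tables, false, true, true);
        System.out.println(po.table + " prefix=" + po.boundPrefix);
        // find(ANY, ANY, ANY): nothing bound, falls back to a scan
        Choice any = choose(tables, false, false, false);
        System.out.println(any.table + " fullScan=" + any.isFullScan());
    }
}
```

With these three table orders, a pattern that binds subject and object but
not predicate can only use a one-column prefix; a table ordered OSP would
cover that case with a two-column prefix.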
>>>>
>>>> I have not looked at DynamoDB but as I recall there are significant
>>>> differences under the hood.
>>>>
>>>> I expect that we will move on to a custom model or query engine to get
>>>> the best performance but that is not what we are planning for the first
>>>> cut.
>>>>
>>>> I am still waiting for management approval to do this at work ....
>>>> sometimes it takes longer to get the paperwork done than it does to
>>>> design the thing.
>>>>
>>>>
>>>> Claude
>>>>
>>>> On Mon, Oct 17, 2016 at 6:39 PM, Paul Houle <[email protected]>
>>>> wrote:
>>>>
>>>>> I like DynamoDB as a target for this sort of thing.  There are many
>>>>> tasks which are small-scale yet critical where it would otherwise be
>>>>> hard to provide a distributed and reliable database.  Put that together
>>>>> with Lambda,  which does the same for computation,  and you are cooking
>>>>> with gas.
>>>>>
>>>>> I wrote a 1-1 translation of DynamoDB documents to RDF that I use
>>>>> throughout an application;  the code is DynamoDB idiomatic in every
>>>>> way, just the application reads and writes (a constrained set of) RDF
>>>>> documents.
>>>>>
>>>>> Right now I dump the documents from the DynamoDB system into a triple
>>>>> store when I want a panoptic view,  but a distributed graph like
>>>>> that would mean being able to run SPARQL queries against DynamoDB
>>>>> directly.
>>>>>
>>>>> There are many products in the same family as Cassandra and DynamoDB
>>>>> and it would be good to think through the math so we can approach them
>>>>> all in a similar way.
>>>>>
>>>>> --
>>>>>   Paul Houle
>>>>>   [email protected]
>>>>>
>>>>> On Mon, Oct 17, 2016, at 12:31 PM, A. Soroka wrote:
>>>>>
>>>>>> Yep,
>>>>>>
>>>>>> http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/SSWS/Ladwig-et-all-SSWS2011.pdf
>>>>>>
>>>>>> indicates that they are indexing by subject. As someone who has
>>>>>> implemented LDP, that is definitely the approach that makes sense
>>>>>> there.
>>>>>>
>>>>>> ---
>>>>>> A. Soroka
>>>>>> The University of Virginia Library
>>>>>>
>>>>>> On Oct 17, 2016, at 12:20 PM, Andy Seaborne <[email protected]> wrote:
>>>>>>
>>>>>>> IIRC It stores CBDs indexed by subject so it is the "other" model to
>>>>>>> Rya.  Better for LDP (??).
>>>>>>>
>>>>>>>     Andy
>>>>>>>
>>>>>>> On 17/10/16 15:41, A. Soroka wrote:
>>>>>>>
>>>>>>>> There's also:
>>>>>>>>
>>>>>>>> https://github.com/cumulusrdf/cumulusrdf
>>>>>>>>
>>>>>>>> in a similar vein (RDF over Cassandra). Not sure what kind of
>>>>>>>> particular uses it expects to support.
>>>>>>>>
>>>>>>>> ---
>>>>>>>> A. Soroka
>>>>>>>> The University of Virginia Library
>>>>>>>>
>>>>>>>> On Oct 17, 2016, at 7:02 AM, Andy Seaborne <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Claude,
>>>>>>>>>
>>>>>>>>> There is certainly interest from me.
>>>>>>>>>
>>>>>>>>> The best thing to do depends on various factors.  By putting it
>>>>>>>>> in extras I presume you mean it gets added to the release?  That is
>>>>>>>>> not the only way forward.
>>>>>>>>>
>>>>>>>>> An important aspect of Apache is "Community over code" - will there
>>>>>>>>> be a community around this code?  Is that community the same as, or
>>>>>>>>> significantly overlapping with, the Jena community?
>>>>>>>>>
>>>>>>>>> There are various reasons for wanting RDF over a column store -
>>>>>>>>> which use cases are the most important for this work?
>>>>>>>>>
>>>>>>>>> They lead to different ways of using Cassandra. For example,
>>>>>>>>> Rya(incubating) uses Accumulo tables as indexes, and a partial scan
>>>>>>>>> of the table is streaming.  Other systems try to use the columns for
>>>>>>>>> properties, possibly more useful for LDP style than SPARQL.
>>>>>>>>>
>>>>>>>>>   Andy
>>>>>>>>>
>>>>>>>>> On 15/10/16 18:38, Claude Warren wrote:
>>>>>>>>>
>>>>>>>>>> Howdy,
>>>>>>>>>>
>>>>>>>>>> We have a project at work that is implementing Jena Graph on
>>>>>>>>>> Cassandra.  I am wondering if there is enough interest here to
>>>>>>>>>> accept it as a contribution.  I was thinking that it might fit in
>>>>>>>>>> the Extras category.
>>>>>>>>>>
>>>>>>>>>> I can not promise release of the code yet as I have to present it
>>>>>>>>>> to our internal Intellectual Property group first.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Claude


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
