Definitely food for thought, and the ideas definitely have merit! I was once developing an algorithm to identify groups based on clusters of in-common friend links; and as you touched on in your post, naming/identifying these fledgling anonymous groups was one of the first things I had to deal with. I hadn't even considered adding an optimizer/classifier to detect often-matched relationships. I mean, it makes sense: "these two nodes with this intervening path come up often in queries; perhaps we can ask to name it?"
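To give a flavor of the clustering I was attempting, here's a minimal sketch of the friends-in-common query (the Person label, the :FRIEND type, and the cutoff of 3 are all invented for illustration):

    MATCH (a:Person)-[:FRIEND]-(mutual:Person)-[:FRIEND]-(b:Person)
    WHERE id(a) < id(b)  // count each pair only once
    WITH a, b, count(DISTINCT mutual) AS shared
    WHERE shared >= 3    // arbitrary density cutoff
    RETURN a, b, shared
    ORDER BY shared DESC

Every dense cluster of such pairs was one of those fledgling anonymous groups, and the first question was always what to call it.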
The other cool thing you touched on was a practical mechanism for how to track and manage these derived graphs: basically, "create a new graph for them!" That's way better (and, not to mention, easier) than what I was thinking of. I could even see using the hooks mechanism to create and maintain these derived graph links. As far as applications go, my brain breaks if I think about it too much; the implications are way too staggering. Especially if the thing actually performs even half-way decently! What's also great is that family trees are complicated enough, yet have answers that are thoroughly understood enough, to model the "correct" behaviour for a lot of this derived-link stuff! In the meantime, what's fantastic is I think Neo4j already has the necessary query infrastructure for expanding these sub-queries at runtime through Cypher's support of multiple starting points using aggregates from earlier clauses. I'm thinking that's going to help.

Mike

On Sep 10, 2014 7:09 PM, "Peter Hunsberger" <[email protected]> wrote:

> This caught my eye because it touches on an area that I have a deep interest in, namely the higher levels of classification of data sets and the generalization of metadata as data within graph databases. The addition of higher-level node types rapidly gets into the exploration of hypergraphs, and you may want to Google a bit on some of those implementations and the resultant implications; in particular, super nodes in any large graph. I.e., in your example, gender would divide any normally distributed population in half and not be much use for partitioning. However, I digress...
>
> The whole concept of overlaying metadata on top of another graph has several standard use cases: type systems, security/authorization systems, schema, ontologies, etc. The backwards derivation of the latter of these is particularly interesting. Consider a small graph that builds on your example: SIBLING, PARENT, MOTHER, FATHER, BROTHER, SISTER, etc. It's easy to see how a separate graph of these relationships could be manually defined, and the edges from these to some population might result in useful indexes to optimize searches. More interesting would be watching queries go by and deriving these concepts: an optimizer/classifier might not know the name of the resultant concepts, but persisting the repeatedly used sub-query results as an index would yield metadata that can be used for further analytics if one so desired. E.g., we may not have a simple name for Friend of a Friend once removed (like 2nd cousin), but the concept may still be useful within some domain...
>
> I think the key here would be identification of common or dense sub-query results. Typically, that might start as some form of cache where we can spot high reuse of some set of items. The grouping of similar versions of these cached items would then start to point at patterns (e.g. all sub-query results using the SIBLING relationship). Manual inspection would be the last step: these two subgroups statistically divide this population in half; further inspection shows that is because they diverge on GENDER. We now can label these sub-query results as BROTHER and SISTER. Under the covers, "naming" these results might actually be the naming of nodes of a metadata graph or hypergraph that is associated with the original graph, and this is in turn initially a way of identifying the sub-query results the system found to have a high incidence in the cache.
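> To make that concrete, here is a rough sketch (borrowing the CHILD_OF type and gender property from the examples further down the thread; the exact shape is invented) of the kind of cached sub-query result the classifier would keep seeing before anyone has named it:
>
>     MATCH (n)<-[:CHILD_OF]-(parent)-[:CHILD_OF]->(m)
>     WHERE n <> m AND m.gender = "male"
>     RETURN n, m
>
> Persist the (n, m) pairs this keeps producing as an index, and you have exactly such an unnamed concept, waiting for a human to inspect it and label it BROTHER.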
> I'm mostly guessing at possible implementations, but maybe food for thought for someone...
>
> Peter Hunsberger
>
> On Tue, Sep 9, 2014 at 10:14 PM, Michael Fair <[email protected]> wrote:
>
>> On Thursday, September 4, 2014 2:26:39 AM UTC-7, Mark Findlater wrote:
>>>
>>> hood stuff as I do not know enough about how Neo maintains its indexes, how the query/relationship/object caches work, and whether there is any performance impact of creating twice as many relationships (I expect that if I do not get smarter with my data I would reach maximum capacity of Neo in 3 years; if I doubled the relationships I suppose I would halve that, but that is very naive!).
>>
>> One possible confusion point: the process wouldn't create twice as many relationships between nodes, or even twice as many relationship types; it would only create twice as many relationship type labels (and that's assuming every type got an inverse defined on it, which not everything would).
>>
>> So maybe you go from 50 type labels, or 100, or even 500 type labels, to twice that, but the actual data in the graph itself doesn't change at all; there are just a few more string metadata entries about the graph. I know you get this as what's meant by "not persisting" the information, but I wanted to be clear. We have to persist _something_; this process is only persisting additional labels for the relationship types.
>>
>> That said, this whole conversation, especially your comments about formalizing this into a distributed processing system, has got me thinking that what we really ought to do here is think of every relationship type as a named sub-query, with existing types as a specific trivial case. See below.
>>
>>> Running with a team of developers I would not expect all or even many of them to be operating on raw queries, and similar to not producing inverse methods all over the place (although you could cite Commons StringUtils isEmpty/isNotEmpty type design) I expect people to be comfortable with negation, or applying more complex logical analysis. Which is what I meant about exposing the raw query interface to the world, at which point I might expect less of what people are comfortable with!
>>
>> That's interesting, because I actually expect less complex and domain-specific applications that expose more of the framework, make repeating common usage patterns easy, and make it easier for users to explore, navigate, add, and composite the data and create new forms within their own graphs. :)
>> The application developers are going to know less and less about the specific schemas their users are using over time, and focus more and more on providing good analysis and patterning tools that enable users to take more advantage of their graphs. :)
>>
>>> If you could define the inferred relationships such that the planner could use them (he said :SIBLING but he means :SISTER or :BROTHER) or, as you touched on below, defining the rules and then explicitly running a MapReduce-type function to expose the data sounds very interesting.
>>
>> So I think this is where you've struck upon the BIG IDEA!
>> I'm so glad you mentioned this, as you're right; I was thinking way too small with this idea!
>> You're spot on in thinking this should help out the distributed applications of the "Query Graph Search/Reducer" application.
>> If :SIBLING was one of these new "types as sub-query" things, and it was defined as something like
>>
>>     [s:SIBLING]
>>     MATCH path = (n)<-[:CHILD_OF]-(parent)-[:CHILD_OF]->(m)
>>     WHERE n <> m
>>     SET s.gender = m.gender, s.parents = collect(DISTINCT parent)
>>     RETURN n, s, m
>>
>> then it could absolutely work as you described.
>>
>> And as :SIBLING is now a named query type, defined directly in the database, if the db architect (or a stats-tracker process in the DB engine) decided this relationship was something worth optimizing for, then there's probably a way the engine could treat it like both an INDEX and a RELATIONSHIP, where the [:SIBLING] relationship can be dynamically rebuilt and maintained as needed.
>> Anytime a [:CHILD_OF] relationship is created/deleted, the engine could know/detect that it needs to update the [:SIBLING] index on the updated nodes.
>>
>> Building on this idea, the [:SIBLING] relationship could then be used to define other relationships like [:SISTER] and [:BROTHER]:
>>
>>     [:SISTER] <= MATCH (n)<-[s:SIBLING {gender: "female"}]-(m) RETURN s.parents
>>     [:BROTHER] <= MATCH (n)<-[s:SIBLING {gender: "male"}]-(m) RETURN s.parents
>>     [:STEPSIBLING] <= MATCH (n)<-[s:SIBLING]-(m)-[:PARENT]->(p) WHERE NOT p IN s.parents RETURN s.gender, s.parents
>>
>> ===============================
>> Now that we've got queries, let's talk about reducing the results.
>> Query:
>>
>>     MATCH path = (n)<--(p)-->(m)
>>     RETURN graphReduce(path, [:BROTHER|:SISTER|:SIBLING])
>>
>> So this would take a result set that starts like this [from (n)<--(p)-->(m)]:
>>
>>     (Johnny {gender: male})<-[:STUDENT_OF]-(Sandra)-[:STUDENT_OF]->(Sally {gender: female})
>>     (Johnny {gender: male})<-[:CHILD_OF]-(Bob)-[:CHILD_OF]->(Sally {gender: female})
>>     (Johnny {gender: male})<-[:CHILD_OF]-(Alice)-[:CHILD_OF]->(Sally {gender: female})
>>     (Sally {gender: female})<-[:CHILD_OF]-(Bob)-[:CHILD_OF]->(Johnny {gender: male})
>>     (Sally {gender: female})<-[:CHILD_OF]-(Alice)-[:CHILD_OF]->(Johnny {gender: male})
>>
>> and return:
>>
>>     (Johnny {gender: male})-[:SISTER {parents: (Bob, Alice)}]->(Sally {gender: female})
>>     (Johnny {gender: male})-[:SIBLING {gender: female, parents: (Bob, Alice)}]->(Sally {gender: female})
>>     (Sally {gender: female})-[:BROTHER {parents: (Bob, Alice)}]->(Johnny {gender: male})
>>     (Sally {gender: female})-[:SIBLING {gender: male, parents: (Bob, Alice)}]->(Johnny {gender: male})
>>
>> I don't have the exact semantics nailed down for what happens if multiple relationship matches are found; I think it should just collect them and return them all in no particular order, as that is likely the most parallelizable and distributable operation, as well as the most definable.
>>
>> So using this model, [:PARENT_TO] (the original inverse-relationship goal) could be something like:
>>
>>     [p:PARENT_TO] <=
>>     MATCH (m)<-[c:CHILD_OF]-(n)
>>     SET p = c
>>     RETURN n, p, m
>>
>> What do you think?
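>> (To make the runtime-expansion idea concrete, purely hypothetically: a user query like
>>
>>     MATCH (p)-[:PARENT_TO]->(c) RETURN p, c
>>
>> would never touch stored PARENT_TO edges; the engine would expand it on the fly into the underlying pattern
>>
>>     MATCH (p)<-[:CHILD_OF]-(c) RETURN p, c
>>
>> so only CHILD_OF is ever persisted, much like a SQL VIEW.)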
>> Mike
>>
>>> On Wednesday, 3 September 2014 21:52:57 UTC+1, Michael Fair wrote:
>>>>
>>>> On Wednesday, September 3, 2014 3:29:34 AM UTC-7, Mark Findlater wrote:
>>>>>
>>>>> Hey Mike, I think that this is interesting, and it reminds me of using the OWL *inverseOf* property relationship and other features of the inferencing engine (overview <http://www.w3.org/TR/2004/REC-owl-features-20040210/>, spec <http://www.w3.org/TR/owl-ref/#inverseOf-def>). Or, stepping further back (in my learning), defining rules in Prolog.
>>>>
>>>> This is definitely about making the schema/engine smarter about "semantics".
>>>>
>>>>> I guess that my question as a developer is why I would want the inverse relationship label defined?
>>>>
>>>> I can see where a single app written by a single developer or two, who are all the combined schema master, data integrity maintainer, and code author, likely wouldn't need "help" translating their schema to query it; they wrote it, and likewise if they want to extend the app, they can just write more code to extend the app; if they need a new relationship, they just invent it. I'd argue that even at that small scale, while it's not "needed", it's still helpful.
>>>>
>>>>> With the triple stores I have used there is usually an inference engine which then exposes to queries the union of both the asserted triples (the data that you explicitly added) and the inferred triples (the ones that rules were used to create). The important thing about the inferred triples is that they are ephemeral, meaning that they do not impact the underlying datastore (although you could choose to materialize them), which stops unnecessary bloat (at the cost of processing).
>>>>
>>>> Well, storage-wise, the db is storing a numeric relationship type identifier, and then separately storing the relationship type label information.
>>>>
>>>> So storage-wise, adding multiple relationship type entries for the same type id isn't a problem.
>>>> The on-disk structure is totally capable of handling this; it's just changing the type of block pointed to in the RelationshipTypeStore from String to String[].
>>>> It's the code that loads this from the disk that would need to be updated to understand it's getting a String[] back and not just a String.
>>>>
>>>> I'm now, however, convinced that updating the code to detect a String versus a String[] in the RelationshipTypeStore is going to be way easier than all the splitting and parsing gymnastics I was thinking of earlier! :)
>>>> Though I still might encode the "direction" as part of the string.
>>>>
>>>>> As these new relationships are sugar which is ultimately bloat, would you ever want to persist them? I also suspect that the directional trick you suggest could have serious implications for the performance of the query engine (but that's just a hunch).
>>>>
>>>> We can query (a)<-[:REL]-(b) with the same performance as (a)-[:REL]->(b), so the only trick is getting the planner to know what's being expressed, and that's what I think the parser's job is.
>>>>
>>>> All relationships in neo4j have a from/to direction by design. The label for that direction is rather arbitrarily chosen at design time.
>>>> Being able to define the inverse labels for a relationship type eliminates that required design-time arbitrary selection, and I can't see how it has any performance impact. The planner still knows it's looking for a numeric relationship id (e.g. 62), and what it needs to know about from and to; I suspect most of the heavy lifting on that was created in the parser. From what I've gathered, a "directionless" relationship match actually becomes two matches under the hood, one in each direction.
>>>> This would just make a similar kind of translation.
>>>>
>>>>> The Sibling/Spouse example has limited mileage for me without some clever chaining (e.g. how would I represent cousins? Harden SIBLING and PARENT/SPOUSE relationships and then chain SIBLING->PARENT->SIBLING->CHILD nodes, hardening along the way?) and again I think I would see more value in adding this a level above the underlying store.
>>>>
>>>> You jumped the gun on me! :)
>>>> I'm actually working on a "relationships as sub-queries" plan, and [:COUSIN], [:NIECE], [:NEPHEW] are perfect examples of that.
>>>> I guess I just ought to have simply proposed these named sub-queries as part of the initial discussion. :)
>>>>
>>>> It's cool that you used the PARENT/CHILD pairing in the example (n)-[:PARENT]->(p)-[:SIBLING]->(s)-[:CHILD]->(c). :)
>>>> I see using the PARENT/CHILD pair as way more straightforward to read/follow than (n)-[:PARENT]->(p)-[:SIBLING]->(s)<-[:PARENT]-(c). :)
>>>
>>> It is easier on the eye to read a query that draws you from left to right (in your second example above it does appear as if (s) should be the subject of our attention), but we are discussing orders of beauty on a data interface that is already extremely intuitive (not that I am advocating procrastination).
>>>
>>>> So let's talk about how this can be used to create the (n)-[:COUSIN]->(c) query; this would be a new kind of relationship type.
>>>> If the Cousin type captured a MATCH query like the ones above, then it would be a kindred spirit to a SQL VIEW or FUNCTION as part of the DB schema. Again, it's primarily just creating a syntactic shortcut for a larger query, but it's one that makes the database more comprehensible to those interfacing with it.
>>>
>>> I've banged on about not worrying about comprehensibility, because I think that we would be masking some of the super powers of graphs by concerning ourselves with directional semantics at the query interface level. However, distilling larger traversals into a single pseudo-relationship *could* make representing complex queries more manageable (whilst potentially masking some things you might have wanted, i.e. cousin on which side?).
>>>
>>>> This new relationship type could even be used to create a really powerful result-set reduction/replacement function.
>>>> Imagine taking a graph result from the prior query and running the [:COUSIN] relationship replacement function on it.
>>>>
>>>> So if we take a row from the results of the above query,
>>>> e.g. (r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"
>>>>
>>>> and run GraphMapReduce(r1, [:COUSIN]), then anywhere the intermediate nodes match the query provided by [:COUSIN], they get replaced with the relationship [:COUSIN].
>>>> Giving us:
>>>> (r2) = "ME"-[:COUSIN]->"DAD'S BROTHER'S DAUGHTER"
>>>>
>>>> If the [:COUSIN] type also provided some kind of MAP/REDUCE-type functions as part of its description/definition, then it could even build its own properties dynamically from the underlying properties on the intervening nodes.
>>>>
>>>> This would be especially useful if we were to create a "correct" implementation of Cousin, which really means everyone you are directly related to through marriage or parentage. Your third cousin twice removed, for example, is still technically a "Cousin"; the map/reduce function on "Cousin" could take the intervening path P, count the number of parent hops involved to get the cousin's "order", then count the balance of the parent links up/down to get the cousin's "removal", and make those both properties of the "COUSIN" relationship connecting us. (In the r1 example that's two parent hops up from ME to the shared grandparent and two back down, hence order 1, removed 0.)
>>>>
>>>> So the idea is that you pass a sub-graph or a result set to a named relationship type (which is actually a query).
>>>> It operates on each row in the result set; when the row matches the query, the links between the matching endpoint nodes are replaced by the relationship, and the results of the relationship query become properties on the relationship.
>>>>
>>>> So the final result of this would be that we started with this:
>>>> (r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"
>>>>
>>>> and reduced it to this:
>>>> (r3) = "ME"-[:COUSIN {order: 1, removed: 0}]->"DAD'S BROTHER'S DAUGHTER"
>>>> This, at least in my mind's eye, would involve adding a new field to the relationship type to persist across reboots (though technically it is still just a "String").
>>>
>>> This is the interesting bit for me. Something that you already implied, by calling your class above GraphMapReduce, would be standardising on how analytic or inferencing models could be applied to your graph, and then using this standardised model to build relationships with Hadoop or Spark or your new Apache-incubating distributed-processing-for-graph-stores project. Having a local MapReduce function is nice, but it's essentially a fancy name for a CREATE query; MapReduce implies a chunked & distributed nature. If a job can be described (be it in Cypher, Gremlin, XML, JSON, javascript, R, whatever), thrown at a cluster, chunked up and processed, with the results being either returned, stored back to the same graph, or stored into a new graph, then that's powerful. Sure, you'd use it locally first!
>>>
>>> What I think would be extra smart about breaking this into a new project is the evolution of an ecosystem where not everything is bundled into the core project, taking it beyond what I see as its job, but where interested parties start building function around it that leverages its power.
>>>
>>> I've not used it, but doesn't TinkerPop have some sort of MapReduce-type functionality?
>>>
>>> M
