Definitely food for thought, and the ideas definitely have merit! I was once developing an algorithm to identify groups based on clusters of in-common friend links; and as you touched on in your post, naming/identifying these fledgling anonymous groups was one of the first things I had to deal with. I hadn't even considered adding an optimizer/classifier to detect often-matched relationships. I mean, it makes sense: "these two nodes with this intervening path come up often in queries; perhaps we can ask to name it?"
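To give a flavor of the clustering I was attempting, here's a minimal sketch of the friends-in-common query (the Person label, the :FRIEND type, and the cutoff of 3 are all invented for illustration):

    MATCH (a:Person)-[:FRIEND]-(mutual:Person)-[:FRIEND]-(b:Person)
    WHERE id(a) < id(b)  // count each pair only once
    WITH a, b, count(DISTINCT mutual) AS shared
    WHERE shared >= 3    // arbitrary density cutoff
    RETURN a, b, shared
    ORDER BY shared DESC

Every dense cluster of such pairs was one of those fledgling anonymous groups, and the first question was always what to call it.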
The other cool thing you touched on was a practical mechanism for how to track and manage these derived graphs: basically, "create a new graph for them!" That's way better (and, not to mention, easier) than what I was thinking of. I could even see using the hooks mechanism to create and maintain these derived graph links. As far as applications go, my brain breaks if I think about it too much; the implications are way too staggering. Especially if the thing actually performs even half-way decently! What's also great is that family trees are complicated enough, yet have answers that are thoroughly understood enough, to model the "correct" behaviour for a lot of this derived-link stuff! In the meantime, what's fantastic is I think Neo4j already has the necessary query infrastructure for expanding these sub-queries at runtime through Cypher's support of multiple starting points using aggregates from earlier clauses. I'm thinking that's going to help.

Mike

On Sep 10, 2014 7:09 PM, "Peter Hunsberger" <[email protected]> wrote:

> This caught my eye because it touches on an area that I have a deep interest in, namely the higher levels of classification of data sets and the generalization of metadata as data within graph databases. The addition of higher-level node types rapidly gets into the exploration of hypergraphs, and you may want to Google a bit on some of those implementations and the resultant implications; in particular, super nodes in any large graph. I.e., in your example, gender would divide any normally distributed population in half and not be much use for partitioning. However, I digress...
>
> The whole concept of overlaying metadata on top of another graph has several standard use cases: type systems, security/authorization systems, schema, ontologies, etc. The backwards derivation of the latter of these is particularly interesting. Consider a small graph that builds on your example: SIBLING, PARENT, MOTHER, FATHER, BROTHER, SISTER, etc. It's easy to see how a separate graph of these relationships could be manually defined, and the edges from these to some population might result in useful indexes to optimize searches. More interesting would be watching queries go by and deriving these concepts: an optimizer/classifier might not know the name of the resultant concepts, but persisting the repeatedly used sub-query results as an index would yield metadata that can be used for further analytics if one so desired. E.g., we may not have a simple name for Friend of a Friend once removed (like 2nd cousin), but the concept may still be useful within some domain...
>
> I think the key here would be identification of common or dense sub-query results. Typically, that might start as some form of cache where we can spot high reuse of some set of items. The grouping of similar versions of these cached items would then start to point at patterns (e.g. all sub-query results using the SIBLING relationship). Manual inspection would be the last step: these two subgroups statistically divide this population in half; further inspection shows that is because they diverge on GENDER. We now can label these sub-query results as BROTHER and SISTER. Under the covers, "naming" these results might actually be the naming of nodes of a metadata graph or hypergraph that is associated with the original graph, and this is in turn initially a way of identifying the sub-query results the system found to have a high incidence in the cache.
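> To make that concrete, here is a rough sketch (borrowing the CHILD_OF type and gender property from the examples further down the thread; the exact shape is invented) of the kind of cached sub-query result the classifier would keep seeing before anyone has named it:
>
>     MATCH (n)<-[:CHILD_OF]-(parent)-[:CHILD_OF]->(m)
>     WHERE n <> m AND m.gender = "male"
>     RETURN n, m
>
> Persist the (n, m) pairs this keeps producing as an index, and you have exactly such an unnamed concept, waiting for a human to inspect it and label it BROTHER.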
> I'm mostly guessing at possible implementations, but maybe food for thought for someone...
>
> Peter Hunsberger
>
> On Tue, Sep 9, 2014 at 10:14 PM, Michael Fair <[email protected]> wrote:
>
>> On Thursday, September 4, 2014 2:26:39 AM UTC-7, Mark Findlater wrote:
>>>
>>> hood stuff as I do not know enough about how Neo maintains its indexes, how the query/relationship/object caches work, and whether there is any performance impact of creating twice as many relationships (I expect that if I do not get smarter with my data I would reach maximum capacity of Neo in 3 years; if I doubled the relationships I suppose I would halve that, but that is very naive!).
>>
>> One possible confusion point: the process wouldn't create twice as many relationships between nodes, or even twice as many relationship types; it would only create twice as many relationship type labels (and that's assuming every type got an inverse defined on it, which not everything would).
>>
>> So maybe you go from 50 type labels, or 100, or even 500 type labels, to twice that, but the actual data in the graph itself doesn't change at all; there are just a few more string metadata entries about the graph. I know you get this as what's meant by "not persisting" the information, but I wanted to be clear. We have to persist _something_; this process is only persisting additional labels for the relationship types.
>>
>> That said, this whole conversation, especially your comments about formalizing this into a distributed processing system, has got me thinking that what we really ought to do here is think of every relationship type as a named sub-query, with existing types as a specific trivial case. See below.
>>
>>> Running with a team of developers I would not expect all or even many of them to be operating on raw queries, and similar to not producing inverse methods all over the place (although you could cite Commons StringUtils isEmpty/isNotEmpty type design) I expect people to be comfortable with negation, or applying more complex logical analysis. Which is what I meant about exposing the raw query interface to the world, at which point I might expect less of what people are comfortable with!
>>
>> That's interesting, because I actually expect less complex and domain-specific applications that expose more of the framework, make repeating common usage patterns easy, and make it easier for users to explore, navigate, add, and composite the data and create new forms within their own graphs. :)
>> The application developers are going to know less and less about the specific schemas their users are using over time, and focus more and more on providing good analysis and patterning tools that enable users to take more advantage of their graphs. :)
>>
>>> If you could define the inferred relationships such that the planner could use them (he said :SIBLING but he means :SISTER or :BROTHER) or, as you touched on below, defining the rules and then explicitly running a MapReduce-type function to expose the data sounds very interesting.
>>
>> So I think this is where you've struck upon the BIG IDEA!
>> I'm so glad you mentioned this, as you're right; I was thinking way too small with this idea!
>> You're spot on in thinking this should help out the distributed applications of the "Query Graph Search/Reducer" application.
>> If :SIBLING was one of these new "types as sub-query" things, and it was defined as something like
>>
>>     [s:SIBLING]
>>     MATCH path = (n)<-[:CHILD_OF]-(parent)-[:CHILD_OF]->(m)
>>     WHERE n <> m
>>     SET s.gender = m.gender, s.parents = collect(DISTINCT parent)
>>     RETURN n, s, m
>>
>> then it could absolutely work as you described.
>>
>> And as :SIBLING is now a named query type, defined directly in the database, if the db architect (or a stats-tracker process in the DB engine) decided this relationship was something worth optimizing for, then there's probably a way the engine could treat it like both an INDEX and a RELATIONSHIP, where the [:SIBLING] relationship can be dynamically rebuilt and maintained as needed.
>> Anytime a [:CHILD_OF] relationship is created/deleted, the engine could know/detect that it needs to update the [:SIBLING] index on the updated nodes.
>>
>> Building on this idea, the [:SIBLING] relationship could then be used to define other relationships like [:SISTER] and [:BROTHER]:
>>
>>     [:SISTER] <= MATCH (n)<-[s:SIBLING {gender: "female"}]-(m) RETURN s.parents
>>     [:BROTHER] <= MATCH (n)<-[s:SIBLING {gender: "male"}]-(m) RETURN s.parents
>>     [:STEPSIBLING] <= MATCH (n)<-[s:SIBLING]-(m)-[:PARENT]->(p) WHERE NOT p IN s.parents RETURN s.gender, s.parents
>>
>> ===============================
>> Now that we've got queries, let's talk about reducing the results.
>> Query:
>>
>>     MATCH path = (n)<--(p)-->(m)
>>     RETURN graphReduce(path, [:BROTHER|:SISTER|:SIBLING])
>>
>> So this would take a result set that starts like this [from (n)<--(p)-->(m)]:
>>
>>     (Johnny {gender: male})<-[:STUDENT_OF]-(Sandra)-[:STUDENT_OF]->(Sally {gender: female})
>>     (Johnny {gender: male})<-[:CHILD_OF]-(Bob)-[:CHILD_OF]->(Sally {gender: female})
>>     (Johnny {gender: male})<-[:CHILD_OF]-(Alice)-[:CHILD_OF]->(Sally {gender: female})
>>     (Sally {gender: female})<-[:CHILD_OF]-(Bob)-[:CHILD_OF]->(Johnny {gender: male})
>>     (Sally {gender: female})<-[:CHILD_OF]-(Alice)-[:CHILD_OF]->(Johnny {gender: male})
>>
>> and return:
>>
>>     (Johnny {gender: male})-[:SISTER {parents: (Bob, Alice)}]->(Sally {gender: female})
>>     (Johnny {gender: male})-[:SIBLING {gender: female, parents: (Bob, Alice)}]->(Sally {gender: female})
>>     (Sally {gender: female})-[:BROTHER {parents: (Bob, Alice)}]->(Johnny {gender: male})
>>     (Sally {gender: female})-[:SIBLING {gender: male, parents: (Bob, Alice)}]->(Johnny {gender: male})
>>
>> I don't have the exact semantics nailed down for what happens if multiple relationship matches are found; I think it should just collect them and return them all in no particular order, as that is likely the most parallelizable and distributable operation, as well as the most definable.
>>
>> So using this model, [:PARENT_TO] (the original inverse-relationship goal) could be something like:
>>
>>     [p:PARENT_TO] <=
>>     MATCH (m)<-[c:CHILD_OF]-(n)
>>     SET p = c
>>     RETURN n, p, m
>>
>> What do you think?
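>> (To make the runtime-expansion idea concrete, purely hypothetically: a user query like
>>
>>     MATCH (p)-[:PARENT_TO]->(c) RETURN p, c
>>
>> would never touch stored PARENT_TO edges; the engine would expand it on the fly into the underlying pattern
>>
>>     MATCH (p)<-[:CHILD_OF]-(c) RETURN p, c
>>
>> so only CHILD_OF is ever persisted, much like a SQL VIEW.)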
>> Mike
>>
>>> On Wednesday, 3 September 2014 21:52:57 UTC+1, Michael Fair wrote:
>>>>
>>>> On Wednesday, September 3, 2014 3:29:34 AM UTC-7, Mark Findlater wrote:
>>>>>
>>>>> Hey Mike, I think that this is interesting, and it reminds me of using the OWL *inverseOf* property relationship and other features of the inferencing engine (overview <http://www.w3.org/TR/2004/REC-owl-features-20040210/>, spec <http://www.w3.org/TR/owl-ref/#inverseOf-def>). Or, stepping further back (in my learning), defining rules in Prolog.
>>>>
>>>> This is definitely about making the schema/engine smarter about "semantics".
>>>>
>>>>> I guess that my question as a developer is why I would want the inverse relationship label defined?
>>>>
>>>> I can see where a single app written by a single developer or two, who are all the combined schema master, data integrity maintainer, and code author, likely wouldn't need "help" translating their schema to query it; they wrote it, and likewise if they want to extend the app, they can just write more code to extend the app; if they need a new relationship, they just invent it. I'd argue that even at that small scale, while it's not "needed", it's still helpful.
>>>>
>>>>> With the triple stores I have used there is usually an inference engine which then exposes to queries the union of both the asserted triples (the data that you explicitly added) and the inferred triples (the ones that rules were used to create). The important thing about the inferred triples is that they are ephemeral, meaning that they do not impact the underlying datastore (although you could choose to materialize them), which stops unnecessary bloat (at the cost of processing).
>>>>
>>>> Well, storage-wise, the db is storing a numeric relationship type identifier, and then separately storing the relationship type label information.
>>>>
>>>> So storage-wise, adding multiple relationship type entries for the same type id isn't a problem.
>>>> The on-disk structure is totally capable of handling this; it's just changing the type of block pointed to in the RelationshipTypeStore from String to String[].
>>>> It's the code that loads this from the disk that would need to be updated to understand it's getting a String[] back and not just a String.
>>>>
>>>> I'm now, however, convinced that updating the code to detect a String versus a String[] in the RelationshipTypeStore is going to be way easier than all the splitting and parsing gymnastics I was thinking of earlier! :)
>>>> Though I still might encode the "direction" as part of the string.
>>>>
>>>>> As these new relationships are sugar which is ultimately bloat, would you ever want to persist them? I also suspect that the directional trick you suggest could have serious implications for the performance of the query engine (but that's just a hunch).
>>>>
>>>> We can query (a)<-[:REL]-(b) with the same performance as (a)-[:REL]->(b), so the only trick is getting the planner to know what's being expressed, and that's what I think the parser's job is.
>>>>
>>>> All relationships in neo4j have a from/to direction by design. The label for that direction is rather arbitrarily chosen at design time.
>>>> Being able to define the inverse labels for a relationship type eliminates that required design-time arbitrary selection, and I can't see how it has any performance impact. The planner still knows it's looking for a numeric relationship id (e.g. 62), and what it needs to know about from and to; I suspect most of the heavy lifting on that was created in the parser. From what I've gathered, a "directionless" relationship match actually becomes two matches under the hood, one in each direction.
>>>> This would just make a similar kind of translation.
>>>>
>>>>> The Sibling/Spouse example has limited mileage for me without some clever chaining (e.g. how would I represent cousins? Harden SIBLING and PARENT/SPOUSE relationships and then chain SIBLING->PARENT->SIBLING->CHILD nodes, hardening along the way?) and again I think I would see more value in adding this a level above the underlying store.
>>>>
>>>> You jumped the gun on me! :)
>>>> I'm actually working on a "relationships as sub-queries" plan, and [:COUSIN], [:NIECE], [:NEPHEW] are perfect examples of that.
>>>> I guess I just ought to have simply proposed these named sub-queries as part of the initial discussion. :)
>>>>
>>>> It's cool that you used the PARENT/CHILD pairing in the example (n)-[:PARENT]->(p)-[:SIBLING]->(s)-[:CHILD]->(c). :)
>>>> I see using the PARENT/CHILD pair as way more straightforward to read/follow than (n)-[:PARENT]->(p)-[:SIBLING]->(s)<-[:PARENT]-(c). :)
>>>
>>> It is easier on the eye to read a query that draws you from left to right (in your second example above it does appear as if (s) should be the subject of our attention), but we are discussing orders of beauty on a data interface that is already extremely intuitive (not that I am advocating procrastination).
>>>
>>>> So let's talk about how this can be used to create the (n)-[:COUSIN]->(c) query; this would be a new kind of relationship type.
>>>> If the Cousin type captured a MATCH query like the ones above, then it would be a kindred spirit to a SQL VIEW or FUNCTION as part of the DB schema. Again, it's primarily just creating a syntactic shortcut for a larger query, but it's one that makes the database more comprehensible to those interfacing with it.
>>>
>>> I've banged on about not worrying about comprehensibility, because I think that we would be masking some of the super powers of graphs by concerning ourselves with directional semantics at the query interface level. However, distilling larger traversals into a single pseudo-relationship *could* make representing complex queries more manageable (whilst potentially masking some things you might have wanted, i.e. cousin on which side?).
>>>
>>>> This new relationship type could even be used to create a really powerful result-set reduction/replacement function.
>>>> Imagine taking a graph result from the prior query and running the [:COUSIN] relationship replacement function on it.
>>>>
>>>> So if we take a row from the results of the above query,
>>>> e.g. (r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"
>>>>
>>>> and run GraphMapReduce(r1, [:COUSIN]), then anywhere the intermediate nodes match the query provided by [:COUSIN], they get replaced with the relationship [:COUSIN].
>>>> Giving us:
>>>> (r2) = "ME"-[:COUSIN]->"DAD'S BROTHER'S DAUGHTER"
>>>>
>>>> If the [:COUSIN] type also provided some kind of MAP/REDUCE-type functions as part of its description/definition, then it could even build its own properties dynamically from the underlying properties on the intervening nodes.
>>>>
>>>> This would be especially useful if we were to create a "correct" implementation of Cousin, which really means everyone you are directly related to through marriage or parentage. Your third cousin twice removed, for example, is still technically a "Cousin"; the map/reduce function on "Cousin" could take the intervening path P, count the number of parent hops involved to get the cousin's "order", then count the balance of the parent links up/down to get the cousin's "removal", and make those both properties of the "COUSIN" relationship connecting us. (In the r1 example that's two parent hops up from ME to the shared grandparent and two back down, hence order 1, removed 0.)
>>>>
>>>> So the idea is that you pass a sub-graph or a result set to a named relationship type (which is actually a query).
>>>> It operates on each row in the result set; when the row matches the query, the links between the matching endpoint nodes are replaced by the relationship, and the results of the relationship query become properties on the relationship.
>>>>
>>>> So the final result of this would be that we started with this:
>>>> (r1) = "ME", [:PARENT], "MYDAD", [:SIBLING], "DAD'S BROTHER", [:CHILD], "DAD'S BROTHER'S DAUGHTER"
>>>>
>>>> and reduced it to this:
>>>> (r3) = "ME"-[:COUSIN {order: 1, removed: 0}]->"DAD'S BROTHER'S DAUGHTER"
>>>> This, at least in my mind's eye, would involve adding a new field to the relationship type to persist across reboots (though technically it is still just a "String").
>>>
>>> This is the interesting bit for me. Something that you already implied, by calling your class above GraphMapReduce, would be standardising on how analytic or inferencing models could be applied to your graph, and then using this standardised model to build relationships with Hadoop or Spark or your new Apache-incubating distributed-processing-for-graph-stores project. Having a local MapReduce function is nice, but it's essentially a fancy name for a CREATE query; MapReduce implies a chunked & distributed nature. If a job can be described (be it in Cypher, Gremlin, XML, JSON, javascript, R, whatever), thrown at a cluster, chunked up and processed, with the results being either returned, stored back to the same graph, or stored into a new graph, then that's powerful. Sure, you'd use it locally first!
>>>
>>> What I think would be extra smart about breaking this into a new project is the evolution of an ecosystem where not everything is bundled into the core project, taking it beyond what I see as its job, but where interested parties start building function around it that leverages its power.
>>>
>>> I've not used it, but doesn't TinkerPop have some sort of MapReduce-type functionality?
>>>
>>> M
