Re: [Neo4j] large datasets

Michael Miller Wed, 11 Jun 2014 16:42:22 -0700

hi michael,
 
thanks much for the reply.
 
"how many paths does this return?

start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]-(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)"

this returns ~1,800,000 paths
 
"start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')"
 
this returns ~65,000 nodes
 
"would perhaps this work too?"
 
actually i don't think so. be1 might have a relationship to some Bioentity 
through type2-4, and that's fine; it just mustn't have one to the be2 it is 
paired with for inclusion in the results.
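
to make the difference concrete, here is a small python sketch (toy identifiers i made up for illustration, not data from our graph) comparing exclusion by node, which is what two separate collect() lists would give, with exclusion by pair, which is what i'm after:

```python
# Hypothetical toy data: each tuple is a (be1, be2) identifier pair.
type1_pairs = {("A", "B"), ("A", "C"), ("D", "E")}
other_pairs = {("A", "B"), ("D", "F")}  # pairs also linked through type2-4

# Node-level exclusion (two separate collected lists): a pair is dropped
# when be1 appears as a head anywhere, or be2 appears as a tail anywhere.
not_heads = {be1 for be1, _ in other_pairs}
not_lasts = {be2 for _, be2 in other_pairs}
node_level = {(a, b) for a, b in type1_pairs
              if a not in not_heads and b not in not_lasts}

# Pair-level exclusion: a pair is dropped only when that exact pair is
# also linked through another subtype.
pair_level = type1_pairs - other_pairs

print(sorted(node_level))  # every type1 pair is dropped in this toy data
print(sorted(pair_level))
```

in this toy data the node-level filter throws away ("A", "C") and ("D", "E") even though neither pair is linked through another subtype.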
 
but i had already taken your advice on my own and created labels, and just 
yesterday discovered that the following did work (although it took 4 hours):
 

match (be1:Bioentity)-[:BELONGS_IN]-(:type1)-[:DATA]->()-[:BELONGS_IN]-(be2:Bioentity)
where not (
  (be1:Bioentity)-[:BELONGS_IN]-(:type2)-[:DATA]-()-[:BELONGS_IN]-(be2:Bioentity)
  or (be1:Bioentity)-[:BELONGS_IN]-(:type3)-[:DATA]-()-[:BELONGS_IN]-(be2:Bioentity)
  or (be1:Bioentity)-[:BELONGS_IN]-(:type4)-[:DATA]-()-[:BELONGS_IN]-(be2:Bioentity)
)
return distinct be1.identifier, be2.identifier

note that for (:typeN)-[:DATA]-() it is always true in this graph that the 
second node has the same label typeN, i.e. (:typeN)-[:DATA]-(:typeN),

which i like, it seems more natural.  it would be really nice to have a 
MINUS operator, like sql or sparql (and INTERSECT!).  i could see that 
underneath there could be a very efficient implementation that would be 
fast.  better support in neo4j for partitioning based on paths would be 
great.  we're using v2.0.1, and we're moving to 2.1.1 very soon.
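
as a rough illustration of why i think a MINUS could be fast, here is a python sketch (function names are hypothetical, and this says nothing about how neo4j actually evaluates anything). the first version rescans the whole excluded list for every candidate, which is similar in spirit to the none(... in not_paths ...) filter in my earlier query; the second builds a hash set once and probes it:

```python
def minus_nested(candidates, excluded):
    """Quadratic: scans the whole excluded list for every candidate."""
    return [c for c in candidates
            if not any(c == e for e in excluded)]


def minus_hashed(candidates, excluded):
    """Roughly linear: one pass to build a hash set, one pass to probe it."""
    excluded_set = set(excluded)
    return [c for c in candidates if c not in excluded_set]


# Hypothetical (be1, be2) identifier pairs.
candidates = [("A", "B"), ("A", "C"), ("D", "E")]
excluded = [("A", "B"), ("D", "F")]

print(minus_nested(candidates, excluded))
print(minus_hashed(candidates, excluded))
```

both return the same pairs; only the amount of work differs as the lists grow.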

thanks again, michael

 

On Tuesday, June 10, 2014 10:31:54 AM UTC-7, Michael Hunger wrote:

> Perhaps it makes more sense to handle your subtypes with labels instead?
>
> And I'd love to see a picture :)
>
> NEGATION is always tricky to handle.
>
> how many paths does this return?
>
> start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
> match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]-(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
>
> and how many this?
>
> start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
>
>
> would perhaps this work too?
>
> start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
> match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]-(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
> with collect(be1) as not_head_nodes, collect(be2) as not_last_nodes
> start ds1=node:genNodeIdx('subtype:type1')
> match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]->(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
> where not be1 in not_head_nodes and not be2 in not_last_nodes
> return be1.identifier, be2.identifier
>
> On 19.05.2014 at 18:33, Michael Miller <[email protected]> wrote:
>
> hi all,
>  
> we are running into a similar problem, also in the bio research space (not 
> surprising).  although the graph has lots of aspects to it, at the heart we 
> have Bioentity nodes (genes, proteins, etc.) that have a BELONG_IN 
> relationship to AnalysisSetSlice nodes.  the AnalysisSetSlice nodes have 
> DATA relationships between them, and each has a subtype property such that 
> only AnalysisSetSlice nodes with the same subtype can have a DATA 
> relationship between them.  a Bioentity node might BELONG_IN 1-4 
> AnalysisSetSlice nodes, and an AnalysisSetSlice node might have DATA 
> relationships to 32 other AnalysisSetSlice nodes.  when we know which 
> Bioentities we are interested in, the queries are relatively quick, but 
> when we want to partition using the DATA relationship on 
> AnalysisSetSlice.subtype, the query never returns (at least not after 
> running for two days), even though the box is not using all the memory and 
> the logs aren't showing any particular problem.  for a smaller test graph, 
> the query does return with the correct nodes.
>  
> there are ~20,000 Bioentities and ~80,000 AnalysisSetSlices divided into 4 
> subtypes of ~20,000 nodes each.  there are ~2,500,000 DATA relationships, 
> so ~32 per AnalysisSetSlice node.  we're using 2.0.1, the java embedded 
> database, and issuing cypher queries.  the query we ran was to discover 
> which pairs of Bioentity nodes only had a relationship through DATA in one 
> subtype.  here's the query i came up with:
>
> "start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
> match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]-(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
> with collect(distinct [be1, be2]) as not_paths
> start ds1=node:genNodeIdx('subtype:type1')
> match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]->(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
> where none(not_path in not_paths where be1 = head(not_path) and be2 = last(not_path))
> return be1.identifier, be2.identifier"
>
> the problem, it looks like, is that this is 'order n squared', not ideal.  
> what i wanted was the graph equivalent of the relational MINUS operator, i 
> think.  when i remove the where clause, that query doesn't take long at 
> all.  is there a better way to formulate this query?
>  
> thanks,
> michael
>  
> On Sunday, May 18, 2014 9:06:29 AM UTC-7, Alex Frieden wrote:
>
>> Hi guys,
>> My group is starting to get into pretty large datasets.  I was wondering 
>> if users can talk about their large datasets and how they handled dealing 
>> with them.  By large I am talking about a neo4j database over 1TB.  However, 
>> any stories of scaling data would be useful.  Thanks!
>>
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
>
>

