[Neo4j] Re: large datasets

Michael Miller Tue, 20 May 2014 09:28:39 -0700

hi all,
 
we are running into a similar problem, also in the bio research space (not 
surprising).  although the graph has lots of aspects to it, at the heart we 
have Bioentity nodes (genes, proteins, etc.) that have a BELONG_IN 
relationship to AnalysisSetSlice nodes.  the AnalysisSetSlice nodes have 
DATA relationships between them, and a subtype property such that only 
AnalysisSetSlice nodes with the same subtype can have a DATA relationship 
between them.  a Bioentity node might BELONG_IN 1-4 AnalysisSetSlice nodes 
and an AnalysisSetSlice node might have DATA relationships to 32 other 
AnalysisSetSlice nodes.  when we know which Bioentities we are interested 
in, the queries are relatively quick.  but when we want to partition on 
AnalysisSetSlice.subtype using the DATA relationship, the query never 
returns (at least not after running for two days), even though the box is 
not using all the memory and the logs aren't showing any particular 
problem.  for a smaller test graph, the query does return with the correct 
nodes.
 
there are ~20,000 Bioentities and ~80,000 AnalysisSetSlices divided into 4 
subtypes of ~20,000 nodes each.  there are ~2,500,000 DATA relationships, 
so ~32 per AnalysisSetSlice node.  we're using Neo4j 2.0.1 as a java 
embedded database and issuing cypher queries.  the query we ran was meant 
to discover which pairs of Bioentity nodes are related through DATA in only 
one subtype.  here's the query i came up with:


"start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]-(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
with collect(distinct [be1, be2]) as not_paths
start ds1=node:genNodeIdx('subtype:type1')
match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]->(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
where none(not_path in not_paths where be1 = head(not_path) and be2 = last(not_path))
return be1.identifier, be2.identifier"

the problem, it looks like, is that the where clause is 'order n squared': 
every candidate pair from the second match is compared against the whole 
not_paths collection.  not ideal.  what i wanted was the graph equivalent 
of the relational MINUS operator, i think.  when i remove the where clause, 
the query doesn't take long at all.  is there a better way to formulate 
this query?
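
to illustrate the MINUS idea outside of cypher, here is a minimal Python sketch. it assumes the two lists of (be1, be2) identifier pairs have already been fetched by running the two match clauses separately, without the where clause. the function name and the sample identifiers are hypothetical and not taken from the real graph. the point is only that a hash-set lookup replaces the scan of not_paths for every candidate pair, so the work grows roughly linearly with the number of pairs instead of quadratically.

```python
def pairs_only_in(target_pairs, other_pairs):
    """Return the pairs in target_pairs that do not occur in other_pairs.

    This is the set-difference (relational MINUS) of the two pair lists,
    keeping the order of first appearance in target_pairs.
    """
    excluded = set(other_pairs)  # built once; membership tests are O(1) on average
    seen = set()                 # avoids returning the same pair twice
    result = []
    for pair in target_pairs:    # single pass over the candidate pairs
        if pair not in excluded and pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical sample data: ordered (be1, be2) identifier pairs.
type1_pairs = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneD")]
other_pairs = [("geneA", "geneC"), ("geneX", "geneY")]

print(pairs_only_in(type1_pairs, other_pairs))
```

the pairs are compared as ordered tuples, which mirrors the head/last comparison in the where clause above. whether doing this client-side is acceptable depends on how many pairs the two matches return.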
 
thanks,
michael
 
On Sunday, May 18, 2014 9:06:29 AM UTC-7, Alex Frieden wrote:

> Hi guys,
> My group is starting to get into pretty large datasets.  I was wondering 
> if users can talk about their large datasets and how they handled dealing 
> with them.  By large I am talking about a neo4j database over 1TB.  However, any 
> stories of scaling data would be useful.  Thanks!
>

