hi all,
we are running into a similar problem, also in the bio research space (not
surprising). although the graph has lots of aspects to it, at its heart we
have Bioentity nodes (genes, proteins, etc.) that have a BELONG_IN
relationship to AnalysisSetSlice nodes. AnalysisSetSlice nodes are linked to
each other by DATA relationships, and each has a subtype property: only
AnalysisSetSlice nodes with the same subtype can have a DATA relationship
between them. a Bioentity node might BELONG_IN 1-4 AnalysisSetSlice nodes,
and an AnalysisSetSlice node might have DATA relationships to 32 other
AnalysisSetSlice nodes.

when we know which Bioentities we are interested in, the queries are
relatively quick. but when we want to partition on AnalysisSetSlice.subtype
using the DATA relationship, the query never returns (at least not after
running for two days), even though the box is not using all of its memory and
the logs aren't showing any particular problem. for a smaller test graph, the
same query does return with the correct nodes.
there are ~20,000 Bioentities and ~80,000 AnalysisSetSlices, divided into 4
subtypes of ~20,000 nodes each. there are ~2,500,000 DATA relationships,
so ~32 per AnalysisSetSlice node. we're using neo4j 2.0.1 as a java embedded
database and issuing cypher queries. the query we ran was meant to discover
which pairs of Bioentity nodes are related through DATA in only one
subtype. here's the query i came up with:

"start ds1=node:genNodeIdx('subtype:type2 OR subtype:type3 OR subtype:type4')
match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]-(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
with collect(distinct [be1, be2]) as not_paths
start ds1=node:genNodeIdx('subtype:type1')
match (be1:Bioentity)-[:BELONG_IN]-(ds1:AnalysisSetSlice)-[:DATA]->(ds2:AnalysisSetSlice)-[:BELONG_IN]-(be2:Bioentity)
where none(not_path in not_paths where be1 = head(not_path) and be2 = last(not_path))
return be1.identifier, be2.identifier"

the problem, it looks like, is that this is 'order n squared': for every
candidate pair, the where clause scans the whole not_paths collection. not
ideal. what i wanted, i think, was the graph equivalent of the relational
MINUS operator. when i remove the where clause, the query doesn't take long
at all. is there a better way to formulate this query?
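
for illustration, the MINUS idea can be sketched in plain java on the embedded
side: run the two match parts as separate queries (without the where clause),
load the excluded pairs into a hash set, and filter the type1 pairs against
it, so each lookup is roughly constant time instead of a scan of not_paths.
this is only a sketch under assumptions: PairMinus, key and minus are
hypothetical names, pairs are represented by node ids, and nothing here calls
the neo4j api.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Neo4j: computes "candidates MINUS excluded"
// over ordered pairs of node ids.
class PairMinus {

    // Encode an ordered pair as a string so it can be stored in a HashSet.
    // (a, b) and (b, a) get different keys, matching the head/last
    // comparison in the cypher where clause.
    static String key(long a, long b) {
        return a + ":" + b;
    }

    // Return the candidate pairs that do not appear in excluded,
    // preserving the order of the candidates.
    static List<long[]> minus(List<long[]> candidates, List<long[]> excluded) {
        Set<String> seen = new HashSet<String>();
        for (long[] p : excluded) {
            seen.add(key(p[0], p[1]));
        }
        List<long[]> result = new ArrayList<long[]>();
        for (long[] p : candidates) {
            if (!seen.contains(key(p[0], p[1]))) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up ids, standing in for the rows of the two queries.
        List<long[]> candidates = new ArrayList<long[]>();
        candidates.add(new long[] {1, 2});
        candidates.add(new long[] {3, 4});
        List<long[]> excluded = new ArrayList<long[]>();
        excluded.add(new long[] {1, 2});
        for (long[] p : minus(candidates, excluded)) {
            System.out.println(p[0] + " " + p[1]);
        }
    }
}
```

the trade-off is that both result sets have to fit in the java heap, and the
work is linear in the two result sizes rather than their product.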
thanks,
michael
On Sunday, May 18, 2014 9:06:29 AM UTC-7, Alex Frieden wrote:
> Hi guys,
> My group is starting to get into pretty large datasets. I was wondering
> if users can talk about their large datasets and how they handled dealing
> with them. By large I am talking about a neo4j database over 1TB. However,
> any stories of scaling data would be useful. Thanks!
>