Hi Michael, 

My conclusion is that I used the super-fast importer and it created a faulty 
data structure. I see the super-fast batch importer has now been removed from git?

Still, I need your help clarifying my confusion about the documentation of 
the batch importer for indexing nodes' properties, in line with the docs for 
Neo4j 2.1.5.

After the batch import, I had no schema.
I am also confused by the difference between the schema command, which shows 
indexes as NodeLabel(NodeProperty), and the index --indexes command, which 
shows indexes under the names I gave them.

I don't know exactly how to use them...


I will describe, step by step, which tools and configurations I used.
I hope this write-up will be useful to other people too.

Please see the questions below; maybe the answers will help pin down where 
my hiccup is in setting up an import with properly configured indexes.



*Hypothesis*
I think the issue is in the generation of the schema and indexes with the 
https://github.com/jexp/batch-import/tree/20 batch importer.

Using Neo4j 2.1.5, this is how to generate a schema:
http://neo4j.com/docs/stable/graphdb-neo4j-schema.html#graphdb-neo4j-schema-constraints

*Schema Indexes* in the documentation at 
https://github.com/jexp/batch-import/tree/20: 
it says I should pre-construct the db and create the schema upfront if I 
want to use schema indexes.

I don't know how to generate a schema upfront without creating nodes first, 
and then import my nodes:
maybe simply running, *upfront*,
CREATE INDEX ON :Topic(name)
?
(for whoever jumped into the conversation only now: 'Topic' and 'name' are the 
label and node property, see 'My indexes' below)


*Tools*
I left the super-fast batch importer aside.

Instead of 
https://github.com/jexp/batch-import

I used the version for Neo4j >= 2.0:
https://github.com/jexp/batch-import/tree/20

I downloaded the batch importer from git, and used the lib folder from this 
zip: 
<https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip>


*My indexes*
*In headers for nodes.csv and rels.csv*

*Nodes.csv*

id:int:source_id     NodeType:label       name:string:topic
As the NodeType label, I set "Topic" (capital 'T'), e.g.:

3998932     Topic       Neo4J Traversing test


*Rels.csv*

id:int:source_id      id:int:source_id     type    proximity:int




*Batch.properties*
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=200M
neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.strings.mapped_memory=200M
batch_array_separator=,
batch_import.csv.quotes=false
#batch_import.csv.delim=,
*batch_import.node_index.source_id=exact*
*batch_import.node_index.topic=fulltext*
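(Side note on how I tried to read those last two lines: if they create legacy 
indexes named 'source_id' and 'topic', I guess they would be queried with 
START rather than MATCH, something like the sketch below — but I am not sure 
whether the lookup key should be the property name from the header ('id', 
'name') or the index name itself, which is part of my confusion.)

```
// exact lookup in the legacy index 'source_id' (key assumed to be 'id')
START n=node:source_id(id = '3998932') RETURN n;

// full-text (Lucene) lookup in the legacy index 'topic' (key assumed 'name')
START n=node:topic('name:Neo4J*') RETURN n;
```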

*Neo4j.properties*

*NOT clear what auto-indexing is for:*
does it index nodes' properties? Full-text, int, and arrays as well?

Could you please clarify, according to the docs on git and the docs for Neo4j 
2.1.5:

Are the headers for my CSV columns properly set up for using legacy indexes 
or 2.0+ indexes?
Should I use auto-indexes to index the full-text 'topic' and 'source_id'?
If so, could you please tell me the syntax to do it in the headers and in 
batch.properties?

In neo4j.properties, what is a key? The property name, or the name of the 
index? In my example:

# The node property keys to be auto-indexed, if enabled
node_keys_indexable= name, id 
or
node_keys_indexable= source_id, topic 
?
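For reference, this is how I read the auto-index section of the 2.1.5 manual 
— a sketch of neo4j.properties assuming the keys are the property names 
(which is exactly what I am asking):

```
# enable auto-indexing of node properties (off by default)
node_auto_indexing=true

# the node property keys to be auto-indexed, if enabled
# (assuming these are property names, i.e. 'id' and 'name' from my headers)
node_keys_indexable=id,name
```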



*Neo4j-wrapper.conf*
Setting the memory heap to 4G (minimum):

wrapper.java.initmemory=4096
#wrapper.java.maxmemory=512




*Results*
in the http://localhost:7474/webadmin/#/console/ 

After the import, I found the *schema* is not set up:

*neo4j-sh (?)$ schema*
*==> No indexes*
*==> *
*==> No constraints*


while:

*neo4j-sh (?)$ index --indexes*
*==> Node indexes:*
*==>   topic*
*==>   source_id*



*Are these indexes meant to be used only with a schema?*
*I supposed yes...*


Query: 
MATCH (n {name : "Topic1"}) , (m {name : "Topic2"}),
p = allShortestPaths((n)-[*..2]-(m))
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC

*12 rows*
*~11K ms*

Query:
MATCH p = (*n:Topic*)-[*0..2]-(*m:Topic*) where n.name = 'My topic name1' 
and m.name= 'My topic name2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

*Had to abort: it took too long (several minutes)*


While using internal ID:

MATCH p = allShortestPaths((n)-[*..4]-(m))
*where ID(n) = 103105 and ID(m) = 2513520*
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC

*229 rows*
*~1.5K ms*

*and *

*229 rows in 134ms if cached*



So the hiccup is in matching by property:

Query:
MATCH (a) where a.name = 'My topic name' return ID(a)

*1 row in 5K ms*


while:

MATCH p = allShortestPaths((n)-[*..4]-(m))
*where ID(n) = 103105 and ID(m) = 1386672*
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC

*returns 9 rows in 62ms*


*Also, matching by my indexed source_id (id:int:source_id) takes a long 
time, as does matching by the full-text-indexed names (name:string:topic):*

MATCH p = allShortestPaths((n)-[*..4]-(m))
where n.id = 1092923 and ID(m) = 21245
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC

*649 rows in 8K ms*



*Trying to resolve the Schema*
I tried to fix the schema *after* the import:

http://neo4j.com/docs/stable/rest-api-schema-indexes.html

by adding 
CREATE INDEX ON :Topic(name)

Now I have:

schema
ON :Topic(name) ONLINE

and I see that:


MATCH p = allShortestPaths((n*:Topic*)-[*..4]-(m*:Topic*))
*where n.name* = 'MyTopicName1' and m.name = 'MyTopicName2'
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC

*9 rows in 70 ms FUCK YEAH! *


*Please note that:*
CREATE INDEX ON :Topic(id)

where 'id' is the property holding the id given to the topics in the source 
data (see the headers in the CSV), 

results in:

no data returned.

Is there a conflict with using the word 'id' as a property name?
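My current understanding (please correct me if wrong) is that a property 
named 'id' and the internal node ID are unrelated, so the two lookups below 
should target different things and there should be no actual name clash:

```
// matches on the *property* 'id' (the id:int:source_id column in my CSV)
MATCH (n:Topic) WHERE n.id = 1092923 RETURN n;

// matches on the *internal* node ID, which is not a property at all
MATCH (n) WHERE ID(n) = 1092923 RETURN n;
```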


*Testing Label AND Indexed Properties with a (proper?) schema AGAINST 
Internal ID*

Queries that do not use allShortestPaths will resolve, but still take an 
order of magnitude longer than queries using internal IDs:

**Using Label AND Indexed Property **

MATCH p = *(n:Topic)*-[*0..2]-*(m:Topic)* where *n.name* = 'Topic1' and 
*m.name* = 'Topic2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

*6 rows in 32253 ms*

**Using Label AND Internal ID **

MATCH p = (n:Topic)-[*0..2]-(m:Topic) where ID(n) = 4115407 and ID(m) = 
667541
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

*6 rows in 5K ms*

**Using only Internal ID **

MATCH p = (n)-[*0..2]-(m) where ID(n) = 4115407 and ID(m) = 667541
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

*6 rows in 16K ms*


1. Could you please explain why this query takes longer than the 
allShortestPaths one?
2. Could you please explain why a query with a label AND an internal ID seems 
to be faster than a query with the internal ID alone?
I understood that labels work with indexes... I cannot figure out if and why 
they should matter with internal IDs.



*Conclusion*
There has been an improvement: my previous data structure was likely not 
imported properly, because matching nodes by internal ID was taking a long 
time as well,

and so was trying to fix the schema with:
CREATE INDEX ON :MyNodeLabel(MyNodeProperty)


The issue (probably) was that I used the super-fast batch importer rather 
than the batch importer (I cannot remember exactly now; two months have 
passed :)


I have now used the batch importer, the version for Neo4j >= 2.0.
I see that matching by internal ID starts to produce the results I expect, 
and that fixing the schema afterwards produces "regular" results too.

Although it looks possible to create an optional schema afterwards, I would 
like to learn how to properly index nodes' properties while importing them. 
I think I am still confused by the concept of indexing, i.e. the indexes 
shown by schema versus those shown by index --indexes.

For whoever is jumping into the discussion now, do read this!
http://nigelsmall.com/neo4j/index-confusion

However, it is not at all clear to me how to use the batch-importer tools to 
properly import indexes together with a schema for Neo4j 2.1.5:

*Are the indexes created with the batch importer legacy indexes (meant for 
use before 2.0), or indexes complying with 2.0?*

*What are the correct syntax and steps for batch-import to import indexes for 
fast queries (full-text + exact) in Neo4j 2.1.5, and thus an optional schema?*

*What is auto-indexing meant to do: does it auto-index (exact and full-text) 
nodes' properties and create a schema for Neo4j 2.1.5? If so, what is the 
correct syntax?*

*If auto-indexing was not meant for legacy indexes, could you please 
elaborate with examples for auto-indexing on GitHub?*

*Could you also include an example of importing an array of properties for 
nodes with the batch importer, so that all the properties in the array are 
indexed for full-text or exact search?*

*What is the largest set of nodes of a certain type (label) that can be 
indexed by one of their properties for full-text search while keeping 
query times efficient?* (E.g. here I have 4M nodes, ALL with the 'Topic' 
label. Can this set scale to 100M? Are there any benchmarks?)
 



Meanwhile, big thanks!


Il giorno sabato 8 novembre 2014 03:06:15 UTC+1, Michael Hunger ha scritto:
>
> You didn't mention before that you used the "superfast" batch-inserter, I 
> think that version is still work in progress, not sure if it creates a 
> normal store.
>
>
> I used my own batch-inserter  github.com/jexp/batch-import
> with these batch.properties:
>
> dump_configuration=false
> cache_type=none
> use_memory_mapped_buffers=true
> neostore.propertystore.db.index.keys.mapped_memory=5M
> neostore.propertystore.db.index.mapped_memory=5M
> neostore.nodestore.db.mapped_memory=200M
> neostore.relationshipstore.db.mapped_memory=2G
> neostore.propertystore.db.mapped_memory=100M
> neostore.relationshipgroupstore.db.mapped_memory=10M
> neostore.propertystore.db.strings.mapped_memory=100M
> batch_array_separator=,
> batch_import.csv.quotes=false
> #batch_import.csv.delim=,
> #batch_import.node_index.source_id=exact
> #batch_import.node_index.topic=fulltext
>
>
> Importing 111111001 Relationships took 478 seconds 
> Total import time: 520 seconds  
>
> Then running your queries, actually without the second limit:
>
> | [Node[103105]{name:"1963-64 Austrian football 
> championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley
>  
> plant"},:My_Proximity[108842982]{proximity:28},Node[5523128]{name:"Kinzirô 
> Miyake"}]                                                                   
>                                                 | 41            |
> | [Node[103105]{name:"1963-64 Austrian football 
> championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley
>  
> plant"},:My_Proximity[25343932]{proximity:27},Node[9598046]{name:"Suzdal 
> Urban Settlement"}]                                                         
>                                                   | 40            |
> | [Node[103105]{name:"1963-64 Austrian football 
> championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley
>  
> plant"},:My_Proximity[108581215]{proximity:13},Node[2627627]{name:"DSFA"}] 
>                                                                             
>                                                 | 26            |
> | [Node[103105]{name:"1963-64 Austrian football 
> championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley
>  
> plant"}]                                                                   
>                                                                             
>                                                 | 13            |
> | [Node[103105]{name:"1963-64 Austrian football championship"}]           
>                                                                             
>                                                                             
>                                                                             
>                           | 0             |
>
> +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> 2241 rows
> 129 ms
>
> Add this to the config of the server: 
> 4G heap
>
>
> # Default values for the low-level graph engine
>
> neostore.nodestore.db.mapped_memory=250M
>
> neostore.relationshipstore.db.mapped_memory=500M
>
> neostore.propertystore.db.mapped_memory=250M
>
> neostore.propertystore.db.strings.mapped_memory=250M
>
>
> can you try this?
>
> also add this to your neo4j.properties
> neostore.relationshipgroupstore.db.mapped_memory=10M
>
>
>
>
>
> On Fri, Oct 3, 2014 at 11:43 PM, gg4u <[email protected] <javascript:>> 
> wrote:
>
>> Hi,
>>
>> here my new answer, I got into this issue:
>>
>> I have a large weighted graph with only one schema index on nodes (Topic):
>> 4M topics and 100M rels.
>>
>> I wanted to find paths between two given nodes.
>>
>> I tried out with queries like this one:
>> since it is a weighted graph, I compute the weighted path between nodes 
>> as the sum of its weight (I called weight 'proximity' here).
>>
>> Problem is, a query of this type, on such a large graph, tooks ages:
>>
>> Note that using an index, either directly the internal id, give same 
>> responsive results 
>> *Is there any way to speed up performance to reasonable production time?* 
>> (lower than 1s ... it means 3 orders of magnitude ... )
>>
>> MATCH (n) , (m), p = (n)-[*0..2]-(m)
>> where id(n) = 103105 and id(m) = 1386672
>> with p, n, m
>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
>> n.proximity) AS pathProximity order by pathProximity DESC;
>>
>> *~1M ms !!! *
>>
>>
>> same as
>> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m)
>> where n.name = 'title-1' and id(m) = 'title-2'
>> with p, n, m
>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
>> n.proximity) AS pathProximity order by pathProximity DESC;
>>
>> *~2M ms !!! *
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
