[Neo4j] Best practices for storing 40 million nodes/words in Neo4j

John Carlo Tue, 09 Oct 2018 14:00:23 -0700

Hello all, 

I've been using Neo4j for a few weeks and I think it's awesome.


I'm building an NLP application, and basically, I'm using Neo4j for storing 
the dependency graph generated by a semantic parser, something like this:
https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0

Each node stores the text of a single word from a sentence, and I connect 
the nodes through relationships of several different types, depending on 
the semantic function (compound words, subject, adverb, and so forth).

For my application, I need to find all the nodes whose text equals a given 
word, so conceptually the query has to go through all the nodes and pick 
out those that match the input word.  Of course, I've already created an 
index on the text property of the nodes.
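
For reference, in Neo4j 3.x an index like this is typically created and 
checked as sketched below. The label and property names come from the query 
in this post; the 3.x syntax is an assumption about the version in use, and 
the two statements are meant to be run one at a time.

```cypher
// Assumes Neo4j 3.x schema syntax; :token and text are the names used in this post.
CREATE INDEX ON :token(text);

// Lists the existing indexes with their state (ONLINE, POPULATING or FAILED).
CALL db.indexes();
```

An index that is still POPULATING is not used by the planner yet, so the 
state is worth checking right after a large import.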

I'm now working on a very large dataset (by the way, the CSV importer is a 
great thing). 

On my laptop, the following query takes 22717 ms (about 23 seconds) to 
execute:

MATCH (t:token)
WHERE t.text="switch"
RETURN t.text

Here are the details of the graph.db:

47,108,544 nodes
45,442,034 edges
12.81 GiB size

Index created on token.text property

Here are EXPLAIN and PROFILE results:

EXPLAIN MATCH (t:token) WHERE t.text="switch" RETURN t.text
---------------------
NodeIndexSeek
321 estimated rows
---------------------
Projection
321 estimated rows
---------------------
ProduceResults
321 estimated rows
---------------------

PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text
---------------------
NodeIndexSeek
251,680 db hits
---------------------
Projection
251,680 db hits
---------------------
ProduceResults
251,680 db hits
---------------------

I wonder whether I'm doing something wrong in indexing such a large number 
of nodes. At the moment I create a node for every word occurrence in the 
dataset, even if the same text is already stored in one or more other nodes.

Should I instead create a new node only when a new word is encountered, 
and represent the sentence structures through relationships?
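
To make that alternative concrete, here is a rough sketch of what it might 
look like, assuming Neo4j 3.x syntax. Everything in it is hypothetical and 
only for illustration: the :word label, the POSS relationship type and the 
sentence_id property are not part of my current model. Each statement is 
meant to be run separately.

```cypher
// Hypothetical model: one :word node per distinct text.
// The uniqueness constraint also gives MERGE an index to look up by.
CREATE CONSTRAINT ON (w:word) ASSERT w.text IS UNIQUE;

// MERGE reuses an existing :word node or creates it if missing.
// The relationship carries a hypothetical sentence_id so that
// individual sentences can still be told apart.
MERGE (head:word {text: "name"})
MERGE (dep:word {text: "your"})
CREATE (head)-[:POSS {sentence_id: 1}]->(dep);

// A lookup by text now matches at most one node.
MATCH (w:word {text: "switch"})
RETURN w.text;
```

The trade-off would be that frequent words become very densely connected 
nodes, and every traversal has to filter relationships by sentence.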

Could you please help me with a suggestion or best practice to adopt for 
this specific case? I think that Neo4j is a great piece of software and I'd 
like to make the most out of it :-)

Thank you very much 

