Hello all, I've been using Neo4j for some weeks and I think it's awesome.
I'm building an NLP application, and I'm using Neo4j to store the dependency graphs generated by a semantic parser, something like this: https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0

In the nodes I store the individual words of the sentences, and I connect them through relationships of a number of different types.

My application needs to find all the nodes that contain a given word, so I basically have to search through all the nodes for those whose text matches the input word. Of course, I've already created an index on the word text property. I'm working with a very big dataset (by the way, the CSV importer is a great thing). On my laptop, the following query takes about 20 ms:

    MATCH (t:token) WHERE t.text = "avoid" RETURN t.text

Here are the details of the graph.db:

- 47,108,544 nodes
- 45,442,034 relationships
- 13.39 GiB db size
- index created on the token.text property

    PROFILE MATCH (t:token) WHERE t.text = "switch" RETURN t.text

    NodeIndexSeek     251,679 db hits
    Projection        251,678 db hits
    ProduceResults    251,678 db hits

I wonder if I'm doing something wrong in indexing this many nodes. At the moment, I create a new node for each word I encounter in the text, even when its text is the same as that of existing nodes. Should I instead create a new node only when a previously unseen word is encountered, and manage the sentence structure through relationships?

Could you please help me with a suggestion or best practice to adopt for this specific case? I think Neo4j is a great piece of software and I'd like to make the most of it :-)

Thank you very much
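To make the second option concrete, here is a rough sketch of what I have in mind. The `token.text` property is from my current model; the `word` and `sentence` labels, the `OCCURS_IN` relationship type, and its `position` property are just made-up names for illustration:

    // One node per distinct word, shared across all sentences.
    // MERGE only creates the node if it does not already exist.
    MERGE (w:word {text: "avoid"})

    // The sentence structure would then live entirely in the
    // relationships, e.g. recording where the word occurs:
    MATCH (w:word {text: "avoid"}), (s:sentence {id: 42})
    CREATE (w)-[:OCCURS_IN {position: 3}]->(s)

With this model, looking up a word would touch a single node instead of one node per occurrence, and the occurrences would be reachable by expanding its OCCURS_IN relationships.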