Re: [Neo4j] Best practices for storing 40 millions of nodes/words in Neo4j

Russ Burkert Tue, 09 Oct 2018 16:03:34 -0700

I'd suggest searching for existing work on NLP database design.  This is a
fairly common problem, and someone has surely documented best practices, so
you can learn from others who have traveled the same road.
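
As a rough illustration of the modelling question raised in the quoted message
(one node per distinct word, with each appearance recorded as a relationship,
instead of one node per token), here is a minimal in-memory sketch in Python.
It is not Neo4j code and makes no claim about Neo4j performance. The names
`WordGraph`, `Occurrence`, `add_sentence`, and `find` are hypothetical and
chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Occurrence:
    """One appearance of a word: which sentence it is in, and where."""
    sentence_id: int
    position: int


@dataclass
class WordGraph:
    """Hypothetical stand-in for the graph: one entry per distinct word text.

    Each distinct word plays the role of a single node; each Occurrence
    plays the role of a relationship from that word to a sentence.
    """
    occurrences: dict = field(default_factory=lambda: defaultdict(list))

    def add_sentence(self, sentence_id, tokens):
        # Reuse the existing entry when the word was seen before,
        # so repeated words do not create additional "nodes".
        for position, text in enumerate(tokens):
            self.occurrences[text].append(Occurrence(sentence_id, position))

    def find(self, text):
        # A single keyed lookup returns every appearance of the word.
        return list(self.occurrences.get(text, []))

    def distinct_words(self):
        return len(self.occurrences)

    def total_tokens(self):
        return sum(len(found) for found in self.occurrences.values())


graph = WordGraph()
graph.add_sentence(0, ["switch", "the", "light"])
graph.add_sentence(1, ["the", "switch", "broke"])
hits = graph.find("switch")
```

In this sketch, six tokens collapse to four distinct word entries, and looking
up a word touches one entry instead of every token with that text. Whether the
same trade-off pays off in the actual graph depends on how the dependency
relations between tokens would be represented, which this sketch leaves out.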


On Tue, Oct 9, 2018 at 2:00 PM John Carlo <johncarlof1...@gmail.com> wrote:

> Hello all,
>
> I've been using Neo4j for some weeks and I think it's awesome.
>
> I'm building an NLP application, and basically, I'm using Neo4j for
> storing the dependency graph generated by a semantic parser, something like
> this:
>
> https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0
>
> In the nodes, I store the text of the single words contained in the
> sentences, and I connect them through relations with a number of different
> types, depending on the semantic function (compound words, subject, adverb,
> and so forth)
>
> For my application, I have the requirement to find all the nodes that
> contain a given word, so basically I have to scan through all the nodes,
> finding those that contain the input word.  Of course, I've already created
> an index on the word text field.
>
> I'm working now on a very big dataset (by the way, the CSV importer is a
> great thing).
>
> On my laptop, the following query takes 22717 ms to execute:
>
> MATCH (t:token)
> WHERE t.text="switch"
> RETURN t.text
>
> Here are the details of the graph.db:
>
> 47,108,544 nodes
> 45,442,034 edges
> 12.81 GiB size
>
> Index created on token.text property
>
> Here are EXPLAIN and PROFILE results:
>
> EXPLAIN MATCH (t:token) WHERE t.text="switch" RETURN t.text
> ---------------------
> NodeIndexSeek
> 321 estimated rows
> ---------------------
> Projection
> 321 estimated rows
> ---------------------
> ProduceResults
> 321 estimated rows
> ---------------------
>
> PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text
> ----------------
> NodeIndexSeek
> 251,680 db hits
> ----------------
> Projection
> 251,680 db hits
> ----------------
> ProduceResults
> 251,680 db hits
> ----------------
>
> I wonder if I'm doing something wrong in indexing such a large number of
> nodes. At the moment I create a node for each word I encounter in the
> dataset, even if the text is the same in two or more different nodes.
>
> Maybe I should create a new node only when a new word is encountered,
> managing the sentence structures through relationships?
>
> Could you please help me with a suggestion or best practice to adopt for
> this specific case? I think that Neo4j is a great piece of software and I'd
> like to make the most out of it :-)
>
> Thank you very much
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

