Re: [Neo4j] best practices for storing 40 millions of nodes

John Carlo Wed, 10 Oct 2018 06:42:18 -0700

It's a custom dataset generated from a series of documents in XML format, 
converted to CSV and then imported into Neo4j via the built-in CSV loader.



On Wednesday, 10 October 2018 09:46:39 UTC+2, Sakshi Srivastva wrote:
>
> Can you please tell me which data set you are using?
>
> On Tue, Oct 9, 2018 at 3:50 PM 'Michael Hunger' via Neo4j <
> ne...@googlegroups.com> wrote:
>
>> Yes, I would create each word node only once, and then link the sentence 
>> structures to it.
>> In general, just finding all the word nodes is probably not your end goal, 
>> is it?
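>>
>> A minimal sketch of the "one node per distinct word" idea, assuming the 
>> existing per-occurrence token nodes are kept so the dependency structure 
>> is preserved. The Word label and the INSTANCE_OF relationship type are 
>> hypothetical names chosen for illustration, and the constraint syntax 
>> assumes Neo4j 3.x (newer releases use a different CREATE CONSTRAINT form):
>>
>> ```cypher
>> // One Word node per distinct text; the uniqueness constraint also
>> // creates the index that the lookup below relies on.
>> CREATE CONSTRAINT ON (w:Word) ASSERT w.text IS UNIQUE;
>>
>> // Link every existing occurrence to its shared Word node.
>> // On a graph of this size this would have to run in batches.
>> MATCH (t:token)
>> MERGE (w:Word {text: t.text})
>> MERGE (t)-[:INSTANCE_OF]->(w);
>>
>> // Lookup: seek a single Word node, then follow its relationships
>> // to the occurrences.
>> MATCH (w:Word {text: "switch"})<-[:INSTANCE_OF]-(t:token)
>> RETURN count(t);
>> ```
>>
>> Whether this helps depends on what is done with the occurrences 
>> afterwards; it is one possible layout, not the only one.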
>>
>> It's best to ask on the Community Site & Forum <https://community.neo4j.com>, 
>> in the Modeling and Cypher categories.
>>
>>
>> On Tue, Oct 9, 2018 at 11:00 PM John Carlo <johncar...@gmail.com> wrote:
>>
>>> Hello all, 
>>>
>>> I've been using Neo4j for some weeks and I think it's awesome. 
>>>
>>> I'm building an NLP application, and basically, I'm using Neo4j for 
>>> storing the dependency graph generated by a semantic parser, something like 
>>> this:
>>>
>>> https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0
>>>
>>> In the nodes, I store the single words contained in the sentences, and I 
>>> connect them through relations with a number of different types.
>>>
>>> For my application, I have the requirement to find all the nodes that 
>>> contain a given word, so basically I have to search through all the nodes, 
>>> finding those that contain the input word.  Of course, I've already created 
>>> an index on the word text field.
>>>
>>> I'm working on a very big dataset (by the way, the CSV importer is a 
>>> great thing). 
>>>
>>> On my laptop, the following query takes about 20 ms:
>>> MATCH (t:token) WHERE t.text="avoid" RETURN t.text
>>>
>>> Here are the details of the graph.db:
>>> 47,108,544 nodes
>>> 45,442,034 relationships
>>> 13.39 GiB db size
>>> Index created on token.text field
>>>
>>> PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text
>>> ------------------------
>>> NodeIndexSeek
>>> 251,679 db hits
>>> ---------------
>>> Projection
>>> 251,678 db hits
>>> --------------
>>> ProduceResults
>>> 251,678 db hits
>>>
>>> I wonder if I'm doing something wrong in indexing such a large number of 
>>> nodes. At the moment, I create a new node for each word I encounter in the 
>>> text, even if its text is the same as that of other nodes.
>>>
>>> Should I create a new node only when a new word is encountered, managing 
>>> the sentence structures through relationships?
>>>
>>> Could you please help me with a suggestion or best practice to adopt for 
>>> this specific case? I think that Neo4j is a great piece of software and I'd 
>>> like to make the most out of it :-)
>>>
>>> Thank you very much 
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to neo4j+un...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>
