Re: [Neo4j] best practices for storing 40 millions of nodes

Russ Burkert Wed, 10 Oct 2018 19:36:40 -0700

This post has some good info on working with NLP graphs:
https://tbgraph.wordpress.com/2018/09/09/article-recommendation-system-on-a-citation-network-using-personalized-pagerank-and-neo4j/
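
On the modeling question quoted below (one node per unique word, with sentence structure carried by relationships), here is a minimal Python sketch of the idea. It is plain in-memory Python, not Neo4j code. The sample sentences, the `build_graph` and `find_pattern` helpers, and the choice to record a sentence id on each relationship are all illustrative assumptions, not something taken from the thread. The sentence id is there because once word nodes are shared, a pattern could otherwise match hops that come from different sentences.

```python
from collections import defaultdict

# Hypothetical parsed sentences: sentence id -> (head, relation type, dependent).
# The relation names mirror the ACTION/SUBJECT example quoted below.
sentences = {
    1: [("John", "ACTION", "eat"), ("eat", "SUBJECT", "apple")],
    2: [("Mary", "ACTION", "eat"), ("eat", "SUBJECT", "pear")],
    3: [("John", "ACTION", "see"), ("see", "SUBJECT", "apple")],
}


def build_graph(sentences):
    """Build one node per unique word and one edge per distinct relation.

    Each edge keeps the set of sentence ids in which it occurs.
    """
    nodes = set()
    edges = defaultdict(set)  # (source, rel_type, target) -> sentence ids
    for sentence_id, triples in sentences.items():
        for source, rel_type, target in triples:
            nodes.add(source)
            nodes.add(target)
            edges[(source, rel_type, target)].add(sentence_id)
    return nodes, edges


def find_pattern(edges, path):
    """Return the ids of sentences that contain every hop of the path.

    `path` alternates words and relation types:
    [word, rel_type, word, rel_type, word, ...]
    """
    hops = [(path[i], path[i + 1], path[i + 2]) for i in range(0, len(path) - 2, 2)]
    matching = None
    for hop in hops:
        ids = edges.get(hop, set())
        # Intersect so that all hops must come from the same sentence.
        matching = set(ids) if matching is None else matching & ids
    return matching or set()


nodes, edges = build_graph(sentences)
print(len(nodes))  # 6 unique words for 9 word occurrences
print(find_pattern(edges, ["John", "ACTION", "eat", "SUBJECT", "apple"]))  # {1}
```

In this sketch, "John eats a pear" is not reported even though both hops exist in the data, because they occur in different sentences. Whether the same trick (a sentence id property on relationships) performs well in Neo4j at 45 million relationships is something that would need to be measured.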

On Wed, Oct 10, 2018 at 2:41 PM John Carlo <johncarlof1...@gmail.com> wrote:

> You could start with the 20 newsgroups dataset
> http://qwone.com/~jason/20Newsgroups/
>
> On Wednesday, 10 October 2018 at 17:42:37 UTC+2, Sakshi Srivastva wrote:
>>
>> Sir, I am looking for a data set in which I can find hidden facts, like
>> the Panama leak. Please suggest a similar large data set.
>>
>> On Wed, Oct 10, 2018 at 7:34 PM John Carlo <johncar...@gmail.com> wrote:
>>
>>> Hello Michael,
>>>
>>> Thank you for your reply.
>>>
>>> I've re-implemented the db structure using unique word nodes, and the
>>> number of nodes dropped from 47,108,544 to 1,934,049.
>>>
>>> I still have a huge number of relationships (45,442,034) that now point
>>> to the unique nodes, and the queries are slow.
>>>
>>> My end goal is to find specific patterns in sentence structures, like
>>> the following example:
>>>
>>> (John)-[ACTION]->(eat)-[SUBJECT]->(apple)
>>>
>>> Any suggestions would be appreciated.
>>>
>>> Thank you very much.
>>>
>>> On Wednesday, 10 October 2018 at 00:50:22 UTC+2, Michael Hunger wrote:
>>>>
>>>> Yes, I would create every word node only once, and then link the
>>>> sentence structures.
>>>> In general, just finding all the word nodes is probably not your
>>>> end goal, is it?
>>>>
>>>> It is best to ask on the Community Site & Forum
>>>> <https://community.neo4j.com> in the Modeling and Cypher categories.
>>>>
>>>>
>>>> On Tue, Oct 9, 2018 at 11:00 PM John Carlo <johncar...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I've been using Neo4j for some weeks and I think it's awesome.
>>>>>
>>>>> I'm building an NLP application, and basically, I'm using Neo4j for
>>>>> storing the dependency graph generated by a semantic parser, something
>>>>> like this:
>>>>>
>>>>> https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0
>>>>>
>>>>> In the nodes, I store the single words contained in the sentences, and
>>>>> I connect them through relations with a number of different types.
>>>>>
>>>>> For my application, I have the requirement to find all the nodes that
>>>>> contain a given word, so basically I have to search through all the
>>>>> nodes, finding those that contain the input word. Of course, I've
>>>>> already created an index on the word text field.
>>>>>
>>>>> I'm working on a very big dataset (by the way, the CSV importer is a
>>>>> great thing).
>>>>>
>>>>> On my laptop, the following query takes about 20 ms:
>>>>> MATCH (t:token) WHERE t.text="avoid" RETURN t.text
>>>>>
>>>>> Here are the details of the graph.db:
>>>>> 47,108,544 nodes
>>>>> 45,442,034 relationships
>>>>> 13.39 GiB db size
>>>>> Index created on token.text field
>>>>>
>>>>> PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text
>>>>>
>>>>> Operator        DB hits
>>>>> --------------  -------
>>>>> NodeIndexSeek   251,679
>>>>> Projection      251,678
>>>>> ProduceResults  251,678
>>>>>
>>>>> I wonder if I'm doing something wrong in indexing such a large number
>>>>> of nodes. At the moment, I create a new node for each word I encounter
>>>>> in the text, even if its text is the same as that of other nodes.
>>>>>
>>>>> Should I create a new node only when a new word is encountered,
>>>>> managing the sentence structures through relationships?
>>>>>
>>>>> Could you please help me with a suggestion or best practice to adopt
>>>>> for this specific case? I think that Neo4j is a great piece of software
>>>>> and I'd like to make the most out of it :-)
>>>>>
>>>>> Thank you very much
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Neo4j" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to neo4j+un...@googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
