I'd suggest searching Google for NLP database design. It's a fairly common problem, and someone has surely documented best practices, so you can learn from others who have traveled the same road.
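On the modeling question itself (one node per distinct word instead of one node per occurrence), here is a minimal plain-Python sketch of the idea, not Neo4j code. The function name `build_word_index` and the sample sentences are made up for illustration. Each distinct word is stored once and points to the places where it occurs, which is roughly what a unique word node with relationships to its occurrences would give you. Note that the dependency relations between tokens would still have to attach to the individual occurrences, and returning every occurrence of a frequent word still means touching all of them.

```python
from collections import defaultdict


def build_word_index(sentences):
    """Map each distinct word to the (sentence, position) pairs where it occurs."""
    index = defaultdict(list)
    for sent_id, tokens in enumerate(sentences):
        for pos, text in enumerate(tokens):
            # One entry per distinct word; each occurrence is only a reference.
            index[text].append((sent_id, pos))
    return dict(index)


# Hypothetical, already tokenized sentences.
sentences = [
    ["flip", "the", "switch"],
    ["switch", "the", "light", "off"],
]

index = build_word_index(sentences)
token_count = sum(len(tokens) for tokens in sentences)

print(len(index), "distinct words for", token_count, "tokens")
print(index["switch"])
```

Looking up a word is then a single key lookup followed by a walk over its occurrences, rather than an index seek that returns one separate node per occurrence.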

On Tue, Oct 9, 2018 at 2:00 PM John Carlo <johncarlof1...@gmail.com> wrote:

> Hello all,
>
> I've been using Neo4j for some weeks and I think it's awesome.
>
> I'm building an NLP application, and basically, I'm using Neo4j for
> storing the dependency graph generated by a semantic parser, something like
> this:
>
> https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0
>
> In the nodes, I store the text of the single words contained in the
> sentences, and I connect them through relations with a number of different
> types, depending on the semantic function (compound words, subject, adverb,
> and so forth)
>
> For my application, I have the requirement to find all the nodes that
> contain a given word, so basically I have to scan through all the nodes,
> finding those that contain the input word. Of course, I've already created
> an index on the word text field.
>
> I'm working now on a very big dataset (by the way, the CSV importer is a
> great thing).
>
> On my laptop, the following query takes 22717 ms to execute
>
> MATCH (t:token)
> WHERE t.text="switch"
> RETURN t.text
>
> Here are the details of the graph.db:
>
> 47.108.544 nodes
> 45.442.034 edges
> 12.81 GiB size
>
> Index created on token.text property
>
> Here are EXPLAIN and PROFILE results:
>
> EXPLAIN MATCH (t:token) WHERE t.text="switch" RETURN t.text
> ---------------------
> NodeIndexSeek
> 321 estimated rows
> ---------------------
> Projection
> 321 estimated rows
> ---------------------
> ProduceResults
> 321 estimated rows
> ---------------------
>
> PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text
> ----------------
> NodeIndexSeek
> 251,680 db hits
> ----------------
> Projection
> 251,680 db hits
> ----------------
> ProduceResults
> 251,680 db hits
> ----------------
>
> I wonder if I'm doing something wrong in indexing such amount of nodes. At
> the moment I create a node for each word I encounter in the dataset, even
> if the text is the same in two or more different nodes.
>
> Maybe should I create a new node only when a new word is encountered,
> managing the sentence structures through relationships?
>
> Could you please help me with a suggestion or best practice to adopt for
> this specific case? I think that Neo4j is a great piece of software and I'd
> like to make the most out of it :-)
>
> Thank you very much
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to neo4j+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.