Re: [Neo4j] Re: Approach for using labels during batch import

Mahesh Lal Sat, 17 Oct 2015 05:50:46 -0700

Hi,

Started following this thread a bit late, but the last bit about
denormalization caught my interest.


Denormalization might be the point in document stores and possibly
key-value stores, but in stores that allow related data to be stored and
retrieved together (seamlessly in graphs, less so in an RDBMS), what
Michael Hunger suggested makes more sense.

The idea is to have a node, let's say
(x:Songwriter)-[:LIVES_IN]->(:State{name:"Louisiana"}). The beauty of this
is that the node can have multiple labels, like *:Individual*. This approach,
however, might not be useful when you want to find all the roles an
(x:Individual) has. If you foresee such queries, a better way to model
it would be (x:Individual)-[:IS_A]->(:Role{name:"Songwriter"}) and
(x:Individual)-[:LIVES_IN]->(c:City)-[:LOCATED_IN]->(:State{name:"Louisiana"}).
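
To make the difference concrete, here is a minimal in-memory sketch in plain
Java (no Neo4j API involved). The class RoleGraph and its methods are made up
for illustration. With roles modelled as nodes, "which roles does x have?" is
a single hop along IS_A, and "who is a Songwriter?" is the same hop followed
in reverse.

```java
import java.util.*;

// Hypothetical in-memory stand-in for the (x:Individual)-[:IS_A]->(:Role) model.
// Plain collections only; this is not the Neo4j API.
class RoleGraph {
    // individual name -> names of the Role nodes it points to via IS_A
    private final Map<String, Set<String>> isA = new HashMap<>();

    void addIsA(String individual, String role) {
        isA.computeIfAbsent(individual, k -> new TreeSet<>()).add(role);
    }

    // "What roles does x have?" is one hop along outgoing IS_A edges.
    Set<String> rolesOf(String individual) {
        return isA.getOrDefault(individual, Collections.emptySet());
    }

    // "Who is a Songwriter?" is the same hop, followed in reverse.
    Set<String> individualsWithRole(String role) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : isA.entrySet()) {
            if (e.getValue().contains(role)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RoleGraph g = new RoleGraph();
        g.addIsA("x", "Songwriter");
        g.addIsA("x", "Singer");
        g.addIsA("y", "Songwriter");
        System.out.println(g.rolesOf("x"));
        System.out.println(g.individualsWithRole("Songwriter"));
    }
}
```

With a label per role, the first question would instead mean inspecting the
node's label set and knowing which of those labels count as roles.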

Though I have never worked with RDF stores, and have a very limited
understanding of them from discussions with colleagues, I'm assuming the
limitation of RDF stores of having one "Type" makes the ontology deep.
Also, in suggesting the above, I'm assuming there is some way, in your use
case, to break the types into Labels, Nodes and Relationships.

Cheers!
Mahesh Lal

-- Thanks and Regards
   Mahesh Lal


On 17 October 2015 at 16:20, Michael B. <[email protected]> wrote:

> Yago has a ridiculously deep taxonomy. Most ontologies have several
> thousand classes, though, because of the nature of any RDF store out there.
> Traversal and property queries (in SPARQL) are complicated and very slow
> because lots of things are post-filtered (collect nodes first, filter by
> property later). Querying by class/type and relationships, on the other
> hand, is strongly optimized and very fast. That's why most ontologies have
> lots of classes (they are multi-classing).
>
> Aside from that: isn't denormalization the main point of NoSQL stores?
> Although stuff like this shouldn't exist in a proper triple store; just
> found it in a yago sample data set and found it funny...
>
> On Sat, 17 Oct 2015 at 11:11, Michael Hunger <[email protected]> wrote:
>
>> This looks scarily like denormalization:
>>
>> wikicat_Songwriters_from_Louisiana
>>
>> Shouldn't that be 3 nodes linked to it rather than a type node?
>>
>> Sent from my iPhone
>>
>> On 17 Oct 2015 at 11:04, Michael B. <[email protected]> wrote:
>>
>> Yago has roughly 350,000 different classes, 10 million entities and 120
>> million facts (which would be either relationships or properties).
>>
>> As mentioned previously, I'd rather go with few labels and model entity
>> types as their own nodes (which is the case in RDF). You could query for it
>> with something like this:
>> MATCH (x:Individual)-[t:is_a]->(c:Class {type: "wikicat_Songwriters_from_Louisiana"})
>> RETURN x
>>
>> On 17 October 2015 at 10:13, Michael Hunger <
>> [email protected]> wrote:
>>
>>> How many different types?
>>>
>>> Sent from my iPhone
>>>
>>> On 17 Oct 2015 at 06:38, Qi Song <[email protected]> wrote:
>>>
>>> Each instance in Yago has a type, and there are millions of instances.
>>>
>>> On Fri, Oct 16, 2015 at 3:26 PM, Michael Hunger <
>>> [email protected]> wrote:
>>>
>>>> Labels are roles or tags on nodes.
>>>>
>>>> Which can be used to represent types as well.
>>>>
>>>> That you can attach metadata like indexes is just a benefit.
>>>>
>>>> The is-a relationships might be fine on a theoretical model, but will
>>>> not perform that well if you have many millions or billions of them and
>>>> query across them.
>>>>
>>>> How many types are there in yago?
>>>>
>>>> Michael
>>>>
>>>> On 16 Oct 2015 at 23:40, Michael Bach <[email protected]> wrote:
>>>>
>>>> I did a couple of experiments today. For what it's worth: the labels are
>>>> a means to index different document sets, since property indexes are built
>>>> on a per-label basis. I wouldn't try to introduce a label for each class in
>>>> yago. As mentioned before, I'd rather model is-a relationships with
>>>> nodes than with labels.
>>>>
>>>> Is there a particular reason why you're trying your luck with neo4j
>>>> instead of virtuoso or jena?
>>>>
>>>> Sent from my iPad
>>>>
>>>> On 15 Oct 2015 at 23:12, Qi Song <[email protected]> wrote:
>>>>
>>>> Hi Michael,
>>>> Thanks for your reply :) I noticed that the code is old and uses some
>>>> old APIs. However, the labels are a bottleneck when loading RDF files. In
>>>> my work, the labels are very important. I'll try to find some way to
>>>> handle labels more effectively.
>>>>
>>>> Bests~
>>>> Qi Song
>>>>
>>>> On Thursday, October 15, 2015 at 2:07:08 PM UTC-7, Michael B. wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> My best guess would be that the algorithm neo4j uses just can't
>>>>> cope with the vast number of labels this sort of use case would produce.
>>>>> Anyhow, the code is very, very old...
>>>>> The better approach to this would be to actually model RDF-like
>>>>> relationships with nodes and introduce only a few labels for class,
>>>>> individual, maybe a couple data types.
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On 15 Oct 2015 at 11:00, Qi Song <[email protected]> wrote:
>>>>>
>>>>> Hello Michael,
>>>>> I am trying to use your TurtleLoader to import Yago
>>>>> (https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/)
>>>>> into neo4j, but I ran into a weird problem during the import. I can import
>>>>> YagoFacts.ttl and YagoTypes.ttl fine separately, but when I try to import
>>>>> both of them I get this error. I'm not sure what the reason is. Is there
>>>>> some limit in TurtleLoader or BatchImporter?
>>>>>
>>>>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>>>>     at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
>>>>> Caused by: java.lang.RuntimeException: Panic called, so exiting
>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:200)
>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:78)
>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)
>>>>> Caused by: java.lang.IllegalArgumentException
>>>>>     at sun.misc.Unsafe.allocateMemory(Native Method)
>>>>>     at org.neo4j.unsafe.impl.internal.dragons.UnsafeUtil.malloc(UnsafeUtil.java:324)
>>>>>     at org.neo4j.unsafe.impl.batchimport.cache.OffHeapNumberArray.<init>(OffHeapNumberArray.java:41)
>>>>>     at org.neo4j.unsafe.impl.batchimport.cache.OffHeapLongArray.<init>(OffHeapLongArray.java:34)
>>>>>     at org.neo4j.unsafe.impl.batchimport.cache.NumberArrayFactory$2.newLongArray(NumberArrayFactory.java:122)
>>>>>     at org.neo4j.unsafe.impl.batchimport.cache.NumberArrayFactory$Auto.newLongArray(NumberArrayFactory.java:154)
>>>>>     at org.neo4j.unsafe.impl.batchimport.RelationshipCountsProcessor.<init>(RelationshipCountsProcessor.java:60)
>>>>>     at org.neo4j.unsafe.impl.batchimport.ProcessRelationshipCountsDataStep.processor(ProcessRelationshipCountsDataStep.java:73)
>>>>>     at org.neo4j.unsafe.impl.batchimport.ProcessRelationshipCountsDataStep.process(ProcessRelationshipCountsDataStep.java:60)
>>>>>     at org.neo4j.unsafe.impl.batchimport.ProcessRelationshipCountsDataStep.process(ProcessRelationshipCountsDataStep.java:36)
>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:120)
>>>>>     at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:102)
>>>>>     at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:237)
>>>>>
>>>>> Bests~
>>>>> Qi Song
>>>>>
>>>>> On Friday, June 7, 2013 at 1:35:26 AM UTC-7, Michael B. wrote:
>>>>>>
>>>>>> I checked that out in my batch importer (have a look at it on GitHub).
>>>>>>
>>>>>> MapDB performs pretty well, but in the end, the index look-ups aren't
>>>>>> the big bottleneck. If you need to do normal index operations at any
>>>>>> point (to make sure you're not importing duplicates) or iterate over
>>>>>> relationships of nodes to create unique relationships, everything
>>>>>> becomes way slower.
>>>>>>
>>>>>> As far as batch imports go, I think an in-memory MapDB is the best
>>>>>> option. You might want to include some kind of function to create an
>>>>>> in-memory index on specific labels/keys to allow for fast access to
>>>>>> whatever's desired for batch loads.
>>>>>>
>>>>>> Here's what I did for batch loads:
>>>>>>
>>>>>> https://github.com/mybyte/tools/blob/master/Turtle%20loader/src/de/miba/neo4j/loader/turtle/Neo4jMapDBBatchHandler.java
>>>>>>
>>>>>> The import went fine, pretty fast I'd say. The bigger problem is
>>>>>> overall performance on all the node operations...
>>>>>>
>>>>>> On Friday, 7 June 2013 10:26:47, Michael Hunger wrote:
>>>>>> > Actually, I want to update the CSV batch inserter to support index
>>>>>> > lookups and use real "CSV". That means I'll put MapDB in there; we'll
>>>>>> > see how it goes.
>>>>>> >
>>>>>> > You can also see if just a standard HashMap is good enough for you,
>>>>>> > or a Trove primitive map. Otherwise there is still that trick with
>>>>>> > the array of unique values, which you can sort and then use the array
>>>>>> > index as node id: inserter.createNode(index, props). The id lookup
>>>>>> > for rels is then just Arrays.binarySearch(array, value).
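
The sorted-array trick described above could look roughly like this. It is a
minimal sketch in plain Java: the class name SortedIdLookup and its methods
are made up for illustration, string keys are an assumption, and the actual
createNode call from the message is only hinted at in a comment.

```java
import java.util.Arrays;

// Sketch of the sorted-array trick: the position of an external key in a
// sorted array doubles as the node id, so no separate index is needed.
class SortedIdLookup {
    private final String[] keys;

    SortedIdLookup(String[] externalKeys) {
        // De-duplicate and sort so every key maps to exactly one position.
        this.keys = Arrays.stream(externalKeys).distinct().sorted().toArray(String[]::new);
    }

    int size() {
        return keys.length;
    }

    // Node id for a key, or -1 if the key was never registered.
    long nodeIdFor(String key) {
        int pos = Arrays.binarySearch(keys, key);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        SortedIdLookup lookup =
                new SortedIdLookup(new String[] {"carol", "alice", "bob", "alice"});
        // Nodes would be created with ids 0..size()-1, as in
        // inserter.createNode(index, props) from the message above, and
        // relationship endpoints would then be resolved like this:
        System.out.println(lookup.nodeIdFor("alice") + " -> " + lookup.nodeIdFor("carol"));
    }
}
```

The lookup costs O(log n) per key and needs no memory beyond the key array
itself, which is the appeal over a HashMap for very large imports.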
>>>>>> >
>>>>>> > I also have to update the batch-importer to 2.0, but that's a bigger
>>>>>> > piece of work, as lots of the internals changed in between.
>>>>>> >
>>>>>> > Michael
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Jun 7, 2013 at 10:10 AM, Michael B. <[email protected]> wrote:
>>>>>> >
>>>>>> >     Michael Hunger has actually written a blog entry on this. Check
>>>>>> >     his blog out: http://jexp.de/blog/
>>>>>> >
>>>>>> >     Standard Lucene performs poorly in many cases. The only thing
>>>>>> >     it's good at is full-text search with n-grams. If you don't need
>>>>>> >     that, any key-value store performs better, e.g. MapDB or Voldemort.
>>>>>> >
>>>>>> >
>>>>>> >     On Friday, 7 June 2013 07:41:34, Jennifer Smith wrote:
>>>>>> >
>>>>>> >         Hi Michael,
>>>>>> >
>>>>>> >         Yes, I was considering using MapDB. We actually do use the
>>>>>> >         standard Lucene indexes during our existing 1.9x batch
>>>>>> >         insertion. We also do a pre-existing data check when inserting
>>>>>> >         nodes and entities that uses the index. So far it's been fast
>>>>>> >         enough - by that I mean taking 2/3 hours for about 50 million
>>>>>> >         nodes, 90 million relationships! But when we need more
>>>>>> >         performance, I am happy to explore MapDB as an option at
>>>>>> >         import time. I would also probably be interested in using this
>>>>>> >         as a permanent index too, rather than just at import time.
>>>>>> >
>>>>>> >         Thanks
>>>>>> >
>>>>>> >         Jen
>>>>>> >
>>>>>> >         On Tuesday, 4 June 2013 14:31:59 UTC+1, Michael B. wrote:
>>>>>> >
>>>>>> >             Check out my blog entry on batch imports:
>>>>>> >             http://michaelbloggs.blogspot.com/2013/05/importing-ttl-turtle-ontologies-in-neo4j.html
>>>>>> >
>>>>>> >             Labels are a bit complicated. You shouldn't /commit/ to
>>>>>> >             indices during batch imports (but you can add stuff to
>>>>>> >             them) - they'll make everything incredibly slow. Michael
>>>>>> >             Hunger suggested using MapDB as a temporary index. That's
>>>>>> >             what I'd do in your place. Either do it like I did (for
>>>>>> >             small data sets a HashMap is more than enough) and use a
>>>>>> >             java.util.Map implementation, with the index as a fallback
>>>>>> >             for the nodes that are in the DB but haven't been imported
>>>>>> >             by your application, or use a MapDB instead.
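
A map-plus-fallback index of the kind described above might be sketched as
follows. This is plain Java under stated assumptions: TempNodeIndex is a
made-up name, and the fallback function is a hypothetical hook standing in for
whatever real index lookup covers nodes that were already in the database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a temporary import-time index: an in-memory map for nodes created
// in this run, plus a slower fallback for nodes that already exist in the DB.
class TempNodeIndex {
    private final Map<String, Long> cache = new HashMap<>();
    // Hypothetical hook, e.g. a query against the real index; returns null if unknown.
    private final Function<String, Long> fallback;

    TempNodeIndex(Function<String, Long> fallback) {
        this.fallback = fallback;
    }

    // Remember a node created during this import.
    void put(String externalId, long nodeId) {
        cache.put(externalId, nodeId);
    }

    // Returns the node id, or null if neither the map nor the fallback knows the key.
    Long get(String externalId) {
        Long id = cache.get(externalId);
        if (id == null) {
            id = fallback.apply(externalId);
            if (id != null) {
                cache.put(externalId, id); // avoid hitting the slow path twice
            }
        }
        return id;
    }

    public static void main(String[] args) {
        Map<String, Long> existing = new HashMap<>();
        existing.put("old-1", 7L);
        TempNodeIndex index = new TempNodeIndex(existing::get);
        index.put("new-1", 42L);
        System.out.println(index.get("new-1") + " " + index.get("old-1") + " " + index.get("missing"));
    }
}
```

Swapping the HashMap for a MapDB-backed map would keep the same shape, since
only the Map implementation changes.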
>>>>>> >
>>>>>> >             Regards,
>>>>>> >             Michael
>>>>>> >
>>>>>> >             On Tuesday, 4 June 2013 11:47:25 UTC+2, Jennifer Smith wrote:
>>>>>> >
>>>>>> >                 Hi there,
>>>>>> >
>>>>>> >                 I have been looking at the docs for 2.0, particularly
>>>>>> >                 around support for labels during batch import.
>>>>>> >
>>>>>> >                 I see there is support for adding labels to nodes
>>>>>> >                 during batch import, directly querying labels for
>>>>>> >                 nodes and so on. However, unless I am missing
>>>>>> >                 something, I don't see that there is support for
>>>>>> >                 locating a node by label and ID. I have found I have
>>>>>> >                 needed to do this when I import a large dataset where
>>>>>> >                 the relationships come separately from the nodes (say
>>>>>> >                 a dump from a relational database) and I need to use
>>>>>> >                 an external ID to find the nodes for the relationship.
>>>>>> >
>>>>>> >                 I wondered what the intended approach for looking up
>>>>>> >                 a node by label and ID is during batch import. I can
>>>>>> >                 see the following choices:
>>>>>> >
>>>>>> >                 - Use the standard EmbeddedGraphDatabase (making sure
>>>>>> >                 to have shut down the batch inserter of course) to
>>>>>> >                 look up the nodes for a bunch of relationship inserts
>>>>>> >                 before going into insert mode.
>>>>>> >                 - Use the BatchInserterIndexProvider to somehow hack
>>>>>> >                 into the underlying index that I believe is created
>>>>>> >                 for labels
>>>>>> >                 - Be patient and wait for support to appear in the
>>>>>> >                 batch API for querying nodes by label and ID :)
>>>>>> >
>>>>>> >                 Thanks
>>>>>> >
>>>>>> >                 Jen
>>>>>> >
>>>>>> >         --
>>>>>> >         You received this message because you are subscribed to a
>>>>>> >         topic in the Google Groups "Neo4j" group.
>>>>>> >         To unsubscribe from this topic, visit
>>>>>> >         https://groups.google.com/d/topic/neo4j/eq_2fD2BlQU/unsubscribe?hl=en.
>>>>>> >         To unsubscribe from this group and all its topics, send an
>>>>>> >         email to neo4j+unsubscribe@googlegroups.com.
>>>>>> >         For more options, visit
>>>>>> >         https://groups.google.com/groups/opt_out.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >     --
>>>>>> >     You received this message because you are subscribed to the
>>>>>> >     Google Groups "Neo4j" group.
>>>>>> >     To unsubscribe from this group and stop receiving emails from
>>>>>> >     it, send an email to neo4j+unsubscribe@googlegroups.com.
>>>>>> >     For more options, visit https://groups.google.com/groups/opt_out.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Qi Song
>>> Machine learning and Knowledge Discovery Group
>>> EECS Washington State University
>>>
>>>
>>
>>
>

