Re: [Neo4j] Re: neo4j-import non-deterministically corrupts a few node ids

Zongheng Yang Sat, 12 Sep 2015 14:06:07 -0700

I think I got hit by this issue again, on a different dataset.

Mattias / Michael, could you clarify what "without large holes in the
distribution" means precisely?


My node csv has one line per node, and line K (0-indexed) uniquely
corresponds to the data of node K (0-indexed).  There are exactly as many
lines as there are nodes in the graph, so it should satisfy this
property.
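
A minimal sketch of how this property could be checked before importing.
It assumes the id is the first column and the file has no header row (both
are assumptions about the layout), and `check_dense_ids` is a hypothetical
helper, not part of neo4j-import.  It only tests the strict "line K holds
id K" property, not whatever "large holes" turns out to mean:

```python
import csv
import io


def check_dense_ids(csv_text, id_column=0, has_header=False):
    """Check that row K (0-indexed) carries the integer id K.

    Returns (row_count, mismatches), where mismatches is a list of
    (line_index, id_found) pairs for rows whose id differs from their
    position.  Blank lines are ignored.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row if there is one
    rows = [row for row in reader if row]
    mismatches = [
        (k, int(row[id_column]))
        for k, row in enumerate(rows)
        if int(row[id_column]) != k
    ]
    return len(rows), mismatches


# Hypothetical sample data: ids in order, then two ids swapped.
in_order = "0,alice\n1,bob\n2,carol\n"
swapped = "0,alice\n2,bob\n1,carol\n"
print(check_dense_ids(in_order))
print(check_dense_ids(swapped))
```

An empty mismatch list together with a row count equal to the number of
nodes means the ids are exactly 0..N-1, unique, and in order.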

However, for the edge csv, does it have to satisfy any special property?

On Tue, Jun 16, 2015 at 4:56 AM Mattias Persson <[email protected]>
wrote:

> Yes, I agree --id-type ACTUAL will guarantee this constraint.
>
>
> On Monday, June 15, 2015 at 9:43:38 PM UTC+2, Zongheng Yang wrote:
>>
>> Fantastic, in my case the ids are exactly the sequence [0, 1, ..., N]
>> without gaps, unique, and in that order.
>>
>> Thanks both of you for the help!
>>
>> On Monday, June 15, 2015 at 12:34:18 PM UTC-7, Michael Hunger wrote:
>>>
>>> No, --id-type actual would, but then you have to make sure to have
>>> globally unique, incrementing ids without large holes in the
>>> distribution.
>>>
>>>
>>> Am 15.06.2015 um 21:31 schrieb Zongheng Yang <[email protected]>:
>>>
>>> I see.  Would setting the `--processors 1` flag for neo4j-import make
>>> internal ids and external ids match in my case?  (I understand this is an
>>> implementation detail and not a user-facing property.)
>>>
>>> On Monday, June 15, 2015 at 12:07:56 PM UTC-7, Michael Hunger wrote:
>>>>
>>>> GraphDatabaseService#getNodeById(long id)
>>>>
>>>>
>>>> takes Neo4j internal ids.
>>>>
>>>> Michael
>>>>
>>>> Am 15.06.2015 um 20:59 schrieb Zongheng Yang <[email protected]>:
>>>>
>>>> Hi Mattias,
>>>>
>>>> Thanks for looking into this.  I understand the difference between
>>>> Neo4j internal ids vs. the ids supplied in the csv.
>>>>
>>>> However, for say GraphDatabaseService#getNodeById(long id), does this
>>>> function take the user-supplied ids or Neo4j's internal ids?
>>>>
>>>> If it is the former, then the conceptual mismatch doesn't fully explain
>>>> the problem (e.g. I queried the nodes/edges using user-supplied ids, and
>>>> the internal ids should not mess up the query results).  If it is the
>>>> latter, then for users programming against the Java Core API, how should
>>>> they get the correct internal ids (they only know the
>>>> application-supplied ids)?
>>>>
>>>> Best,
>>>> Zongheng
>>>>
>>>> On Monday, June 15, 2015 at 5:23:24 AM UTC-7, Mattias Persson wrote:
>>>>>
>>>>> Hello again, I'm quite confident I know what's happening here. The
>>>>> problem is the misconception that your INTEGER ids defined in the csv files
>>>>> will map 1-to-1 to the neo4j node/relationship ids in the store. They will
>>>>> actually match in most cases, but that's merely a coincidence.
>>>>>
>>>>> What you're seeing is the result of some parallelism happening in the
>>>>> importer, where batches of 10k nodes/relationships flow through different
>>>>> steps; some steps may execute multiple batches in parallel and
>>>>> don't care if reordering happens. Ids are assigned at the end.
>>>>>
>>>>> You're looking at the ids and see that they mismatch, but if you look
>>>>> at their data you should see that all relationships match the csv files. So
>>>>> please disregard the seemingly close match of neo4j node/relationship ids
>>>>> with the csv input ids as they are quite different in nature.
>>>>>
>>>>> On Thursday, June 11, 2015 at 11:32:55 AM UTC+2, Mattias Persson wrote:
>>>>>>
>>>>>> Hi, I'm one of the main authors of the import tool and I find this
>>>>>> issue quite interesting.
>>>>>>
>>>>>> Would you be able to share your dataset with me personally, for the
>>>>>> single purpose of trying to find the root cause?
>>>>>>
>>>>>> On Friday, June 5, 2015 at 5:12:43 AM UTC+2, Zongheng Yang wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm using neo4j-import to import nodes and relationships from csv
>>>>>>> files. Let's say node id 538398 has about 100 edges and
>>>>>>>
>>>>>>> 538398 -> 370047
>>>>>>> 538398 -> 379981
>>>>>>>
>>>>>>> are just two of them.  After the import, the neo4j database actually
>>>>>>>
>>>>>>> - *loses* these two edges
>>>>>>> - instead *corrupts* the destination ids, as follows
>>>>>>>
>>>>>>>     538398 -> 380047
>>>>>>>     538398 -> 389981
>>>>>>>
>>>>>>> - *keeps* all other outgoing edges of 538398 correct
>>>>>>>
>>>>>>> The problem seems to be non-deterministic: doing a `rm -rf dbPath`
>>>>>>> and re-running neo4j-import seems to fix the issue, for this particular
>>>>>>> node -- but I've not done extensive tests to see whether other nodes get
>>>>>>> corrupted in this way.
>>>>>>>
>>>>>>> Has anyone seen this before? The graph has on the order of 1 million
>>>>>>> nodes, with average degree 40.
>>>>>>>
>>>>>>> Zongheng
>>>>>>>
>>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Neo4j" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
