Re: [Neo4j] Re: neo4j-import non-deterministically corrupts a few node ids

Zongheng Yang Sun, 13 Sep 2015 10:16:06 -0700

Michael, thanks for chiming in.

This turned out to be a mistake in my ETL process, which was using outdated
input. I'm using 2.2.2; are there any critical fixes in newer versions?


On Sat, Sep 12, 2015 at 3:24 PM Michael Hunger <
[email protected]> wrote:

> It means that you don't have ids which are huge, e.g. 100M or 5bn, while
> having only a few nodes. Otherwise the store file would grow to accommodate
> the huge record id.
>
> Which version are you on? AFAIK Mattias fixed an issue in that area.
>
> Michael
>
> On 12.09.2015 at 23:05, Zongheng Yang <[email protected]> wrote:
>
> I think I got hit by this issue again, on a different dataset.
>
> Mattias / Michael, could you clarify what "without large holes in the
> distribution" means precisely?
>
> My node csv has a line for each node, and line K (0-indexed) uniquely
> corresponds to the data of node K (0-indexed).  There are exactly as many
> lines as there are nodes in the graph, so it should satisfy this property.
>
> However, for the edge csv, does it have to satisfy any special property?
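
The property discussed here can be checked before running the import. The
following is only a minimal sketch, assuming the header line has already been
skipped, that the node id sits in the first column of the node csv, and that
the start and end ids sit in the first two columns of the edge csv. The
function names `check_import_csvs` and `id_density` are hypothetical helpers,
not part of neo4j-import.

```python
def check_import_csvs(node_rows, edge_rows):
    """Return a list of problems found in node rows and edge rows.

    node_rows: iterable of rows whose first field is the node id (assumption).
    edge_rows: iterable of rows whose first two fields are start and end ids
               (assumption).
    """
    problems = []
    node_ids = [int(row[0]) for row in node_rows]
    # Line K (0-indexed) should carry id K: unique, ordered, and without gaps.
    for line_no, node_id in enumerate(node_ids):
        if node_id != line_no:
            problems.append(
                f"node line {line_no}: expected id {line_no}, found {node_id}"
            )
    known = set(node_ids)
    # Every edge endpoint should refer to an id present in the node csv.
    for line_no, row in enumerate(edge_rows):
        for endpoint in (int(row[0]), int(row[1])):
            if endpoint not in known:
                problems.append(f"edge line {line_no}: unknown node id {endpoint}")
    return problems


def id_density(node_ids):
    """Fraction of the range [0, max id] that is in use; 1.0 means no holes."""
    ids = set(node_ids)
    return len(ids) / (max(ids) + 1) if ids else 1.0
```

Rows from `csv.reader` can be passed in directly. A density far below 1.0
would indicate the kind of large holes mentioned in this thread, and an empty
problem list only says that the two files agree with each other, not that the
files themselves are up to date.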
>
> On Tue, Jun 16, 2015 at 4:56 AM Mattias Persson <[email protected]>
> wrote:
>
>> Yes, I agree --id-type ACTUAL will guarantee this constraint.
>>
>>
>> On Monday, June 15, 2015 at 9:43:38 PM UTC+2, Zongheng Yang wrote:
>>>
>>> Fantastic, in my case the ids are exactly the sequence [0, 1, ..., N]
>>> without gaps, unique, and in that order.
>>>
>>> Thanks both of you for the help!
>>>
>>> On Monday, June 15, 2015 at 12:34:18 PM UTC-7, Michael Hunger wrote:
>>>>
>>>> No, but --id-type actual would. You then have to make sure you have
>>>> globally unique, incrementing ids without large holes in the
>>>> distribution.
>>>>
>>>>
>>>> On 15.06.2015 at 21:31, Zongheng Yang <[email protected]> wrote:
>>>>
>>>> I see.  Would setting the `--processors 1` flag for neo4j-import make
>>>> internal ids and external ids match in my case?  (I understand this is an
>>>> implementation detail and not a user-facing property.)
>>>>
>>>> On Monday, June 15, 2015 at 12:07:56 PM UTC-7, Michael Hunger wrote:
>>>>>
>>>>> GraphDatabaseService#getNodeById(long id)
>>>>>
>>>>>
>>>>> takes Neo4j internal ids.
>>>>>
>>>>> Michael
>>>>>
>>>>> On 15.06.2015 at 20:59, Zongheng Yang <[email protected]> wrote:
>>>>>
>>>>> Hi Mattias,
>>>>>
>>>>> Thanks for looking into this.  I understand the difference between
>>>>> Neo4j internal ids vs. the ids supplied in the csv.
>>>>>
>>>>> However, for, say, GraphDatabaseService#getNodeById(long id), does
>>>>> this function take the user-supplied ids or Neo4j's internal ids?
>>>>>
>>>>> If it is the former, then the conceptual mismatch doesn't fully
>>>>> explain the problem (e.g. I queried the nodes/edges using user-supplied
>>>>> ids, so the internal ids should not affect the query results).  If it
>>>>> is the latter, then how should users programming against the Java Core
>>>>> API get the correct internal ids, given that they only know the
>>>>> application-supplied ids?
>>>>>
>>>>> Best,
>>>>> Zongheng
>>>>>
>>>>> On Monday, June 15, 2015 at 5:23:24 AM UTC-7, Mattias Persson wrote:
>>>>>>
>>>>>> Hello again, I'm quite confident I know what's happening here. The
>>>>>> problem is the misconception that your INTEGER ids defined in the csv
>>>>>> files will map 1-to-1 to the neo4j node/relationship ids in the store.
>>>>>> They will actually match in most cases, but that's merely a coincidence.
>>>>>>
>>>>>> What you're seeing is the result of some parallelism happening in the
>>>>>> importer, where batches of 10k nodes/relationships flow through
>>>>>> different steps. Some steps may execute multiple batches in parallel
>>>>>> and don't care if reordering happens. Ids are assigned at the end.
>>>>>>
>>>>>> You're looking at the ids and see that they mismatch, but if you look
>>>>>> at their data you should see that all relationships match the csv files.
>>>>>> So please disregard the seemingly close match of neo4j node/relationship
>>>>>> ids with the csv input ids, as they are quite different in nature.
>>>>>>
>>>>>> On Thursday, June 11, 2015 at 11:32:55 AM UTC+2, Mattias Persson
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi, I'm one of the main authors of the import tool and I find this
>>>>>>> issue quite interesting.
>>>>>>>
>>>>>>> Would you be able to share your dataset with me personally, for the
>>>>>>> single purpose of trying to find the root cause?
>>>>>>>
>>>>>>> On Friday, June 5, 2015 at 5:12:43 AM UTC+2, Zongheng Yang wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm using neo4j-import to import nodes and relationships from csv
>>>>>>>> files. Let's say node id 538398 has about 100 edges and
>>>>>>>>
>>>>>>>> 538398 -> 370047
>>>>>>>> 538398 -> 379981
>>>>>>>>
>>>>>>>> are just two of them.  After the import, the neo4j database
>>>>>>>> actually
>>>>>>>>
>>>>>>>> - *loses* these two edges
>>>>>>>> - instead *corrupts* the destination ids, as follows
>>>>>>>>
>>>>>>>>     538398 -> 380047
>>>>>>>>     538398 -> 389981
>>>>>>>>
>>>>>>>> - *keeps* all other outgoing edges of 538398 correct
>>>>>>>>
>>>>>>>> The problem seems to be non-deterministic: doing a `rm -rf dbPath`
>>>>>>>> and re-running neo4j-import seems to fix the issue for this particular
>>>>>>>> node -- but I've not done extensive tests to see whether other nodes
>>>>>>>> get corrupted in this way.
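
A more extensive test could compare the full edge set of the csv against the
edge set read back from the database. The sketch below assumes both sides are
available as (start, end) pairs expressed in csv ids, not in Neo4j internal
ids. How the pairs are exported from the database is left out, and
`diff_edges` is a hypothetical helper.

```python
def diff_edges(expected, actual):
    """Compare two collections of (start, end) pairs keyed by csv ids."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),      # in the csv, not in the db
        "unexpected": sorted(actual - expected),   # in the db, not in the csv
    }


# The two edges from this report, as they appear in the csv and after import.
report = diff_edges(
    expected=[(538398, 370047), (538398, 379981)],
    actual=[(538398, 380047), (538398, 389981)],
)
```

An empty result on both sides for the whole graph would rule out this kind of
corruption. A non-empty result produced from internal ids instead of csv ids
would not be meaningful.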
>>>>>>>>
>>>>>>>> Has anyone seen this before? The graph has on the order of 1
>>>>>>>> million nodes, with an average degree of 40.
>>>>>>>>
>>>>>>>> Zongheng
>>>>>>>>
>>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Neo4j" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>>
>>>>>
>> --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "Neo4j" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/neo4j/5k0xY6B1vtA/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
>> [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>

