[Neo4j] Re: neo4j-import non-deterministically corrupts a few node ids

Zongheng Yang Mon, 15 Jun 2015 12:00:22 -0700

Hi Mattias,

Thanks for looking into this.  I understand the difference between Neo4j 
internal ids vs. the ids supplied in the csv.


However, take GraphDatabaseService#getNodeById(long id), for example: does 
this method take the user-supplied ids or Neo4j's internal ids?

If it is the former, then the conceptual mismatch doesn't fully explain the 
problem (I queried the nodes/edges using user-supplied ids, so the internal 
ids should not affect the query results).  If it is the latter, then how 
should users of the Java Core API obtain the correct internal ids, given 
that they only know the application-supplied ids?

Best,
Zongheng

On Monday, June 15, 2015 at 5:23:24 AM UTC-7, Mattias Persson wrote:
>
> Hello again, I'm quite confident I know what's happening here. The problem 
> is the misconception that your INTEGER ids defined in the csv files will 
> map 1-to-1 to the neo4j node/relationship ids in the store. They will 
> actually match in most cases, but that's merely a coincidence.
>
> What you're seeing is the result of some parallelism happening in the 
> importer, where batches of 10k nodes/relationships flow through different 
> steps. Some steps may execute multiple batches in parallel and don't care 
> if reordering happens. Ids are assigned at the end.
>
> You're looking at the ids and see that they mismatch, but if you look at 
> their data you should see that all relationships match the csv files. So 
> please disregard the seemingly close match of neo4j node/relationship ids 
> with the csv input ids as they are quite different in nature.
>
> On Thursday, June 11, 2015 at 11:32:55 AM UTC+2, Mattias Persson wrote:
>>
>> Hi, I'm one of the main authors of the import tool and I find this issue 
>> quite interesting.
>>
>> Would you be able to share your dataset with me personally, for the 
>> single purpose of trying to find the root cause?
>>
>> On Friday, June 5, 2015 at 5:12:43 AM UTC+2, Zongheng Yang wrote:
>>>
>>> Hi all,
>>>
>>> I'm using neo4j-import to import nodes and relationships from csv files. 
>>> Let's say node id 538398 has about 100 edges and
>>>
>>> 538398 -> 370047
>>> 538398 -> 379981
>>>
>>> are just two of them.  After the import, the neo4j database actually 
>>>
>>> - *loses* these two edges
>>> - instead *corrupts* the destination ids, as follows
>>>
>>>     538398 -> 380047
>>>     538398 -> 389981
>>>
>>> - *keeps* all other outgoing edges of 538398 correct
>>>
>>> The problem seems to be non-deterministic: doing a `rm -rf dbPath` and 
>>> re-running neo4j-import seems to fix the issue, for this particular node -- 
>>> but I've not done extensive tests to see whether other nodes get corrupted 
>>> in this way.
>>>
>>> Has anyone seen this before? The graph has on the order of 1 million 
>>> nodes, with an average degree of 40.
>>>
>>> Zongheng
>>>
>>
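
The behaviour Mattias describes above can be illustrated with a toy model. 
This is only a sketch under stated assumptions, not the importer's actual 
code: `simulate_import`, `edges_as_csv_ids`, the tiny `BATCH_SIZE`, and the 
use of a shuffle to stand in for parallel reordering are all hypothetical, 
and the node ids 400000 and 400001 are made up to pad the example. It shows 
that internal ids can differ from the csv ids from run to run, while the 
relationships, read back through the stored csv ids, still match the input. 
As far as I know, getNodeById takes the internal id, so a csv id would have 
to be looked up as data rather than passed to it directly.

```python
import random

BATCH_SIZE = 3  # tiny for illustration; the thread mentions batches of 10k


def simulate_import(csv_node_ids, csv_edges, seed):
    """Toy model: batches may arrive out of order; internal ids come last."""
    rng = random.Random(seed)
    batches = [csv_node_ids[i:i + BATCH_SIZE]
               for i in range(0, len(csv_node_ids), BATCH_SIZE)]
    rng.shuffle(batches)  # stand-in for parallel steps reordering batches

    # Internal ids are handed out in arrival order, not in csv order.
    csv_to_internal = {}
    for batch in batches:
        for csv_id in batch:
            csv_to_internal[csv_id] = len(csv_to_internal)

    # Each node keeps its csv id as ordinary data, keyed by internal id.
    internal_to_csv = {i: c for c, i in csv_to_internal.items()}

    # Relationships are stored between internal ids.
    stored_edges = [(csv_to_internal[s], csv_to_internal[d])
                    for s, d in csv_edges]
    return csv_to_internal, internal_to_csv, stored_edges


def edges_as_csv_ids(internal_to_csv, stored_edges):
    """Translate stored relationships back into csv ids for comparison."""
    return sorted((internal_to_csv[s], internal_to_csv[d])
                  for s, d in stored_edges)


csv_nodes = [538398, 370047, 379981, 380047, 389981, 400000, 400001]
csv_edges = [(538398, 370047), (538398, 379981)]

for seed in (1, 2):
    to_internal, to_csv, stored = simulate_import(csv_nodes, csv_edges, seed)
    print("internal id of 538398:", to_internal[538398])
    print("edges by csv id:", edges_as_csv_ids(to_csv, stored))
```

In this model, comparing internal ids against the csv files can look like 
corruption, whereas comparing the csv ids stored on the nodes does not.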
