Re: [Neo4j] Duplicate Nodes

Alx Mon, 24 Feb 2014 11:47:38 -0800

Hi Kenny, 

Thank you for the post. It helped me with my duplicates removal. However 
when I try to run Step 2 I get the following error:


Invalid input 'C': expected whitespace, an unsigned integer, a parameter or '*' 
(line 9, column 14)
"START k=node(CommaSeparatedListOfIds)"


I replaced the last sentence of Step 1 with the following:

WITH COLLECT(DuplicateId) as CommaSeparatedListOfIds 

to connect to the Step 2 query.

Any ideas? Am I missing something? Thanks a lot!

On Wednesday, August 14, 2013 12:49:53 PM UTC-4, Kenny Bastani wrote:
>
> Hey all,
>
> The following 3 steps will allow you to using the admin console and Cypher 
> query to select and delete duplicate nodes by index and property.
>
> *STEP 1 (COLLECT duplicate nodes by ID):*
>
> START n=node:search("search:(\"TEXT MINING\")") // Cypher query for 
> collecting the ids of indexed nodes containing duplicate properties
> WITH n
> ORDER BY id(n) DESC  // Order by descending to delete the most recent 
> duplicated record
> WITH n.Key? as DuplicateKey, COUNT(n) as ColCount, COLLECT(id(n)) as 
> ColNode
> WITH DuplicateKey, ColCount, ColNode, HEAD(ColNode) as DuplicateId
> WHERE ColCount > 1 AND (DuplicateKey is not null) AND (DuplicateId is not 
> null)
> WITH DuplicateKey, ColCount, ColNode, DuplicateId 
> ORDER BY DuplicateId 
> RETURN DuplicateKey, ColCount, DuplicateId // Toggle comment on/off for 
> validating duplicate records before moving to next step (do not proceed to 
> delete without validating)
> //RETURN COLLECT(DuplicateId) as CommaSeparatedListOfIds
>
> *Example output - validation screenshot:*
>
>
> <https://lh6.googleusercontent.com/-fLaTfRuWS4M/UguyfiZs2LI/AAAAAAAAA2Y/bP9-dHFZtaM/s1600/validate-duplicates.PNG>
>
> *Example output - collected IDs that will be used in the next step:*
>
>
> <https://lh5.googleusercontent.com/-yAsElzgm64w/UguywmR7rpI/AAAAAAAAA2g/bK0dtatJR4o/s1600/comma-separated-list-id-duplicates.PNG>
>
> *STEP 2 (DELETE duplicate nodes and connected relationships by ID):*
>
> START n=node(
> 1120038,1120039,1120040,1120042,1120044,1120048,1120049,1120050,1120053,1120067,1120068
> ) // Replace highlighted ids with CommaSeparatedListOfIds from step 1
> MATCH n-[r]-() // Delete relationships for each duplicate before deleting 
> the node
> DELETE r, n
>
> *STEP 3:*
>
> Re-run the query from the first step to make sure that the duplicate nodes 
> were deleted
>
> Thanks,
>
> Kenny
>
> Follow me on Twitter: http://www.twitter.com/kennybastani
>
> On Monday, January 28, 2013 2:36:38 AM UTC-5, Michael Hunger wrote:
>>
>> There is no automatic way of doing this in Neo4j.
>> Usually that's a concern handled in application level code.
>>
>> You could for instance write a cypher statement that queries the 
>> relevant-sub-graph of the node, returns the identifying properties.
>>
>> Then hash the result rows and set them as a property on your 
>> aggregate-root node for future comparison. Might even be indexed.
>> (Instead of hashing they could also be concatinated in a sensible way 
>> (e.g. path/to/property:value, ...)
>>
>> When inserting new data you can check for that hash and if matched deep 
>> check the subgraph with a dedicated cypher query (can all be in one query).
>>
>> If the data/structure is not there, create it.
>>
>> Michael
>>
>> Am 28.01.2013 um 00:42 schrieb Rory Madden <[email protected]>:
>>
>> Unique indexes are great to prevent duplicates being created but is there 
>> a suggested approach for identifying duplicates across the graph. 
>>
>> E.g. with the movie graph if you don't implement unique indexes (because 
>> movies can have the same name) and two people create the same movie with 
>> the same actors is there a way to identify the duplication?
>>
>> Solr has MD5Signature, Lookup3Signature, TextProfileSignature algoritms 
>> for detecting duplicates but are there any examples of people using similar 
>> algorithms with Neo4j and the Lucene instance?
>>
>> Thanks,
>> Rory
>>
>> On Saturday, 21 January 2012 06:37:28 UTC-3, Peter Neubauer wrote:
>>>
>>> You mean creating entries only if they exist?
>>>
>>> That came in 1.6.M03, see
>>> http://docs.neo4j.org/chunked/snapshot/rest-api-unique-indexes.html
>>> and 
>>> http://components.neo4j.org/neo4j/1.6.M03/apidocs/org/neo4j/graphdb/index/Index.html#putIfAbsent%28T,%20java.lang.String,%20java.lang.Object%29
>>>
>>> Cheers,
>>>
>>> /peter neubauer
>>>
>>> Google:    neubauer.peter
>>> Skype:     peter.neubauer
>>> Phone:     +46 704 106975
>>> LinkedIn:  http://www.linkedin.com/in/neubauer
>>> Twitter:    @peterneubauer
>>> Tungle:    tungle.me/peterneubauer
>>>
>>> brew install neo4j && neo4j start
>>> heroku addons:add neo4j
>>>
>>> On Fri, Jan 20, 2012 at 11:33 PM, Timothy Braun <[email protected]> 
>>> wrote:
>>> > Hey Guys,
>>> >   Any suggestions for finding duplicate nodes based on a node index ie,
>>> > index on username property of a node?
>>> >
>>> >   Also, is there any way to implement unique node constraints?
>>> >
>>> >   Thanks,
>>> >   Tim
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Neo4j" group.
>> To unsubscribe from this group, send email to 
>> [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>>  
>>  
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

Re: [Neo4j] Duplicate Nodes

Reply via email to