RE: [MarkLogic Dev General] De-Duping Data

Danny Sokolsky Fri, 30 Oct 2009 09:18:25 -0700

Hi Tony,

Here are a few things for you to think about.



1)      What are your criteria for the name/addr being "the same"?  For example, 
do they need to be codepoint-for-codepoint the same, or is it enough for them 
to both match the same search?

2)      If name/addr are a key, can you put them in the same field?  That might 
make it easier.

3)      You say thousands of sessions. Is that a maximum, or do you expect 
it to grow a lot?  A query that returns and processes 1,000 documents 
is a bit different from a query that returns and processes 99,000 documents.  
Also, can you size the system accordingly?

It seems to me you can do this a brute-force way by first finding all the 
documents.  Here is one strategy I can think of (it will work better if 
name/addr is in a single element):


1)      Put a string range index on name (or name/addr if a single element).

2)      Use cts:element-values to get all the unique names (or name/addr).

3)      Get the frequency of each result (cts:frequency).  If I understand your 
requirement, you can throw out any results that have 3 or fewer counts.

4)      Take the list of names and use them as phrases, and create a 
cts:or-query of all the phrases.

5)      Pass the cts:or-query into cts:search.  This will return all of the 
sessions whose name (or name/addr) occurs 4 or more times.

6)      If you have addr separate, you would need to similarly iterate over 
those, and take the results and add them to your or-query.

7)      Now that you have all the candidate results, process the dates and 
figure out the ones to keep based on your criteria.
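
The steps above can be sketched roughly as follows. This is only an 
illustration under several assumptions that are not in the original question: 
a string range index already exists on <name> (no namespace), <date> holds an 
ISO date such as 2009-10-30, and "keep 3" means keeping the 3 newest sessions 
per name/addr/month. It uses cts:element-value-query rather than phrase 
queries so that each name matches as a whole element value.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes a string range index on <name> and ISO dates
   (YYYY-MM-DD) in <date>; neither is stated in the original post. :)

(: Steps 2-3: unique names that occur in more than 3 fragments. :)
let $names :=
  for $n in cts:element-values(xs:QName("name"))
  where cts:frequency($n) gt 3
  return $n

(: Steps 4-5: one value query per name, OR-ed together, then searched. :)
let $query := cts:or-query(
  for $n in $names
  return cts:element-value-query(xs:QName("name"), $n))
let $candidates := cts:search(//item, $query)

(: Step 7: group by name, addr and month (first 7 characters of the date). :)
for $name in fn:distinct-values($candidates/name)
let $by-name := $candidates[name = $name]
for $addr in fn:distinct-values($by-name/addr)
let $by-addr := $by-name[addr = $addr]
for $month in fn:distinct-values($by-addr/fn:substring(date, 1, 7))
let $group :=
  for $i in $by-addr[fn:substring(date, 1, 7) eq $month]
  order by fn:string($i/date) descending   (: newest first (assumed rule) :)
  return $i
(: Everything after the first 3 in each group is a surplus session. :)
return fn:subsequence($group, 4)
```

The query returns only the surplus sessions, so you can inspect them before 
deciding what to delete. The frequency check in steps 2-3 counts a name 
across all months, so it is just a coarse pre-filter; the per-month limit is 
applied in the last step.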

You could do something similar by first looking at the dates (maybe with a 
range index on the date element), then getting all the sessions for October, 
all for September, and so on, then looking for the unique names out of those.  
Which approach is more efficient will depend on the distribution of your 
data.
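
A minimal sketch of that date-first variant, for one month, might look like 
this. It assumes a range index of type date on <date> in addition to the 
string range index on <name>, and the month shown is only an example value.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes an xs:date range index on <date> and a string
   range index on <name>. The month below is an arbitrary example. :)
let $start := xs:date("2009-10-01")
let $end   := $start + xs:yearMonthDuration("P1M")

(: Restrict to sessions dated within the month. :)
let $month-query := cts:and-query((
  cts:element-range-query(xs:QName("date"), ">=", $start),
  cts:element-range-query(xs:QName("date"), "<",  $end)))

(: Unique names within that month that occur more than 3 times there. :)
for $n in cts:element-values(xs:QName("name"), (), (), $month-query)
where cts:frequency($n) gt 3
return $n
```

Here the frequency is computed within the month, so the names returned are 
exactly the ones that have more than 3 sessions in that month and need 
further processing.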

-Danny

From: [email protected] 
[mailto:[email protected]] On Behalf Of Tony Mariella
Sent: Friday, October 30, 2009 6:56 AM
To: [email protected]
Subject: [MarkLogic Dev General] De-Duping Data

If I have my MarkLogic database set up and there are thousands of sessions in 
the DB, each session looks something like this:

<item>
   <name></name>
   <addr></addr>
   <date></date>
   <size></size>
   <score></score>
   <id></id>
</item>

I want to write a query that goes through the entire database and uses the 
criteria name, addr, and date to give me all the duplicate entries in the DB. 
I want to keep 3 sessions for each month for each name/addr. How can I 
fashion a query that helps me do this?

Tony Mariella
Raytheon Company

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
