Hi Tony,

Here are a few things for you to think about.

1) What are your criteria for the name/addr being "the same"? For example, do they need to be codepoint-for-codepoint the same, or is it enough for them to both match the same search?
2) If name/addr are a key, can you put them in the same field? That might make it easier.
3) You say thousands of sessions. Is that a max number, or do you expect that to grow a lot? A query that returns and processes 1,000 documents is a bit different from one that returns and processes 99,000 documents. Also, can you size the system accordingly?

It seems to me you can do this a brute-force way by first finding all the documents. Here is one strategy I can think of (it will work better if name/addr is in a single element):

1) Put a string range index on name (or name/addr if a single element).
2) Use cts:element-values to get all the unique names (or name/addr).
3) Get the frequency of each result (cts:frequency). If I understand your requirement, you can throw out any results that have a count of 3 or fewer.
4) Take the list of names and use them as phrases, and create a cts:or-query of all the phrases.
5) Pass the cts:or-query into cts:search. This will return all of the sessions whose name (or name/addr) occurs 4 or more times.
6) If you have addr separate, you would need to similarly iterate over those, and take the results and add them to your or-query.
7) Now that you have all the candidate results, process the dates and figure out the ones to keep based on your criteria.

You could do something similar by first looking at the dates (maybe with a range index on the date element), then getting all the sessions for October, all for September, and so on, then looking for the unique names out of those. Which approach is more efficient will, I think, depend on the distribution of your data.
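To make the last step of the first strategy (deciding which candidate sessions to keep) a bit more concrete, here is a rough sketch of the grouping logic. It is written in Python rather than XQuery, purely to show the idea, and it assumes several things that are not in your message: the candidates have already been pulled out of the database into simple records, the dates are ISO strings such as "2009-10-30", name and addr are compared exactly, and "keep 3" means the three earliest sessions in each month. The function name split_keep_and_extra is made up for this example.

```python
from collections import defaultdict

KEEP_PER_MONTH = 3  # from the question: keep 3 sessions per month per name/addr


def split_keep_and_extra(sessions):
    """Group sessions by (name, addr, month) and split each group into
    the sessions to keep and the surplus duplicates."""
    groups = defaultdict(list)
    for s in sessions:
        month = s["date"][:7]  # assumes ISO dates: "2009-10-30" -> "2009-10"
        groups[(s["name"], s["addr"], month)].append(s)

    keep, extra = [], []
    for members in groups.values():
        # Assumption: keep the three earliest by date, ties broken by id.
        members.sort(key=lambda s: (s["date"], s["id"]))
        keep.extend(members[:KEEP_PER_MONTH])
        extra.extend(members[KEEP_PER_MONTH:])
    return keep, extra


# Hypothetical data: five October sessions and one September session
# for the same name/addr.
sessions = [
    {"id": i, "name": "a", "addr": "x", "date": "2009-10-%02d" % (i + 1)}
    for i in range(5)
] + [{"id": 9, "name": "a", "addr": "x", "date": "2009-09-15"}]

keep, extra = split_keep_and_extra(sessions)
print(sorted(s["id"] for s in extra))  # prints [3, 4]
```

If you would rather keep the three highest-scoring sessions, only the sort key needs to change. The same grouping could be done in XQuery over the cts:search results.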
-Danny

From: [email protected] [mailto:[email protected]] On Behalf Of Tony Mariella
Sent: Friday, October 30, 2009 6:56 AM
To: [email protected]
Subject: [MarkLogic Dev General] De-Duping Data

If I have my Marklogic database setup and there are thousands of sessions in the DB, each session looks something like this:

<item>
  <name></name>
  <addr></addr>
  <date></date>
  <size></size>
  <score></score>
  <id></id>
</item>

I want to write a query that goes through the entire database and uses the following criteria name, addr, date to give me all the duplicate entries in the DB. I want to keep 3 sessions for each month for each name/addr.

How can I fashion a query that helps me to do this?

Tony Mariella
Raytheon Company
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
