Hi, Winfred:

It might work best to have a changed flag in the rows in the source database. That is, each time you dump the database by:

* Incrementing the value for tracking changed rows.
* Selecting the rows with the old changed value to create the dump.
* Updating the rows with the old changed value to the unchanged value.
* Inserting documents for the dumped rows.

If you need to delete rows from the source database, you might defer the deletion until creating a dump and use a separate flag to track whether the change is an insert / update or a delete. That way, you can produce a dump that can be used to both insert and delete documents in MarkLogic.

If the source database is inalterable, another approach might be:

* Store the record hash in the document within MarkLogic and create a range index on it.
* Outside of MarkLogic, calculate the record hash for each row in the dump.
* Use a lexicon call to get the record hashes for the documents persisted in MarkLogic.
* Delete documents whose record hashes are only in MarkLogic (and thus were either deleted or updated in the source database).
* Insert documents for rows whose record hashes are only in the source database (and thus were either inserted or updated in the source database).

A rough XQuery sketch of this second approach follows below, after your quoted message.

Hoping that helps,

Erik Hennum

________________________________
From: [email protected] [[email protected]] on behalf of Winfred Zwaard [[email protected]]
Sent: Sunday, March 02, 2014 11:47 AM
To: [email protected]
Subject: [MarkLogic Dev General] Fwd: Question about ML performance

Hi,

I am rather new to MarkLogic and am running into some performance problems. Here is what I am trying to accomplish:

- I have a set of MarkLogic XML documents, each containing a record from my source database. Each document is identified by the primary key from my source.
- Periodically I create a dump of my source database.
- Then I try to identify the records that have changed compared to the previous dump.
- My intention is to do this by taking the PK from my new dump, creating a 64-bit hash of the full record, and then comparing it to the hash from the previous dump.

For a couple of hundred records this performs quite OK, but I get performance problems when running it against thousands of records or more. I tried adding a range index, but the results were no faster. Can you help me out?

I have included a script to create a dummy base set of XML documents, a script to create a new dummy database dump (with every 100th record changed), and a script to check which records have changed. This last script works functionally, but it is very slow. Do you have better ideas? Would it, for instance, help to create a separate set of documents containing only the primary keys and hash totals to check?

Thanks for your help,

Winfred Zwaard
DIKW consultancy
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
