Hi, Winfred:

It might work best to have a changed flag on the rows in the source database.  
That is, each time you dump the database, you would:

*  Increment the value for tracking changed rows.
*  Select the rows with the old changed value to create the dump.
*  Update the rows with the old changed value to the unchanged value.
*  Insert documents for the dumped rows.

If you need to delete rows from the source database, you might defer the 
deletion until you create a dump and use a separate flag to track whether the 
change is an insert/update or a delete.  That way, you can produce a dump that 
can be used both to insert and to delete documents in MarkLogic.
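
On the MarkLogic side, applying such a dump could look roughly like the sketch 
below.  The dump path, the <rows>/<row> structure, the change attribute, and 
the <pk> element are assumptions for illustration, not names from your schema:

    xquery version "1.0-ml";

    (: Sketch: apply a dump in which each row carries a change flag.
       The dump path, <rows>/<row>, @change, and <pk> names are assumptions. :)
    let $dump := xdmp:document-get("/tmp/dump.xml")
    for $row in $dump/rows/row
    let $uri := fn:concat("/records/", $row/pk, ".xml")
    return
      if ($row/@change eq "delete")
      then
        (: drop the document if it is still present :)
        if (fn:doc-available($uri)) then xdmp:document-delete($uri) else ()
      else
        (: insert a new document or overwrite the existing one :)
        xdmp:document-insert($uri, $row)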

If the source database cannot be altered, another approach might be:

*  Store the record hash in the document within MarkLogic and create a range 
index on it.
*  Outside of MarkLogic, calculate the record hash on each row in the dump.
*  Use a lexicon call to get the record hashes for the documents persisted in 
MarkLogic.
*  Delete documents with record hashes that are only in MarkLogic (and thus 
were either deleted or updated in the source database).
*  Insert documents for rows whose record hashes are only in the source 
database (and thus were either inserted or updated in the source database).
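
To make that concrete, here is a minimal XQuery sketch of the comparison.  It 
assumes each persisted document stores the row hash in a <hash> element with an 
unsignedLong element range index on it, and that the dump is a file of <row> 
elements with a <pk> child; the element names, URI scheme, and file path are 
illustrative rather than taken from your schema:

    xquery version "1.0-ml";

    let $dump       := xdmp:document-get("/tmp/dump.xml")
    let $dump-rows  := $dump/rows/row
    (: hash each row as dumped; must be computed the same way on every run :)
    let $dump-hashes := for $row in $dump-rows return xdmp:hash64(xdmp:quote($row))
    (: read the persisted hashes from the range index, not from the documents :)
    let $ml-hashes  := cts:element-values(xs:QName("hash"))
    (: hashes only in MarkLogic: the row was deleted or updated at the source :)
    let $stale      := $ml-hashes[fn:not(. = $dump-hashes)]
    (: rows only in the dump: the row is new or was updated at the source :)
    let $fresh-rows := $dump-rows[fn:not(xdmp:hash64(xdmp:quote(.)) = $ml-hashes)]
    return (
      (: delete documents for rows that disappeared from the source; updated
         rows are skipped here because the insert below overwrites their URIs :)
      for $doc in cts:search(fn:collection(),
                    cts:element-range-query(xs:QName("hash"), "=", $stale))
      where fn:not($doc//pk = $dump-rows/pk)
      return xdmp:document-delete(xdmp:node-uri($doc)),

      (: insert documents for new rows, overwrite documents for updated rows :)
      for $row in $fresh-rows
      return xdmp:document-insert(
        fn:concat("/records/", $row/pk, ".xml"),
        <record>{ $row/*, <hash>{ xdmp:hash64(xdmp:quote($row)) }</hash> }</record>)
    )

The main win is that cts:element-values() resolves the persisted hashes from 
the range index instead of reading every document, so the comparison cost no 
longer grows with the size of each document.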


Hoping that helps,


Erik Hennum

________________________________
From: [email protected] 
[[email protected]] on behalf of Winfred Zwaard 
[[email protected]]
Sent: Sunday, March 02, 2014 11:47 AM
To: [email protected]
Subject: [MarkLogic Dev General] Fwd: Question about ML performance

Hi,

I am rather new to MarkLogic and am running into some performance problems.

Here is what I am trying to accomplish:
- I have a set of ML XML documents, each containing a record from my source 
database. Each document is identified by the primary key (PK) from my source.
- Periodically I create a dump of my source database.
- Then I try to identify the records that have changed compared to the previous 
time I made my database dump.
- My intention is to do this by taking the PK from my new dump, creating a 
hash64 for the full record, and then comparing that to the hashes from the 
previous dump.

For a couple of hundred records this performs quite OK, but I run into 
performance problems when running it against thousands of records or more.

I tried adding a range index, but that did not improve performance. Can you 
help me out? I have included the script to create a dummy base set of XML 
documents, a script to create a new dummy database dump (with every 100th 
record having a change), and a script to check which records have changed. The 
latter script works functionally, but it is very slow.

Do you have any better ideas? Would it help, for instance, to create a separate 
set of documents that contains only the primary keys and hash totals to check?

Thanks for your help,
Winfred Zwaard

DIKW consultancy




