> as we want to be able to use the point-in-time query feature to track
> document changes over time
Point-in-time queries <https://docs.marklogic.com/guide/app-dev/point_in_time> are not designed for versioning as I think you're describing it. The timestamps are internal bookkeeping. (Think of them as monotonically increasing integers rather than wall-clock readings.) Querying at a specific timestamp relies on _not_ merging deleted fragments. For short windows, like minutes or even hours depending on your workload, this is OK. However, merging is necessary and useful to maintain the health of a database.

A good use case for point-in-time queries is getting a consistent snapshot of query results across multiple requests. For example, run a query and get the first page of results. Capture the timestamp at which that query ran. Run the queries for each subsequent page at that initial timestamp until there are no more pages. This allows you to split a query across multiple transactions, such as in a multi-threaded export. (This is, in fact, what mlcp does when employing the snapshot option on an export <https://docs.marklogic.com/guide/mlcp/export#id_43184>.)

> we'd like to avoid reinserting them as we want to be able to use the
> point-in-time query feature to track document changes over time

You can do this in mlcp with a transform <https://docs.marklogic.com/guide/mlcp/import#id_82518>. The transform runs server-side, so you'll still need to send the data to the server to check the hash. (Put a range index on the hash and this check will be speedy.) However, you can avoid the indexing/write cost by returning a null/empty sequence from your ingest transform. For documents that have changed, you could capture the new hash as part of the same transform.
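To make the pagination pattern concrete, here is a minimal sketch in plain JavaScript. This is an illustration of the pattern only, not MarkLogic API: `queryAtTimestamp` is a hypothetical stand-in for running a search at a fixed point-in-time timestamp, simulated here against an in-memory dataset.

```javascript
// Snapshot pagination sketch: capture a timestamp on the first page,
// then fetch every subsequent page at that same timestamp so concurrent
// updates cannot shift the result set mid-export.
// NOTE: queryAtTimestamp is a hypothetical placeholder for the server
// call; here it reads from a frozen copy of the data.

const PAGE_SIZE = 2;

// Pretend database: each "timestamp" maps to the documents visible then.
const snapshots = {
  100: ['doc1', 'doc2', 'doc3', 'doc4', 'doc5'],
};

function queryAtTimestamp(timestamp, pageStart) {
  // Return one page of results as of the given timestamp.
  return snapshots[timestamp].slice(pageStart, pageStart + PAGE_SIZE);
}

function exportAll() {
  const timestamp = 100; // captured when the first query runs
  const results = [];
  let start = 0;
  let page;
  do {
    // Every page is queried at the *initial* timestamp, not "now".
    page = queryAtTimestamp(timestamp, start);
    results.push(...page);
    start += PAGE_SIZE;
  } while (page.length === PAGE_SIZE);
  return results;
}
```

In a real export each `queryAtTimestamp` call could run in its own transaction (or thread), which is the point of pinning the timestamp.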
In pseudo-JavaScript:

    // hashExists and updateWithHash are placeholders for your own logic:
    // look up the stored hash (a range-index query) and attach the new one.
    function transform(doc) {
      if (hashExists(doc)) {
        return null; // unchanged: skip the write entirely
      }
      return updateWithHash(doc, calculateHash(doc));
    }

Justin

--
Justin Makeig
Director, Product Management
MarkLogic
[email protected]

> On Jun 27, 2016, at 9:45 PM, Hans Hübner <[email protected]> wrote:
>
> Hi,
>
> we're planning to use MarkLogic to do regular bulk updates on a larger set of
> documents (~1 million). Many of the documents will be unchanged from their
> previous version, and we'd like to avoid reinserting them as we want to be
> able to use the point-in-time query feature to track document changes over
> time. I've read an old thread in this forum that suggested calculating a
> checksum over each input document and then only writing it to the database if
> the previous version's checksum differs. In that same thread, it was also
> suggested that xqsync could be used.
>
> Now xqsync apparently was replaced by mlcp, and I can find an indication in
> the mlcp documentation that it avoids writing unchanged documents.
>
> Can anyone suggest the best way to approach this?
>
> Thanks!
> Hans
>
> _______________________________________________
> General mailing list
> [email protected]
> Manage your subscription at:
> http://developer.marklogic.com/mailman/listinfo/general
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
