Re: [MarkLogic Dev General] Bulk updates (xqsync vs. mlcp)

Justin Makeig Tue, 28 Jun 2016 13:37:04 -0700

> as we want to be able to use the point-in-time query feature to track 
> document changes over time

Point-in-time queries <https://docs.marklogic.com/guide/app-dev/point_in_time> 
are not designed for versioning, as I think you're describing it. The 
timestamps are internal bookkeeping. (Think of them as monotonically increasing 
integers rather than wall clock readings.) Querying at a specific timestamp 
relies on _not_ merging deleted fragments. For short windows, like minutes or 
even hours, depending on your workload, this is OK. However, merging is 
necessary and useful to maintain the health of a database. A good use case for 
point-in-time queries is to get a consistent snapshot of query results across 
multiple requests. For example, run a query and get the first page of results. 
Capture the timestamp at which that query ran. Run queries for each subsequent 
page at that initial timestamp until there are no more pages. This allows you 
to split queries across multiple transactions, such as a multi-threaded export. 
(This is, in fact, what mlcp does when employing the snapshot option on an 
export <https://docs.marklogic.com/guide/mlcp/export#id_43184>.)
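
To make the paging pattern concrete, here is a minimal, self-contained sketch 
in plain JavaScript. It does not use any MarkLogic API: VersionedStore, put, 
page, and exportAll are hypothetical names invented for illustration, and the 
in-memory store only imitates the idea of a monotonically increasing timestamp 
with old versions kept around (that is, never merged away).

```javascript
// Hypothetical in-memory store (not a MarkLogic API). Every write bumps a
// monotonically increasing counter; superseded versions are kept, not merged.
class VersionedStore {
  constructor() {
    this.timestamp = 0;
    this.versions = []; // entries: { uri, value, created, deleted }
  }

  // Insert or replace a document; returns the timestamp of the write.
  put(uri, value) {
    const ts = ++this.timestamp;
    for (const v of this.versions) {
      if (v.uri === uri && v.deleted === Infinity) v.deleted = ts;
    }
    this.versions.push({ uri, value, created: ts, deleted: Infinity });
    return ts;
  }

  // One page of URIs as they were visible at timestamp ts.
  page(ts, start, size) {
    return this.versions
      .filter((v) => v.created <= ts && ts < v.deleted)
      .map((v) => v.uri)
      .sort()
      .slice(start, start + size);
  }
}

// Capture the timestamp once, then read every page at that same timestamp.
// betweenPages is an optional hook that simulates concurrent writers.
function exportAll(db, pageSize, betweenPages) {
  const ts = db.timestamp;
  const uris = [];
  for (let start = 0; ; start += pageSize) {
    const page = db.page(ts, start, pageSize);
    if (page.length === 0) break;
    uris.push(...page);
    if (betweenPages) betweenPages();
  }
  return uris;
}
```

Because every page is read at the captured timestamp, a document inserted 
while the export is running does not shift the page boundaries or show up in 
the results. The sketch never discards old versions, which is the same 
property that makes long-lived point-in-time queries costly in a real database.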

>  we'd like to avoid reinserting them as we want to be able to use the 
> point-in-time query feature to track document changes over time

You can do this in mlcp with a transform 
<https://docs.marklogic.com/guide/mlcp/import#id_82518>. The transform runs 
server-side. You'll still need to send the data to the server to check the 
hash. (Put a range index on the hash and this check will be speedy.) However, 
you can avoid the indexing/write cost by returning null/empty sequence from 
your ingest transform. You could capture the new hash as part of the same 
transform for documents that have changed. In pseudo-JavaScript:

if(hashExists(doc)) {
  return null;
} else {
  return updateWithHash(doc, calculateHash(doc));
}
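
For anyone who wants to play with the logic outside the server, here is a 
runnable sketch of the same idea in plain JavaScript. Everything in it is an 
assumption made for illustration: the Map stands in for the database, the 
FNV-1a function stands in for whatever digest you would compute on the server, 
and ingestTransform and load are hypothetical names, not the mlcp transform 
interface. It also looks the hash up by URI, where the approach above would 
query a range index on the hash.

```javascript
// Hypothetical in-memory stand-in for the database: URI -> { hash, body }.
const store = new Map();

// 32-bit FNV-1a over UTF-16 code units. A simple, dependency-free stand-in
// for a real digest; not suitable where collisions matter.
function calculateHash(body) {
  let h = 0x811c9dc5;
  for (let i = 0; i < body.length; i++) {
    h ^= body.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

// Same shape as the pseudocode: return null to skip the write, otherwise
// return the document together with its newly calculated hash.
function ingestTransform(uri, body) {
  const hash = calculateHash(body);
  const existing = store.get(uri);
  if (existing !== undefined && existing.hash === hash) {
    return null; // unchanged since the last load
  }
  return { uri, body, hash };
}

// Hypothetical loader: runs each [uri, body] pair through the transform and
// writes only the non-null results. Returns the number of documents written.
function load(batch) {
  let written = 0;
  for (const [uri, body] of batch) {
    const result = ingestTransform(uri, body);
    if (result !== null) {
      store.set(result.uri, { hash: result.hash, body: result.body });
      written++;
    }
  }
  return written;
}
```

Loading the same batch twice writes every document the first time and none 
the second time; only documents whose content has changed are written again.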

Justin

--
Justin Makeig
Director, Product Management
MarkLogic
[email protected]

> On Jun 27, 2016, at 9:45 PM, Hans Hübner <[email protected]> wrote:
> 
> Hi,
> 
> we're planning to use MarkLogic to do regular bulk updates on a larger set of 
> documents (~1 million).  Many of the documents will be unchanged from their 
> previous version, and we'd like to avoid reinserting them as we want to be 
> able to use the point-in-time query feature to track document changes over 
> time.  I've read an old thread in this forum that suggested calculating a 
> checksum over each input document and then only writing it to the database if 
> the previous version's checksum differs.  In that same thread, it was also 
> suggested that xqsync could be used.
> 
> Now xqsync apparently was replaced by mlcp, and I can find an indication in 
> the mlcp documentation that it avoids writing unchanged documents.
> 
> Can anyone suggest the best way to approach this?
> 
> Thanks!
> Hans
> 
> _______________________________________________
> General mailing list
> [email protected]
> Manage your subscription at: 
> http://developer.marklogic.com/mailman/listinfo/general
