Re: [MarkLogic Dev General] Bulk updates (xqsync vs. mlcp)

Danny Sokolsky Wed, 29 Jun 2016 14:30:44 -0700

You are correct, Hans, setting a merge timestamp does not disable merges.

The downside of never getting rid of deleted fragments is that your database 
can grow without bound, with no sensible way to manage it.  Point-in-time 
queries are really meant for relatively short durations.  Some of the 
consequences of keeping all old versions are:

·         Relevance:  relevance is calculated based on all fragments in the 
database, so if, for example, you happened to have 1,000,000 versions of a 
particular document because a bug in your application code kept updating the 
same document (or for whatever reason), that would probably make results less 
relevant than they would otherwise be.

·         Manageability:  there is no way to manage the old versions; they are 
all always there.

·         Size:  your database might get very large.
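
To make the relevance point concrete, here is a rough Python sketch. It uses a textbook inverse-document-frequency formula purely as an illustration; it is not MarkLogic's actual scoring formula, and the fragment counts are made-up numbers. It only shows the direction of the effect when stale versions are counted along with live fragments.

```python
import math

def idf(total_fragments: int, fragments_with_term: int) -> float:
    """Textbook IDF, used only as an illustration (not MarkLogic's scoring)."""
    return math.log(total_fragments / fragments_with_term)

# Hypothetical database: 1,000,000 live documents, 100 of which contain a term.
clean = idf(1_000_000, 100)

# Same database, but one matching document was rewritten 1,000,000 times and
# every old version is still around as a deleted fragment.
stale_versions = 1_000_000
bloated = idf(1_000_000 + stale_versions, 100 + stale_versions)

print(f"idf without stale versions: {clean:.2f}")
print(f"idf with stale versions:    {bloated:.2f}")
```

In this toy calculation the term looks far less distinctive once the stale versions are counted, even though nothing about the live data changed.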

Point-in-time queries are very useful for things like:

·         Pagination: if you have a requirement that many pages of search 
results give the exact same answers for a relatively short period of time (for 
example, an hour, or a day), you can keep the last day's fragments around and 
query them at a point in time.

·         Publishing a new version of documents:  If you want to load a new 
version of documents (say a magazine or similar) and test it in your production 
system while still having the old version be production, you can set the merge 
timestamp, make the users of the old version query at a point in time, load the 
new versions, and test the new stuff at the current timestamp.  There are lots 
of other ways to do this, but point in time is one way.
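
The publishing workflow above can be sketched with a toy model in Python. This is not MarkLogic code: `ToyDatabase`, `Fragment`, and their methods are hypothetical names invented for illustration, and the merge rule is a simplification. It assumes only what this thread states: timestamps increase monotonically, an update marks the old fragment as deleted, and a merge may drop deleted fragments that are older than the merge timestamp.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fragment:
    uri: str
    content: str
    created: int                   # timestamp at which this version became visible
    deleted: Optional[int] = None  # timestamp at which it was replaced

class ToyDatabase:
    def __init__(self) -> None:
        self.fragments: List[Fragment] = []
        self.clock = 0            # monotonically increasing, not wall-clock time
        self.merge_timestamp = 0  # 0 means "merges may drop every deleted fragment"

    def write(self, uri: str, content: str) -> int:
        self.clock += 1
        for f in self.fragments:
            if f.uri == uri and f.deleted is None:
                f.deleted = self.clock  # the old version is only marked as deleted
        self.fragments.append(Fragment(uri, content, self.clock))
        return self.clock

    def merge(self) -> None:
        """Drop deleted fragments that no query at or after the horizon can see."""
        horizon = self.merge_timestamp or self.clock
        self.fragments = [f for f in self.fragments
                          if f.deleted is None or f.deleted > horizon]

    def read(self, uri: str, at: Optional[int] = None) -> Optional[str]:
        at = self.clock if at is None else at
        for f in self.fragments:
            if f.uri == uri and f.created <= at and (f.deleted is None or f.deleted > at):
                return f.content
        return None

db = ToyDatabase()
old = db.write("/magazine/issue.xml", "June issue")
db.merge_timestamp = old   # keep everything visible at `old` or later
db.write("/magazine/issue.xml", "July issue (under test)")
db.merge()                 # merges still run; the June version survives

print(db.read("/magazine/issue.xml", at=old))  # readers pinned to the old timestamp
print(db.read("/magazine/issue.xml"))          # testers at the current timestamp

db.merge_timestamp = 0     # release the old version
db.merge()
print(db.read("/magazine/issue.xml", at=old))  # the old version is gone now
```

Note that the merge still runs while the merge timestamp is set; it simply keeps the fragments that a query at that timestamp would need. Once the timestamp is cleared, the old version disappears at the next merge, which is why this approach suits short windows better than long-term versioning.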

It might be tempting to use point-in-time queries for generic versioning, but 
that is usually not what you want.

Does that help to clarify?

-Danny

From: [email protected] 
[mailto:[email protected]] On Behalf Of Hans Hübner
Sent: Wednesday, June 29, 2016 12:19 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Bulk updates (xqsync vs. mlcp)

Justin,

thank you for the additional documentation pointer.  From what I read, I 
understand that merging is a useful operation and that merges should not be 
disabled.  I can agree to that, but as far as I have understood, the 
point-in-time feature does not require that we disable merges.  It just 
requires that the merge timestamp is set to the earliest point back in time 
where we want to be able to look back to.  Does setting the merge timestamp 
automatically disable the merges?

What I am still missing is why the "Inside MarkLogic" document describes how 
MVCC timestamps can be used to implement "Time Travel" and the "Application 
Developer's Guide" describes point-in-time queries if you (assuming that you 
speak for MarkLogic) advise against using them.  The "Application Developer's 
Guide" in particular describes how such queries work, in detail, and it does 
not mention that one should avoid the technique.

Is the documentation accurate?  Under what circumstances do you recommend using 
the point-in-time technique described in the guide?  Does the point-in-time 
query technique only work if merges are disabled?

Hans

On Wed, Jun 29, 2016 at 7:40 PM, Justin Makeig 
<[email protected]<mailto:[email protected]>> wrote:
> Can you elaborate what you mean by "maintain the health of a database"?  If 
> we'd decide that we never want to delete any data in a certain MarkLogic 
> database so that we can roll back to any point in time, what would be the down 
> sides?  How would the database become unhealthy?

Please take a look at the docs on merging, specifically the section, "Merges 
Are Good" <https://docs.marklogic.com/guide/admin/merges#id_43904>. Merging is 
the way that MarkLogic manages its internal data to support efficient and 
consistent ingest and query I/O. It is an internal process and completely 
orthogonal to how you version your documents.

What you describe sounds more like temporal versioning. Please take a look at 
MarkLogic's bitemporal APIs <https://docs.marklogic.com/guide/temporal/intro>. 
With bitemporal management you maintain an immutable copy of the entire history 
of your data that you can query at any point in time. The APIs do all of the 
sophisticated work maintaining versions securely. The "bi" in bitemporal allows 
you to query the valid time of the document (e.g. a trade was effective on 
2016-06-01) as you knew it at any point in time (e.g. the trade wasn't recorded 
until 2016-06-02 and then it was corrected on 2016-06-05).
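
As a rough illustration of the two time axes, here is a small Python sketch built around the trade example. It is a toy model, not MarkLogic's temporal API: the `Version` class, the `as_known_on` function, and the document contents ("100 shares", "150 shares") are hypothetical, and the valid-time end is omitted for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class Version:
    content: str
    valid_from: date   # when the fact is true in the real world
    system_from: date  # when this version was recorded
    system_to: date    # when it was superseded (date.max if still current)

# The trade was effective on 2016-06-01, recorded on 2016-06-02,
# and corrected on 2016-06-05. Old versions are never overwritten.
history: List[Version] = [
    Version("trade, 100 shares", date(2016, 6, 1), date(2016, 6, 2), date(2016, 6, 5)),
    Version("trade, 150 shares", date(2016, 6, 1), date(2016, 6, 5), date.max),
]

def as_known_on(valid_on: date, known_on: date) -> Optional[str]:
    """What did we believe on `known_on` about the state of the world on `valid_on`?"""
    for v in history:
        if v.valid_from <= valid_on and v.system_from <= known_on < v.system_to:
            return v.content
    return None

effective = date(2016, 6, 1)
print(as_known_on(effective, date(2016, 6, 1)))  # not recorded yet
print(as_known_on(effective, date(2016, 6, 3)))  # original record
print(as_known_on(effective, date(2016, 6, 6)))  # corrected record
```

The same valid date gives three different answers depending on when the question is asked, which is the kind of history an application-level versioning scheme needs and which internal MVCC timestamps do not model.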

Justin

On Jun 28, 2016, at 9:55 PM, Hans Hübner 
<[email protected]<mailto:[email protected]>> wrote:

On Tue, Jun 28, 2016 at 10:36 PM, Justin Makeig 
<[email protected]<mailto:[email protected]>> wrote:
> as we want to be able to use the point-in-time query feature to track 
> document changes over time

Point-in-time queries <https://docs.marklogic.com/guide/app-dev/point_in_time> 
are not designed for versioning, as I think you're describing it. The 
timestamps are internal bookkeeping. (Think of them as monotonically increasing 
integers rather than wall clock readings.) Querying at a specific timestamp 
relies on _not_ merging deleted fragments. For short windows, like minutes or 
even hours, depending on your workload, this is OK. However, merging is 
necessary and useful to maintain the health of a database.

Can you elaborate what you mean by "maintain the health of a database"?  If 
we'd decide that we never want to delete any data in a certain MarkLogic 
database so that we can roll back to any point in time, what would be the down 
sides?  How would the database become unhealthy?

We have an existing application that makes use of another database system 
(Datomic) exactly in that way, and we would like to carry it over to MarkLogic. 
The "Inside MarkLogic" document describes point-in-time queries as "Time 
Travel", but what you write seems to say that using timestamps that way would 
be detrimental to the health of the database, so I'd like to learn more before 
we convert.

Thanks!
Hans

--
LambdaWerk GmbH
Oranienburger Straße 87/89
10178 Berlin
Phone: +49 30 555 7335 0
Fax: +49 30 555 7335 99

HRB 169991 B Amtsgericht Charlottenburg
USt-ID: DE301399951
Geschäftsführer:  Hans Hübner

http://lambdawerk.com/

_______________________________________________
General mailing list
[email protected]<mailto:[email protected]>
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
