Agreed.  Except I'd have the Java (or .NET) loader do the MD5 sum in both 
cases, because it needs to be Java's byte-for-byte view of the document that 
gets compared.  Java can attach it to the document in MarkLogic as a property.
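
To make the byte-for-byte point concrete, here is a minimal sketch of the loader-side hash. The names Md5Util and md5Hex are made up for illustration, and the step that attaches the value to the document as a property is deliberately left out, since it depends on how your loader talks to MarkLogic.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the class and method names are illustrative only.
class Md5Util {
    /** Returns the lowercase hex MD5 digest of exactly the bytes given. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }
}
```

The idea is to hash the same byte array the loader is about to send, rather than a re-serialized copy, so that an unchanged file always produces the same sum.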

For deletions, you may want to use a lexicon (range index) to get an efficient 
full listing of all file URIs and MD5 sum values.  Then your Java code can do a 
manifest comparison and quickly know which files to insert and which to delete.
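
A rough sketch of that manifest comparison, assuming both sides have already been reduced to a map of URI to MD5 sum (one built from the zip contents, one from the lexicon listing). ManifestDiff and its fields are hypothetical names, and fetching the listing from MarkLogic is not shown.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical holder for the result of comparing two URI -> MD5 manifests.
class ManifestDiff {
    final Set<String> toInsert = new TreeSet<>(); // new or changed files
    final Set<String> toDelete = new TreeSet<>(); // in the database but not in the drop

    /**
     * local:  URI -> MD5 computed from the incoming files.
     * remote: URI -> MD5 as currently stored in the database.
     */
    static ManifestDiff compare(Map<String, String> local, Map<String, String> remote) {
        ManifestDiff diff = new ManifestDiff();
        for (Map.Entry<String, String> entry : local.entrySet()) {
            String remoteSum = remote.get(entry.getKey());
            // A missing remote sum (null) or a different sum both mean "insert".
            if (!entry.getValue().equals(remoteSum)) {
                diff.toInsert.add(entry.getKey());
            }
        }
        for (String uri : remote.keySet()) {
            if (!local.containsKey(uri)) {
                diff.toDelete.add(uri);
            }
        }
        return diff;
    }
}
```

With roughly 36,000 entries on each side this is a purely in-memory pass, so the only documents that ever cross the wire are the ones in toInsert.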

-jh-

On Mar 5, 2010, at 12:00 PM, Kelly Stirman wrote:

> Hi David,
> 
> You could do the following:
> 
> 1) calculate the MD5 hash of the file during load (MarkLogic can do this)
> 2) store the MD5 hash as a property (so it works for all file types)
> 3) before the insert, check whether the file already exists and, if it does, 
> whether the MD5 hash is the same
> 4) if it is the same, do nothing; otherwise insert and set the MD5 hash 
> property at the same time
> 
> I would use xdmp:exists(fn:doc("foo")) to see if the doc already exists.
> 
> Kelly
> 
> Message: 7
> Date: Fri, 5 Mar 2010 11:48:35 -0800
> From: "Lee, David" <[email protected]>
> Subject: [MarkLogic Dev General] "Smart" bulk updates
> To: <[email protected]>
> Message-ID: <dd37f70d78609d4e9587d473fc61e0a716d92...@postoffice>
> Content-Type: text/plain; charset="us-ascii"
> 
> I have a task coming up where I need to update a large set of XML
> and binary files daily from an outside source.
> 
> This is about 6000 xml docs and 30,000 images.  About 2GB total.
> 
> I get these from an outside source as one huge 1GB zip file.  I expect
> maybe only 1% of the files to have changed in any drop, maybe even less
> (0.1%?).
> 
> For any changed files I need to generate some additional data (outside
> of ML), then upload the files and update some properties.
> 
> I *could* just update ALL files every day, but I'd like to be more
> efficient than that, considering the likely change rate is so low.
> 
> I'm sure this is a common problem (not unlike, say, rsync).
> 
> What do people do in this case?
> 
> I was thinking of storing a checksum (MD5?) as a property of each file,
> then comparing with the new files by listing the directory tree from ML.
> 
> Another idea is to keep a filesystem cache of what's in ML and do the
> comparison there.
> 
> My guess is it would be just as (in)efficient to upload each file
> to compare within ML as to just update the document,
> or vice versa: to fetch each file from ML just to compare with the
> filesystem.  So I don't want to go that route.
> 
> Then there is also the deletion issue: I need to detect files which
> are no longer in the dataset and delete them.
> 
> Any suggestions or ideas?  Has anyone done something like this before?
> Are there built-in MarkLogic features that could help?
> 
> Thanks,
> 
> -David
> 
> ----------------------------------------
> David A. Lee
> Senior Principal Software Engineer
> Epocrates, Inc.
> [email protected]
> 812-482-5224
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
