Agreed. Except I'd have the Java (or .NET) loader do the MD5 sum in both
cases, because it needs to be Java's byte-for-byte view of the document that
gets compared. Java can attach it to the document in MarkLogic as a property.
For deletions, you may want to use a lexicon (range index) to get an efficient
full listing of all file URIs and MD5 sum values. Then your Java code can do a
manifest comparison and quickly know which files to insert and which to delete.
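
A rough sketch of that manifest comparison in Java, assuming both sides have already been reduced to a URI -> MD5 map. The class and method names are made up for illustration, and pulling the server-side map out of MarkLogic (the lexicon query) is not shown here:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of any MarkLogic API.
public class ManifestDiff {

    /** Hex MD5 of the exact bytes the loader is about to send. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** URIs that are new locally or whose checksum differs from the server's. */
    static Set<String> toInsert(Map<String, String> local, Map<String, String> remote) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : local.entrySet()) {
            if (!e.getValue().equals(remote.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** URIs the server still has but the new drop no longer contains. */
    static Set<String> toDelete(Map<String, String> local, Map<String, String> remote) {
        Set<String> result = new TreeSet<>(remote.keySet());
        result.removeAll(local.keySet());
        return result;
    }

    private static String md5Of(String s) {
        return md5Hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<String, String> local = new HashMap<>();   // from the unpacked zip
        local.put("/a.xml", md5Of("v2"));
        local.put("/b.jpg", md5Of("same"));
        local.put("/d.xml", md5Of("new"));

        Map<String, String> remote = new HashMap<>();  // from the lexicon listing
        remote.put("/a.xml", md5Of("v1"));
        remote.put("/b.jpg", md5Of("same"));
        remote.put("/c.jpg", md5Of("old"));

        System.out.println("insert: " + toInsert(local, remote)); // [/a.xml, /d.xml]
        System.out.println("delete: " + toDelete(local, remote)); // [/c.jpg]
    }
}
```

Only the URIs in the insert set need the extra out-of-ML processing and upload; everything else is left alone.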
-jh-
On Mar 5, 2010, at 12:00 PM, Kelly Stirman wrote:
> Hi David,
>
> You could do the following:
>
> 1) calculate the MD5 hash of the file during load (MarkLogic can do this)
> 2) store the MD5 hash as a property (so it works for all file types)
> 3) before the insert, check whether the file already exists and, if it does,
> whether the MD5 hash is the same
> 4) if the same, do nothing; otherwise insert and set the MD5 hash property at
> the same time
>
> I would use xdmp:exists(fn:doc("foo")) to see if the doc already exists.
>
> Kelly
>
> Message: 7
> Date: Fri, 5 Mar 2010 11:48:35 -0800
> From: "Lee, David" <[email protected]>
> Subject: [MarkLogic Dev General] "Smart" bulk updates
> To: <[email protected]>
> Message-ID: <dd37f70d78609d4e9587d473fc61e0a716d92...@postoffice>
> Content-Type: text/plain; charset="us-ascii"
>
> I have a task coming up where I need to daily update a large set of xml
> and binary files from an outside source.
>
> This is about 6000 xml docs and 30,000 images. About 2GB total.
>
>
>
> I get these from an outside source as one huge 1GB zip file. I expect
> maybe only 1% of the files to have changed in any drop, maybe even less
> (.1%?).
>
>
>
> For any changed files I need to generate some additional data (outside
> of ML) then upload the files and update some properties.
>
> I *could* just update ALL files every day, but I'd like to be more
> efficient than that considering the likely change rate is so low.
>
>
>
> I'm sure this is a common problem (not unlike say rsync) ...
>
> What do people do in this case?
>
> I was thinking of storing a checksum (MD5?) as a property of each file
> then comparing with the new files by listing the directory tree from ML.
>
> Another idea is to keep a filesystem cache of what's in ML and do the
> comparison there.
>
>
>
> My guess is it would be just as (in)efficient to try to upload each file
> to compare within ML as just updating the document,
>
> or vice versa - fetch each file from ML just to compare with the
> filesystem. So I don't want to go that route.
>
>
>
> Then there is also the deleted issue ... I need to detect files which
> are no longer in the dataset and delete them.
>
>
>
>
>
>
>
> Any suggestions or ideas? Has anyone done something like this before?
> Are there built-in MarkLogic features that could help?
>
>
>
>
>
> Thanks;
>
>
>
> -David
>
>
>
>
>
>
>
> ----------------------------------------
>
> David A. Lee
>
> Senior Principal Software Engineer
>
> Epocrates, Inc.
>
> [email protected] <mailto:[email protected]>
>
> 812-482-5224
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general