I've been thinking of going ahead and prototyping this.  That is, a
MarkLogic "rsync"-type command.

From my experimentation, the approach I think would work best is the one
described below (see the included email thread).

That is, to set a property on every file that records its MD5 and length
(file length in bytes prior to uploading to ML).

Then, using client-side logic, compare the new list of files to what's on
ML and generate a set of update/insert/delete commands.
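
A minimal sketch of that client-side comparison, in Python rather than xmlsh. The function names `local_manifest` and `plan_sync` are made up for illustration, and the sketch assumes the server-side list has already been fetched into a dict of path to (md5, length); how that list is read from the ML properties is not shown.

```python
import hashlib
from pathlib import Path


def local_manifest(root):
    """Map each file's path (relative to root) to (md5 hex digest, length in bytes)."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            manifest[path.relative_to(root).as_posix()] = (
                hashlib.md5(data).hexdigest(),
                len(data),
            )
    return manifest


def plan_sync(local, remote):
    """Compare two {path: (md5, length)} manifests.

    Returns the operations needed to make the remote side match the local side.
    """
    return {
        # Present locally, missing on the server.
        "insert": sorted(p for p in local if p not in remote),
        # Present on both sides, but checksum or length differs.
        "update": sorted(p for p in local if p in remote and local[p] != remote[p]),
        # Present on the server, gone locally.
        "delete": sorted(p for p in remote if p not in local),
    }
```

Swapping the two arguments to `plan_sync` would give the reverse plan, i.e. the operations needed to bring a local filesystem in line with the server.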

I've already done this for a special case and it worked well, so I'm
thinking of cleaning up the code and making it general purpose.

Although my purpose is updating ML ... there's no reason the
reverse couldn't also be done, to update a local filesystem with minimal
operations.

 

The questions I have are :

 

1) Would anyone be interested in this ?

 

2) How 'offensive' is storing a property on documents ?  Would this be a
'deal killer' ?  Should it be in a private namespace ?

 

3) How efficient is storing properties ? Does having to
read, store, and update properties negate any time savings from avoiding
the load ?

 That is, I suspect that for documents of some sizes it is actually faster
just to push them unconditionally rather than to look at properties and
calculate MD5 sums to decide whether to push ...

 

4) I could avoid properties entirely by calculating the MD5 and length
on the fly in ML ... however I believe both require serializing the
document in memory in ML.   The xdmp:md5() function takes a string, not a
document.  And there is no actual size method, so getting the length also
requires serializing the document.

The only way I can think of is to use xdmp:quote(doc(...)) and then
calculate the length and MD5 on the server.   My gut feeling is that
doing this is a very heavyweight operation on large files and would be
less efficient than just unconditionally pushing the document (except
maybe on very, very slow networks).

Also I'm not sure (in fact I strongly suspect it's NOT true) that an MD5
calculated on a file on local disk will match xdmp:md5(
xdmp:quote(doc(...))) for the same file, due to serialization
differences.   Same with length.  That would make this strategy pointless.
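
To illustrate the concern, here is a small Python sketch comparing two byte-for-byte different serializations of the same XML content. The `reserialized` bytes are a hypothetical example of what a server might emit (no XML declaration, expanded empty element); they are not actual xdmp:quote output, which I have not checked.

```python
import hashlib

# The file as it might sit on local disk: XML declaration, self-closing tag,
# trailing newline.
on_disk = b'<?xml version="1.0" encoding="UTF-8"?>\n<doc>\n  <title/>\n</doc>\n'

# Hypothetical re-serialization of the same document by a server.
reserialized = b"<doc>\n  <title></title>\n</doc>"


def fingerprint(data):
    """Return the (md5 hex digest, length in bytes) pair used for comparison."""
    return hashlib.md5(data).hexdigest(), len(data)


# Same logical document, but neither the checksum nor the length agrees.
print(fingerprint(on_disk))
print(fingerprint(reserialized))
print("match:", fingerprint(on_disk) == fingerprint(reserialized))
```

Any difference of this kind would mark every document as changed on every run, which is why storing the MD5 and length of the original bytes at upload time looks safer.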

 

 

 

 

-David

From: [email protected]
[mailto:[email protected]] On Behalf Of Lee, David
Sent: Friday, June 11, 2010 10:00 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Mac Webdav Client setting xqy files
as binary

 

I would LOVE help with this project.   (And yes, I just checked in an
update half an hour ago ... hate to point people at old code :)

I've been thinking of exactly what you're saying.  The only thing stopping
me, besides time, is that I haven't figured out how to make sure the clocks
are in sync and what the failure cases are if they are not.

 

What I've done in another project is to use an MD5 checksum.   There is
an undocumented (it's experimental) flag to put which adds a property with
an MD5 checksum.   xmlsh has an MD5 sum command
(http://www.xmlsh.org/CommandXmd5sum).

I generate a list of all documents with the MD5 sum,  compare against
local disk then update only changed files, propagating deletes, inserts,
and updates.   It worked great for one project ... but I have not
generalized this code yet ... 

 

I'm reluctant to blindly add properties to 'other people's files', so I
haven't made this into a general utility yet.

 

Discussion  greatly welcome ! (and help too ... )

-David

 

 

----------------------------------------

David A. Lee

Senior Principal Software Engineer

Epocrates, Inc.

[email protected]

812-482-5224

From: [email protected]
[mailto:[email protected]] On Behalf Of Mike
Brevoort
Sent: Friday, June 11, 2010 9:43 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Mac Webdav Client setting xqy files
as binary

 

Thanks David, That looks really cool. 

 

I was just looking at the code (which I see you are actively working
on; check-ins in the last several minutes :)  ) and it seems like it wouldn't
be too hard to create a sync option for rsync-like behavior (simpler,
obviously). Given a source (filesystem), a destination (MarkLogic DB
directory) and a depth (how far to recurse), we should be able to grab a
list of all of the files on the server, with their content-length and last
updated dateTime. Then we could compare against the source filesystem for
new/deleted files, and by size and date updated, to decide which files to
get and put.
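
A rough Python sketch of that size-and-date comparison, one-way only (filesystem to server). The function name `plan_by_size_and_date`, the dict layout of path to (size, last-updated), and the `tolerance` parameter are all assumptions for illustration; the tolerance is there because the two clocks may not agree exactly.

```python
from datetime import datetime, timedelta


def plan_by_size_and_date(source, dest, tolerance=timedelta(seconds=2)):
    """Decide per path what to do so that dest matches source.

    source and dest map path -> (size in bytes, last-updated datetime).
    """
    plan = {}
    for path in sorted(set(source) | set(dest)):
        if path not in dest:
            plan[path] = "put"       # new on the filesystem
        elif path not in source:
            plan[path] = "delete"    # removed from the filesystem
        else:
            src_size, src_time = source[path]
            dst_size, dst_time = dest[path]
            # Re-send when the size differs or the source is clearly newer.
            if src_size != dst_size or src_time - dst_time > tolerance:
                plan[path] = "put"
            else:
                plan[path] = "skip"
    return plan


if __name__ == "__main__":
    t = datetime(2010, 6, 11, 12, 0, 0)
    source = {"new.xqy": (10, t), "same.xqy": (20, t)}
    dest = {"same.xqy": (20, t), "gone.xqy": (5, t)}
    print(plan_by_size_and_date(source, dest))
```

Note that a same-size edit made within the tolerance window, or while the clocks disagree by more than that, would be missed by this test; a checksum comparison does not have that weakness.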

 

What do you think of that approach? I or someone on my team might be
willing to take a crack at this.

 

Also, what's required for others to run xmlsh on windows?

 

Thanks!

Mike

On Fri, Jun 11, 2010 at 6:19 AM, Lee, David <[email protected]> wrote:

You might want to consider the MarkLogic extension to xmlsh

http://www.xmlsh.org/ModuleMarkLogic

 

This includes a "put" command which works similarly to rsync (not quite
as good, as it doesn't handle minimal updates yet ... TBD)

 

http://www.xmlsh.org/MarkLogicPut

 

 

But I use it for scripting updates to modules.  It uses XDBC (XCC) not
WebDav.  You can set the file type explicitly (-t for text).

Or it uses the server default logic.

 

It's not as powerful as recordloader, but it's easier to use.

Example: I use this command to recursively copy my source .xquery file
tree to the modules DB

 

 

   ml:put -r -baseuri /App/ -maxfiles 10 -maxthreads 3 *

 

 

 

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Mike
Brevoort
Sent: Friday, June 11, 2010 12:20 AM
To: [email protected]
Subject: [MarkLogic Dev General] Mac Webdav Client setting xqy files
as binary

 

Hi,

 

So I know that webdav clients always seem to have quirks and I've heard
hearsay that the Mac webdav client has some problems when interfacing
with MarkLogic, but....

 

I have a modules database mounted via webdav on a mac. When I copy in an
XQuery file (test.xqy) via the native webdav client, the content type of
the file is being set to "binary", but if I use Cyberduck to move the
file, it's being set to "text". When the type is set to binary, it fails
to execute:

 

      <h1>500 Internal Server Error</h1>

      <dl>

        <dt> [1.0-ml]</dt>

        <dd>XDMP-TEXTNODE: /ctd/article.xqy -- Server unable to build
program from non-text document</dd>

        <dt>in /poc/article.xqy, on line 13 [1.0-ml]</dt>

        <dd>XDMP-UNDFUN: (err:XPST0017) Undefined function
comoms-article:getFields()</dd>

        <dt>in /poc/article.xqy, on line 15 [1.0-ml]</dt>

        <dd>XDMP-UNDFUN: (err:XPST0017) Undefined function
comoms-article:get()</dd>

        <dt>in /poc/article.xqy, on line 19 [1.0-ml]</dt>

        <dd>XDMP-UNDFUN: (err:XPST0017) Undefined function
comoms-article:post()</dd>

      </dl>

 

So two questions: is there anything I can do to affect how the Mac
client/MarkLogic deal with document types? Or if not, how can I convert
the document type via xquery? I'd really like to have the modules
database mountable so that I can use tools like rsync to move files (vs
a client like Cyberduck).

 

Thanks!

Mike


_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general




-- 
Mike Brevoort /  Enterprise Web Practice Manager /  Avalon Consulting
LLC /  303-834-7509 /  twitter:mbrevoort


