Here is what I've accomplished this weekend.  My next step in this process
is to scrub the foreign keys from the database.  The foreign keys follow
the pattern {table}_id, so, for instance, the field /xyz/usr_id would
reference the primary document /usr/id.  Leveraging this pattern, I want to
accomplish two things.  First, I want to convert the foreign key reference
to a normal form.  Second, I want to insert a reverse reference in the
primary document.

So the document:

<xyz>
  <id>456</id>
  <usr_id>123</usr_id>
</xyz>

becomes

<xyz>
  <id>456</id>
  <usr>
    <id>123</id>
  </usr>
</xyz>

and the primary document gets a back reference:

<usr>
  <id>123</id>
  <xyz>
    <id>456</id>
  </xyz>
</usr>

The next phase beyond this is to replace nested instances of documents
(those with a child id) with copies from the primary documents.  After two
or three passes the documents will be de-normalized, each pass adding more
depth to the hierarchy.  Finally, I'll trim away the document trees that
aren't useful, leaving the documents that provide the most logical
hierarchy.
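A single pass of that phase might look something like the sketch below.  I
haven't written or run this yet -- it's just my current thinking, assuming
a "stub" is an element whose only child is an id:

```xquery
(: denormalize-pass.xqy -- untested sketch of one de-normalization pass.
   For every stub reference (an element whose only child is an id),
   replace it with a copy of the matching primary document's root. :)
xquery version "1.0-ml";
for $stub in /*/*[id and count(*) = 1]
let $primary := /*[name(.) = name($stub)][id = $stub/id]
where $primary
return xdmp:node-replace($stub, $primary)
```

Each run would add one level of depth, and running it against the whole
database at once would presumably hit the same contention problem I
describe below, so it would need the same batching treatment.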

I wrote my first attempt at scrubbing the foreign keys.  It isn't
performing well at all: it has been running for 3 hours now and is only a
third of the way done.  I've never seen my computer work so hard.  I have
188,058 documents.  I spawned off 117 processes, one for each unique
foreign key in the database.  Each process invokes the code below.
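The driver that spawns those processes looks roughly like this ($keys
stands in for my actual sequence of <key/> elements):

```xquery
(: spawn-scrub.xqy -- sketch of the driver: one spawned task per
   foreign key, each running the module below with its own $key. :)
xquery version "1.0-ml";
declare variable $keys external;
for $key in $keys
return xdmp:spawn("scrub-foreignkeys.xqy", (xs:QName("key"), $key))
```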

I really hoped this process would have been quicker, and I'm looking for
ways to improve the performance.  One idea: instead of processing each
foreign key (which touches multiple primary and foreign documents at once),
I could iterate through the primary documents and process each one in
isolation.  I'm also thinking about inserting new documents into a separate
collection, since updating existing documents seems to be generating
contention for the same document.  I could also use a better understanding
of the transaction logic Geert presented earlier (processing 100 records in
a single transaction) and what impact that might have.
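My rough understanding of that batching idea is sketched below.  The batch
module name ("scrub-batch.xqy") and the semicolon-joined URI list are my
own placeholders, and cts:uris assumes the URI lexicon is enabled:

```xquery
(: batch-driver.xqy -- sketch: spawn one task per batch of 100 document
   URIs, so each spawned transaction updates 100 documents rather than
   one transaction touching everything at once. :)
xquery version "1.0-ml";
let $uris := cts:uris((), (),
  cts:element-query(xs:QName("xyz"), cts:and-query(())))
for $start in $uris[position() mod 100 = 1]
let $pos := index-of($uris, $start)[1]
return
  xdmp:spawn("scrub-batch.xqy",
    (xs:QName("uris"),
     string-join(subsequence($uris, $pos, 100), ";")))
```

The batch module would tokenize $uris on ";" and run the per-document
updates inside its single transaction.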

There is so much more to learn.  MarkLogic claims to be "The Operational
Database for Big Data".  I've got some big data.  Can MarkLogic operate
with it?


(: scrub-foreignkeys.xqy, a module on the filesystem in the app-server
   module root :)
xquery version "1.0-ml";

(: $key describes one foreign key, e.g.
   <key origin-table="tsk_usr" reference-table="tsk">tsk_id</key> :)
declare variable $key external;

(: Note: both of these scan every root element in the database, and each
   spawned process repeats the scan. :)
let $tables := /*[name(.) = $key/@reference-table]
let $locations := /*/*[name(.) = string($key)]
for $location in $locations
  (: the primary document this foreign key points at :)
  let $table := $tables[id = $location]
  let $parent-name := name($location/..)
  let $parent-id := $location/../id/text()
  (: normal-form reference that replaces the {table}_id field :)
  let $new-location-ref :=
    element {$key/@reference-table} {
      if ($key/@context) then attribute context {$key/@context} else (),
      element id {$location/text()}
    }
  (: back reference to insert into the primary document :)
  let $new-origin-ref :=
    element {$parent-name} {
      if ($key/@context) then attribute context {$key/@context} else (),
      element id {$parent-id}
    }
  return (
    if ($table and $parent-id and
        not($table/*[name(.) = $parent-name and id = $location]))
      then xdmp:node-insert-child($table, $new-origin-ref)
      else (),
    xdmp:node-replace($location, $new-location-ref),
    xdmp:log(concat("KEY:", $key, " CONTEXT:", $key/@context,
      " REFERENCE-TABLE:", $key/@reference-table,
      " LOCATION:", $location,
      " ORIGIN-TABLE-NAME:", $parent-name,
      " ORIGIN-TABLE-ID:", $parent-id))
  )

And the external variable $key will look like:

<key origin-table="tsk_usr" reference-table="tsk">tsk_id</key>

Thanks,
Todd Gochenour
Servicelogix.com
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general