I'm starting to need to import into ML a large set of data that comes
from an RDBMS world and is fully normalized.

I have a "master" file, and in it many "keys", and then many
"reference" files.

I'm starting to convert these files one by one to XML.

 

There are two fundamental kinds of "keys": Parent/Child type keys,
and Code/Value type keys.

 

Parent/Child

 

In the master file I have

 

MASTER_ID,value,value,value ..

 

Then I have a child file like

MASTER_ID,CHILD_ID,value,value,value

MASTER_ID,CHILD_ID,value,value,value

 

...

 

 

Where there can be 1 or more child records per master.  

This lends itself nicely to an XML structure like

 

<MASTER>
   <CHILD/>
   <CHILD/>
   ...
</MASTER>
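As a minimal sketch of that Parent/Child join (the sample data, file
contents, and element names here are hypothetical, not from the actual
files), the nesting can be produced with Python's standard csv and
ElementTree modules:

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-ins for the real master and child CSV files.
master_csv = "M1,alpha\nM2,beta\n"
child_csv = "M1,C1,one\nM1,C2,two\nM2,C3,three\n"

# Group child rows by their MASTER_ID foreign key.
children = defaultdict(list)
for master_id, child_id, value in csv.reader(io.StringIO(child_csv)):
    children[master_id].append((child_id, value))

# Emit one <MASTER> element per master row, nesting its <CHILD> rows.
docs = []
for master_id, value in csv.reader(io.StringIO(master_csv)):
    master = ET.Element("MASTER", id=master_id)
    ET.SubElement(master, "VALUE").text = value
    for child_id, child_value in children[master_id]:
        child = ET.SubElement(master, "CHILD", id=child_id)
        ET.SubElement(child, "VALUE").text = child_value
    docs.append(ET.tostring(master, encoding="unicode"))
```

Each entry in `docs` could then be loaded into ML as its own document,
one per master record.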

 

 

 

Then I have code/value type keys.  For example, in any of the records I
might have a code "123"; then there is a "code file" with values like

 

123,This is the definition of 123

124,This is the definition of 124

 

 

I think this is pretty much the standard form for fully normalized RDBMS
data.

In converting to a good XML structure for MarkLogic in particular ... I
have a few choices:

 

 

1)      Simply convert each file to a single XML document.
On demand, cross-link the related values

2)      Combine the Master/Child type relations into a master XML
document, but do not expand the Code/Value relations
On demand look up the value for each code

3)      Combine everything and produce a fully expanded tree as one
document (or set of "master" documents depending on how I split it)
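For the code/value side of #3, the expansion is just a dictionary lookup
at conversion time. A minimal Python sketch (sample data and element
names are hypothetical), which keeps the original key as an attribute
rather than discarding it:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical code file: CODE,definition (one row per code).
code_csv = ("123,This is the definition of 123\n"
            "124,This is the definition of 124\n")
codes = dict(csv.reader(io.StringIO(code_csv)))

def expand(code: str) -> ET.Element:
    # Inline the looked-up definition, preserving the key as an
    # attribute so searches/updates can still find it.
    elem = ET.Element("DEFINITION", code=code)
    elem.text = codes[code]
    return elem
```

So `expand("123")` serializes to
`<DEFINITION code="123">This is the definition of 123</DEFINITION>`,
which is what would be inlined into the master document under option 3.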

 

 

#2 seems pretty obvious; in my case it won't cause any space explosion
... but it does suffer from the update problem (when I update any of the
child records, these have to be re-combined en masse).

#3 will cause size explosion obviously, but will result in a much more
readable and usable master document.   In my case I have no real use for
the "key" values, although they could be preserved as attributes.

 

In this case I'm dealing with data in the 100MB range (before
denormalizing it).

 

I would appreciate any comments ... I'm sure this is a common problem.
In ML in particular, are there any great advantages or disadvantages to
the different approaches?

 

Search, for instance.  If I don't denormalize, then I could more easily
constrain searches to NOT include the expanded key/value values (just
search the master doc).

OTOH maybe I WANT to search the expanded values ... 

 

Speed of rendering the final document is certainly affected, but I
suspect not by much ... ? It might depend on how I fragment the files.

 

Suggestions welcome !


----------------------------------------

David A. Lee

Senior Principal Software Engineer

Epocrates, Inc.

[email protected]

812-482-5224

 

 

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
