David,

When thinking about relational data in ML, I find it is best to think about a 
document corresponding to a row, rather than a document corresponding to a 
table. In your other thread, you have something more like a document 
corresponding to a table: 20,000 records in one document.

You should denormalize your relational records fully, as suggested in option 3. 
Option 2 would work almost as well, but I don't see a reason not to denormalize 
unless you are planning on frequent updates. The upside to denormalization is 
that you do not need to perform joins, and as you say, from a development and 
readability perspective, it is a lot easier to look at a document like a "view" 
of all the data you need related to a particular record. The downsides are that 
1) you may take up slightly more space, and 2) updates may be more expensive to 
process. Since your data set is very small, I don't imagine the first one 
matters (and even in large data sets, the cost of storage can be trivial 
compared to the performance benefits). If you are planning for lots of updates, 
then we should talk about whether you want to optimize for updates or 
reads/searches.
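
To make the "document per row" idea concrete, here is a minimal sketch in Python (standard library only) that splits one table-like XML document into one small document per record. The element names, the `id` attribute, and the `/records/` URI prefix are made up for illustration; this only prepares the documents and does not call any MarkLogic API.

```python
import xml.etree.ElementTree as ET


def split_into_row_documents(table_xml, uri_prefix="/records/"):
    """Return {uri: xml_string} with one entry per child of the root element."""
    root = ET.fromstring(table_xml)
    documents = {}
    for position, record in enumerate(root, start=1):
        # Prefer an explicit id attribute; fall back to the row position.
        record_id = record.get("id", str(position))
        uri = f"{uri_prefix}{record_id}.xml"
        documents[uri] = ET.tostring(record, encoding="unicode")
    return documents


# Hypothetical input: one document holding the whole "table".
table = (
    "<records>"
    "<record id='1'><name>a</name></record>"
    "<record id='2'><name>b</name></record>"
    "</records>"
)
docs = split_into_row_documents(table)
```

Each entry in `docs` could then be loaded as its own document, so a search or update touches one record rather than all 20,000.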

Kelly

Message: 3
Date: Sun, 22 Nov 2009 09:08:56 -0800
From: "Lee, David" <[email protected]>
Subject: [MarkLogic Dev General] To Normalize or Combine ? A philosophical design question
To: <[email protected]>
Message-ID: <dd37f70d78609d4e9587d473fc61e0a714055...@postoffice>
Content-Type: text/plain; charset="us-ascii"

I'm starting to need to import to ML a large set of data that comes from an 
RDBMS world and is fully normalized.

I have a "master"  file, and in it many "keys" , and then many "reference" 
files.

I'm starting to convert these files one by one to XML.

 

There are 2 fundamental kinds of "keys" ...  Parent/Child type keys, and 
Code/Value type keys.

 

Parent/Child

 

In the master file I have

 

MASTER_ID,value,value,value ..

 

Then, in a child file, I have

MASTER_ID,CHILD_ID,value,value,value

MASTER_ID,CHILD_ID,value,value,value

 

...

 

 

Where there can be 1 or more child records per master.  

This lends itself nicely to an XML structure like

 

<MASTER>
   <CHILD> ... </CHILD>
   <CHILD> ... </CHILD>
   ...
</MASTER>
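
As a rough sketch of the parent/child combination described above, the following Python (standard library only) groups child rows by MASTER_ID and nests them under their master. The sample rows, the `id` attributes, and the generic `value` element name are assumptions for illustration, not the real file layout.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data in the MASTER_ID,... and MASTER_ID,CHILD_ID,... shapes.
MASTER_CSV = "M1,alpha\nM2,beta\n"
CHILD_CSV = "M1,C1,red\nM1,C2,green\nM2,C3,blue\n"


def build_master_elements(master_csv, child_csv):
    """Return one <MASTER> element per master row, with its children nested."""
    children_by_master = defaultdict(list)
    for row in csv.reader(io.StringIO(child_csv)):
        if not row:
            continue  # skip blank lines
        master_id, child_id, *values = row
        children_by_master[master_id].append((child_id, values))

    masters = []
    for row in csv.reader(io.StringIO(master_csv)):
        if not row:
            continue
        master_id, *values = row
        master = ET.Element("MASTER", id=master_id)
        for value in values:
            ET.SubElement(master, "value").text = value
        for child_id, child_values in children_by_master[master_id]:
            child = ET.SubElement(master, "CHILD", id=child_id)
            for value in child_values:
                ET.SubElement(child, "value").text = value
        masters.append(master)
    return masters


masters = build_master_elements(MASTER_CSV, CHILD_CSV)
```

Indexing the child rows once keeps the combination to a single pass over each file instead of a scan of the child file per master.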

 

 

 

Then I have code/value type keys.  For example, any of the records might 
contain a code "123", and there is a "code file" with values like

 

123,This is the definition of 123

124,This is the definition of 124

 

 

I think this is pretty much the standard form for fully normalized RDBMS data.

In converting to a good XML structure for MarkLogic in particular ... I have a 
few choices

 

 

1)      Simply convert each file to a single XML document.
On demand, cross-link the related values.

2)      Combine the Master/Child type relations into a master XML
document, but do not expand the Code/Value relations. On demand, look up the 
value for each code.

3)      Combine everything and produce a fully expanded tree as one
document (or set of "master" documents, depending on how I split it).

 

 

#2 seems pretty obvious; in my case it won't cause any space explosion ... but 
it does suffer from the update problem (when I update any of the child records, 
these have to be re-combined en masse).

#3 will obviously cause a size explosion, but will result in a much more
readable and usable master document.   In my case I have no real use for
the "key" values, although they could be preserved as attributes.
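
A minimal sketch of the code/value expansion in option 3, keeping the original key as an attribute, might look like the following Python (standard library only). The `code` element name, the `key` attribute, and the sample lookup table are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of the "code file", loaded into a lookup table.
CODE_TABLE = {
    "123": "This is the definition of 123",
    "124": "This is the definition of 124",
}


def expand_codes(element, code_table, tag="code"):
    """Replace each known code with its definition; keep the code as @key."""
    for code_el in element.iter(tag):
        key = (code_el.text or "").strip()
        if key in code_table:
            code_el.set("key", key)
            code_el.text = code_table[key]
        # Unknown codes are left untouched so nothing is silently lost.
    return element


master = ET.fromstring(
    "<MASTER>"
    "<CHILD><code>123</code></CHILD>"
    "<CHILD><code>999</code></CHILD>"
    "</MASTER>"
)
expand_codes(master, CODE_TABLE)
```

Keeping the key as an attribute means the expansion can be redone later if a definition in the code file changes.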

In this case I'm dealing with data in the 100MB range (before denormalizing it).

I would appreciate any comments ... I'm sure this is a common problem.
In ML in particular, are there any great advantages or disadvantages to the 
different approaches?

Search, for instance.  If I don't denormalize, then I could more easily 
constrain searches to NOT include the expanded key/value values (just search 
the master doc).

OTOH maybe I WANT to search the expanded values ... 

Speed of rendering the final document is certainly affected, but I suspect not 
by much ... ? It might depend on how I fragment the files.

Suggestions welcome !

----------------------------------------

David A. Lee

Senior Principal Software Engineer

Epocrates, Inc.

[email protected] <mailto:[email protected]> 

812-482-5224
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
