David,

When thinking about relational data in MarkLogic, I find it best to think of a document as corresponding to a row rather than to a table. In your other thread, you have something closer to one document per table: 20,000 records in a single document.
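To make the document-per-row idea concrete, here is a minimal sketch. It is written in Python rather than XQuery purely for illustration, and the sample rows, field names, and the build_master_document helper are all hypothetical stand-ins for the real files. Each master row becomes its own self-contained document, with its child rows nested inside it and its code expanded to the corresponding definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the normalized source files.
masters = [
    {"MASTER_ID": "1", "name": "first", "code": "123"},
    {"MASTER_ID": "2", "name": "second", "code": "124"},
]
children = [
    {"MASTER_ID": "1", "CHILD_ID": "10", "value": "a"},
    {"MASTER_ID": "1", "CHILD_ID": "11", "value": "b"},
    {"MASTER_ID": "2", "CHILD_ID": "20", "value": "c"},
]
codes = {
    "123": "This is the definition of 123",
    "124": "This is the definition of 124",
}


def build_master_document(master, children, codes):
    """Build one XML document for a single master row (hypothetical helper)."""
    root = ET.Element("MASTER", id=master["MASTER_ID"])
    ET.SubElement(root, "name").text = master["name"]

    # Expand the code inline, keeping the original key as an attribute.
    code_el = ET.SubElement(root, "code", key=master["code"])
    code_el.text = codes.get(master["code"], "")

    # Nest every child row that belongs to this master.
    for child in children:
        if child["MASTER_ID"] == master["MASTER_ID"]:
            child_el = ET.SubElement(root, "CHILD", id=child["CHILD_ID"])
            ET.SubElement(child_el, "value").text = child["value"]
    return root


# One document per master row, keyed by MASTER_ID.
documents = {
    m["MASTER_ID"]: build_master_document(m, children, codes) for m in masters
}

for doc_id, doc in documents.items():
    print(doc_id, ET.tostring(doc, encoding="unicode"))
```

Each resulting document can then be loaded and searched on its own, without consulting the child or code files at query time.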
You should fully denormalize your relational records, as suggested in option 3. Option 2 would work almost as well, but I don't see a reason not to denormalize unless you are planning on frequent updates.

The upside of denormalization is that you do not need to perform joins and, as you say, from a development and readability perspective it is much easier to work with a document that acts as a "view" of all the data you need for a particular record. The downsides are that 1) you may take up slightly more space, and 2) updates may be more expensive to process. Since your data set is very small, I don't imagine the first one matters (and even with large data sets, the cost of storage can be trivial compared to the performance benefits). If you are planning for lots of updates, then we should talk about whether you want to optimize for updates or for reads/searches.

Kelly

Message: 3
Date: Sun, 22 Nov 2009 09:08:56 -0800
From: "Lee, David" <[email protected]>
Subject: [MarkLogic Dev General] To Normalize or Combine ? A philosophical design question
To: <[email protected]>
Message-ID: <dd37f70d78609d4e9587d473fc61e0a714055...@postoffice>
Content-Type: text/plain; charset="us-ascii"

I'm starting to need to import into ML a large set of data that comes from an RDBMS world and is fully normalized. I have a "master" file containing many "keys", and then many "reference" files. I'm starting to convert these files one by one to XML.

There are 2 fundamental kinds of "keys": Parent/Child type keys and Code/Value type keys.

Parent/Child

In the master file I have

    MASTER_ID,value,value,value
    ..

Then in a child file I have

    MASTER_ID,CHILD_ID,value,value,value
    MASTER_ID,CHILD_ID,value,value,value
    ...

where there can be 1 or more child records per master. This lends itself nicely to an XML structure like

    <MASTER>
        <CHILD>
        <CHILD>
        ...
    </MASTER>

Then I have code/value type keys.
For example, in any of the records I might have a code "123", and then there is a "code file" with values like

    123,This is the definition of 123
    124,This is the definition of 124

I think this is pretty much the standard form for fully normalized RDBMS data.

In converting to a good XML structure, for MarkLogic in particular, I have a few choices:

1) Simply convert each file to a single XML document. On demand, cross-link the related values.
2) Combine the Master/Child type relations into a master XML document, but do not expand the Code/Value relations. On demand, look up the value for each code.
3) Combine everything and produce a fully expanded tree as one document (or a set of "master" documents, depending on how I split it).

#2 seems pretty obvious. In my case it won't cause any space explosion, but it does suffer from the update problem (when I update any of the child records, these have to be re-combined en masse).

#3 will obviously cause a size explosion, but will result in a much more readable and usable master document. In my case I have no real use for the "key" values, although they could be preserved as attributes.

In this case I'm dealing with data in the 100MB range (before denormalizing it).

I would appreciate any comments. I'm sure this is a common problem. In ML in particular, are there any great advantages or disadvantages to the different approaches?

Search, for instance. If I don't denormalize, then I could more easily constrain searches to NOT include the expanded key/value values (just search the master doc). OTOH, maybe I WANT to search the expanded values.

Speed of rendering the final document is certainly affected, but I suspect not by much? It might depend on how I fragment the files.

Suggestions welcome!

----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
