I'm starting to need to import into MarkLogic (ML) a large set of data that comes from an RDBMS world and is fully normalized.
I have a "master" file, and in it many "keys" , and then many "reference" files. I'm starting to convert these files one by one to XML. There are 2 fundamental kinds of "keys" ... Parent/Child type keys, and Code/Value type keys. Parent/Child In the master file I have MASTER_ID,value,value,value .. Then I have a child file I have MASTER_ID,CHILD_ID,value,value,value MASTER_ID,CHILD_ID,value,value,value ... Where there can be 1 or more child records per master. This lends itself nicely to a XML structure like <MASTER> <CHILD> <CHILD> ... </MASTER> Then I have code/value type keys. For example in any of the records I might have a code "123" then there is a "code file" with values like 123,This is the definition of 123 123,This is the definition of 124 I think this is pretty much the standard form for fully normalized RDBMS data. In converting to a good XML structure for MarkLogic in particular ... I have a few choices 1) Simply convert each file to a single XML document. On demand, cross-link the related values 2) Combine the Master/Child type relations into a master XML document, but do not expand the Code/Value relations On demand look up the value for each code 3) Combine everything and produce a fully expanded tree as one document (or set of "master" documents depending on how I split it) #2 seems pretty obvious, in my case it wont cause any space explosion ... but it does suffer from the Update problem (when I update any of the child records these have to be re-combined in-mass) #3 will cause size exposing obviously, but will result in a much more readable and usable master document. In my case I have no real use for the "key" values although they could be preserved as attributes. In this case I'm dealing with data in the 100MB range (before denormalizing it). I would appreciate any comments ... I'm sure this is a common problem. In ML in particular are there any great advantages or disadvantages to the different approaches ? Search, for instance. If I dont denormalize, then I could more easily constrain searches to NOT include the expanded key/value values (just search the master doc). OTOH maybe I WANT to search the expanded values ... Speed of rendering the final document is certianly effected, but I suspect not by much ... ? might depend on how I fragment the files. Suggestions welcome ! ---------------------------------------- David A. Lee Senior Principal Software Engineer Epocrates, Inc. [email protected] <mailto:[email protected]> 812-482-5224
