Paul, Thanks for the deep comments on this type of stuff. I wanted to let you know that I do plan on responding, however at the moment I am extremely busy, so please be patient and I will provide my input soon.
V/r Anthony Eden On Feb 5, 2008 7:59 PM, Paul Cortens <[EMAIL PROTECTED]> wrote: > > > > > Hi, > > > > I have recently started a project using ActiveWarehouse and ActiveWarehouse > ETL. When I chose these tools I knew they were very early in their > implementation; however, I wanted the flexibility that ruby offers so I > decided that using a "new" technology was worth it. > > > > Now that I have completed a prototype of my project, I have a lot of > questions about the architecture of both ActiveWarehouse and ActiveWarehouse > ETL. I have been thinking quite a lot about how I would design such tools. I > am curious if my ideas have been considered already, and if so why the > current implementation was chosen. If my ideas have not been considered > before, then I would like to hear some feedback from the community before I > start implementing them. > > > > Before I get into the (lengthy) list of ideas I have, I should give you some > background on where these ideas are coming from. > > > > I have been working at a foreign exchange company since I finished my > commerce degree 2 years ago. Most of my time there has been spent leading > the development of their (small) data warehouse. That DW is entirely built > on Microsoft products. Until I started using ActiveWarehouse, the Microsoft > tools were my only in-depth exposure to ETL tools. In case you haven't used > SSIS and SSAS, those products have many great selling features; however > there are many fatal flaws that contradict all the best practices that the > rails community encourages. DRY, good version control, and even the most > basic testing are all nearly impossible to do well with SSIS. On top of > this, very few SSIS developers seem to care!!! (I know some do.) > > > > I want to be involved with a development community that encourages best > practices, so I am trying to avoid Microsoft's DW tools. This (combined with > what I said before about the flexibility of ruby) brings me to > ActiveWarehouse—ActiveWarehouse ETL specifically. > > > > So here are my 'ideas' for how the ideal ETL tool should be designed. Please > keep in mind that these are just ideas, not recommendations. I would really > like to generate some good debate. Then, if we can agree on something I will > start working on an implementation. > > > > > > How does an ETL process (or a Data Warehouse in general) fit into the MVC > pattern? Dimensions and Facts are obviously models. But what about the ctl > file that fills in a dimension table using a csv and a few transforms? It > looks like a controller to me, but maybe it is just a really complex > constructor method for a model (i.e. def MyDimension.new(csv); insert > complex code; end). What are the arguments against fitting a DW into the MVC > pattern? > > > > > > Environments. I LOVE the different environments in rails. I especially love > how I don't have to think about them. They just work. Any ideas on how this > should/could be implemented? > > > > > > Testing… This fits right in with environments. I have had several headaches > from trying to test my ETL processes. I have also recently fallen in love > with RSpec. It should be really easy to stub/mock the various parts of the > ETL code: in, out, sk lookups, etc. How do you test your code? > > > > > > Readability. Any thoughts on how to make the ETL code more readable? I know > it is fairly readable know, but it isn't as readable as ruby/rails code can > be. > > > > > > SK Lookups. These should be class methods of the dimension model (think > specialized finders). The id of the dimension is the surrogate key. This (as > well as the business key) are stored in the dimension table. Therefore, the > dimension model should handle this logic, right? Maybe we can get an > sk_lookup method in the ActiveWarehouse::Dimension class. Then we could do > this in our ETL: MyDimension.sk_lookup(bk) to get the sk. Or we could do > MyDimension.sk_lookup(bk, effective_date) for SCDs. > > > > SCDs. This is another place where the main logic should be in the model. I > want to be able to hand the data to my dimension and it will insert and/or > update the table accordingly. The ETL process can handle getting the data > into the dimensional format because it may come from many different places, > but SCD handling will be the same for all data sources and can be handled by > the model. > > > > Convention Over Configuration. I think there is a ton of room for > time-saving conventions, especially with regards to SCDs. > > > > start_date, end_date, and enabled columns should be used to track the > effective dates and which is the current row. I would argue that these > columns should be created automatically (like id). > > bk – there should be a convention for naming the business key. I think the > column should be required—though the data type may vary, I can't think of a > dimension that doesn't have a business key. > > All columns (except id, bk, start_date, end_date, and enabled) are assumed > to be type 1 SCDs. This could be overridden by something like `default_scd > :historic`. > > However, columns prefixed by > > -'current_' => type 1 - the current value > > -'first_' => never change - the first value > > -'historic_' => type 2 - the value valid between the start and end dates > > -'previous_' => the value valid just before the current value > > -'current_minus_number_' => the value valid 'number' before the current > value > > -'first_plus_number_' => the value valid 'number' after the first value > > > > The validate method of the dimension model should ensure that a particular > record does not violate the SCD rules (i.e. no overlapping > start_dates/end_dates, etc.) > > > > There should also be some support for source systems that already have some > form of history tracking (like acts_as_versioned). > > > > > > > > Ok, so those are my thoughts. Please remember, when I say 'should', I don't > mean that ActiveWarehouse ETL must be done this way. These are all ideas > that have come to me over the past while in working on data warehousing. I > think they might have some merit, but I want to discuss them with the > community before I start flushing them out and working on any of them. > > > > I am really excited about helping to develop ActiveWarehouse. I look forward > to your comments (both positive and negative) > > > > Paul > > > > > > > > > _______________________________________________ > Activewarehouse-discuss mailing list > Activewarehouse-discuss@rubyforge.org > http://rubyforge.org/mailman/listinfo/activewarehouse-discuss > > _______________________________________________ Activewarehouse-discuss mailing list Activewarehouse-discuss@rubyforge.org http://rubyforge.org/mailman/listinfo/activewarehouse-discuss