Hi,
I have recently started a project using ActiveWarehouse and ActiveWarehouse ETL. When I chose these tools I knew they were very early in their implementation; however, I wanted the flexibility that ruby offers so I decided that using a "new" technology was worth it. Now that I have completed a prototype of my project, I have a lot of questions about the architecture of both ActiveWarehouse and ActiveWarehouse ETL. I have been thinking quite a lot about how I would design such tools. I am curious if my ideas have been considered already, and if so why the current implementation was chosen. If my ideas have not been considered before, then I would like to hear some feedback from the community before I start implementing them. Before I get into the (lengthy) list of ideas I have, I should give you some background on where these ideas are coming from. I have been working at a foreign exchange company since I finished my commerce degree 2 years ago. Most of my time there has been spent leading the development of their (small) data warehouse. That DW is entirely built on Microsoft products. Until I started using ActiveWarehouse, the Microsoft tools were my only in-depth exposure to ETL tools. In case you haven't used SSIS and SSAS, those products have many great selling features; however there are many fatal flaws that contradict all the best practices that the rails community encourages. DRY, good version control, and even the most basic testing are all nearly impossible to do well with SSIS. On top of this, very few SSIS developers seem to care!!! (I know some do.) I want to be involved with a development community that encourages best practices, so I am trying to avoid Microsoft's DW tools. This (combined with what I said before about the flexibility of ruby) brings me to ActiveWarehouse-ActiveWarehouse ETL specifically. So here are my 'ideas' for how the ideal ETL tool should be designed. Please keep in mind that these are just ideas, not recommendations. I would really like to generate some good debate. Then, if we can agree on something I will start working on an implementation. How does an ETL process (or a Data Warehouse in general) fit into the MVC pattern? Dimensions and Facts are obviously models. But what about the ctl file that fills in a dimension table using a csv and a few transforms? It looks like a controller to me, but maybe it is just a really complex constructor method for a model (i.e. def MyDimension.new(csv); insert complex code; end). What are the arguments against fitting a DW into the MVC pattern? Environments. I LOVE the different environments in rails. I especially love how I don't have to think about them. They just work. Any ideas on how this should/could be implemented? Testing. This fits right in with environments. I have had several headaches from trying to test my ETL processes. I have also recently fallen in love with RSpec. It should be really easy to stub/mock the various parts of the ETL code: in, out, sk lookups, etc. How do you test your code? Readability. Any thoughts on how to make the ETL code more readable? I know it is fairly readable know, but it isn't as readable as ruby/rails code can be. SK Lookups. These should be class methods of the dimension model (think specialized finders). The id of the dimension is the surrogate key. This (as well as the business key) are stored in the dimension table. Therefore, the dimension model should handle this logic, right? Maybe we can get an sk_lookup method in the ActiveWarehouse::Dimension class. Then we could do this in our ETL: MyDimension.sk_lookup(bk) to get the sk. Or we could do MyDimension.sk_lookup(bk, effective_date) for SCDs. SCDs. This is another place where the main logic should be in the model. I want to be able to hand the data to my dimension and it will insert and/or update the table accordingly. The ETL process can handle getting the data into the dimensional format because it may come from many different places, but SCD handling will be the same for all data sources and can be handled by the model. Convention Over Configuration. I think there is a ton of room for time-saving conventions, especially with regards to SCDs. start_date, end_date, and enabled columns should be used to track the effective dates and which is the current row. I would argue that these columns should be created automatically (like id). bk - there should be a convention for naming the business key. I think the column should be required-though the data type may vary, I can't think of a dimension that doesn't have a business key. All columns (except id, bk, start_date, end_date, and enabled) are assumed to be type 1 SCDs. This could be overridden by something like `default_scd :historic`. However, columns prefixed by -'current_' => type 1 - the current value -'first_' => never change - the first value -'historic_' => type 2 - the value valid between the start and end dates -'previous_' => the value valid just before the current value -'current_minus_number_' => the value valid 'number' before the current value -'first_plus_number_' => the value valid 'number' after the first value The validate method of the dimension model should ensure that a particular record does not violate the SCD rules (i.e. no overlapping start_dates/end_dates, etc.) There should also be some support for source systems that already have some form of history tracking (like acts_as_versioned). Ok, so those are my thoughts. Please remember, when I say 'should', I don't mean that ActiveWarehouse ETL must be done this way. These are all ideas that have come to me over the past while in working on data warehousing. I think they might have some merit, but I want to discuss them with the community before I start flushing them out and working on any of them. I am really excited about helping to develop ActiveWarehouse. I look forward to your comments (both positive and negative) Paul
_______________________________________________ Activewarehouse-discuss mailing list Activewarehouse-discuss@rubyforge.org http://rubyforge.org/mailman/listinfo/activewarehouse-discuss