Paul,

Thanks for the deep comments on this type of stuff. I wanted to let
you know that I do plan on responding, however at the moment I am
extremely busy, so please be patient and I will provide my input soon.

V/r
Anthony Eden

On Feb 5, 2008 7:59 PM, Paul Cortens <[EMAIL PROTECTED]> wrote:
>
>
>
>
> Hi,
>
>
>
> I have recently started a project using ActiveWarehouse and ActiveWarehouse
> ETL. When I chose these tools I knew they were very early in their
> implementation; however, I wanted the flexibility that ruby offers so I
> decided that using a "new" technology was worth it.
>
>
>
> Now that I have completed a prototype of my project, I have a lot of
> questions about the architecture of both ActiveWarehouse and ActiveWarehouse
> ETL. I have been thinking quite a lot about how I would design such tools. I
> am curious if my ideas have been considered already, and if so why the
> current implementation was chosen. If my ideas have not been considered
> before, then I would like to hear some feedback from the community before I
> start implementing them.
>
>
>
> Before I get into the (lengthy) list of ideas I have, I should give you some
> background on where these ideas are coming from.
>
>
>
> I have been working at a foreign exchange company since I finished my
> commerce degree 2 years ago. Most of my time there has been spent leading
> the development of their (small) data warehouse. That DW is entirely built
> on Microsoft products. Until I started using ActiveWarehouse, the Microsoft
> tools were my only in-depth exposure to ETL tools. In case you haven't used
> SSIS and SSAS, those products have many great selling features; however
> there are many fatal flaws that contradict all the best practices that the
> rails community encourages. DRY, good version control, and even the most
> basic testing are all nearly impossible to do well with SSIS. On top of
> this, very few SSIS developers seem to care!!! (I know some do.)
>
>
>
> I want to be involved with a development community that encourages best
> practices, so I am trying to avoid Microsoft's DW tools. This (combined with
> what I said before about the flexibility of ruby) brings me to
> ActiveWarehouse—ActiveWarehouse ETL specifically.
>
>
>
> So here are my 'ideas' for how the ideal ETL tool should be designed. Please
> keep in mind that these are just ideas, not recommendations. I would really
> like to generate some good debate. Then, if we can agree on something I will
> start working on an implementation.
>
>
>
>
>
> How does an ETL process (or a Data Warehouse in general) fit into the MVC
> pattern? Dimensions and Facts are obviously models. But what about the ctl
> file that fills in a dimension table using a csv and a few transforms? It
> looks like a controller to me, but maybe it is just a really complex
> constructor method for a model (i.e. def MyDimension.new(csv); insert
> complex code; end). What are the arguments against fitting a DW into the MVC
> pattern?
>
>
>
>
>
> Environments. I LOVE the different environments in rails. I especially love
> how I don't have to think about them. They just work. Any ideas on how this
> should/could be implemented?
>
>
>
>
>
> Testing… This fits right in with environments. I have had several headaches
> from trying to test my ETL processes. I have also recently fallen in love
> with RSpec. It should be really easy to stub/mock the various parts of the
> ETL code: in, out, sk lookups, etc. How do you test your code?
>
>
>
>
>
> Readability. Any thoughts on how to make the ETL code more readable? I know
> it is fairly readable know, but it isn't as readable as ruby/rails code can
> be.
>
>
>
>
>
> SK Lookups. These should be class methods of the dimension model (think
> specialized finders). The id of the dimension is the surrogate key. This (as
> well as the business key) are stored in the dimension table. Therefore, the
> dimension model should handle this logic, right? Maybe we can get an
> sk_lookup method in the ActiveWarehouse::Dimension class. Then we could do
> this in our ETL: MyDimension.sk_lookup(bk) to get the sk. Or we could do
> MyDimension.sk_lookup(bk, effective_date) for SCDs.
>
>
>
> SCDs. This is another place where the main logic should be in the model. I
> want to be able to hand the data to my dimension and it will insert and/or
> update the table accordingly. The ETL process can handle getting the data
> into the dimensional format because it may come from many different places,
> but SCD handling will be the same for all data sources and can be handled by
> the model.
>
>
>
> Convention Over Configuration. I think there is a ton of room for
> time-saving conventions, especially with regards to SCDs.
>
>
>
> start_date, end_date, and enabled columns should be used to track the
> effective dates and which is the current row. I would argue that these
> columns should be created automatically (like id).
>
> bk – there should be a convention for naming the business key. I think the
> column should be required—though the data type may vary, I can't think of a
> dimension that doesn't have a business key.
>
> All columns (except id, bk, start_date, end_date, and enabled) are assumed
> to be type 1 SCDs. This could be overridden by something like `default_scd
> :historic`.
>
> However, columns prefixed by
>
> -'current_' => type 1 - the current value
>
> -'first_' => never change - the first value
>
> -'historic_' => type 2 - the value valid between the start and end dates
>
> -'previous_' => the value valid just before the current value
>
> -'current_minus_number_' => the value valid 'number' before the current
> value
>
> -'first_plus_number_' => the value valid 'number' after the first value
>
>
>
> The validate method of the dimension model should ensure that a particular
> record does not violate the SCD rules (i.e. no overlapping
> start_dates/end_dates, etc.)
>
>
>
> There should also be some support for source systems that already have some
> form of history tracking (like acts_as_versioned).
>
>
>
>
>
>
>
> Ok, so those are my thoughts. Please remember, when I say 'should', I don't
> mean that ActiveWarehouse ETL must be done this way. These are all ideas
> that have come to me over the past while in working on data warehousing. I
> think they might have some merit, but I want to discuss them with the
> community before I start flushing them out and working on any of them.
>
>
>
> I am really excited about helping to develop ActiveWarehouse. I look forward
> to your comments (both positive and negative)
>
>
>
> Paul
>
>
>
>
>
>
>
>
> _______________________________________________
> Activewarehouse-discuss mailing list
> Activewarehouse-discuss@rubyforge.org
> http://rubyforge.org/mailman/listinfo/activewarehouse-discuss
>
>
_______________________________________________
Activewarehouse-discuss mailing list
Activewarehouse-discuss@rubyforge.org
http://rubyforge.org/mailman/listinfo/activewarehouse-discuss

Reply via email to