Hi,

 

I have recently started a project using ActiveWarehouse and ActiveWarehouse
ETL. When I chose these tools I knew they were very early in their
implementation; however, I wanted the flexibility that ruby offers so I
decided that using a "new" technology was worth it.

 

Now that I have completed a prototype of my project, I have a lot of
questions about the architecture of both ActiveWarehouse and ActiveWarehouse
ETL. I have been thinking quite a lot about how I would design such tools. I
am curious if my ideas have been considered already, and if so why the
current implementation was chosen. If my ideas have not been considered
before, then I would like to hear some feedback from the community before I
start implementing them.

 

Before I get into the (lengthy) list of ideas I have, I should give you some
background on where these ideas are coming from.

 

I have been working at a foreign exchange company since I finished my
commerce degree 2 years ago. Most of my time there has been spent leading
the development of their (small) data warehouse. That DW is entirely built
on Microsoft products. Until I started using ActiveWarehouse, the Microsoft
tools were my only in-depth exposure to ETL tools. In case you haven't used
SSIS and SSAS, those products have many great selling features; however
there are many fatal flaws that contradict all the best practices that the
rails community encourages. DRY, good version control, and even the most
basic testing are all nearly impossible to do well with SSIS. On top of
this, very few SSIS developers seem to care!!! (I know some do.)

 

I want to be involved with a development community that encourages best
practices, so I am trying to avoid Microsoft's DW tools. This (combined with
what I said before about the flexibility of ruby) brings me to
ActiveWarehouse-ActiveWarehouse ETL specifically.

 

So here are my 'ideas' for how the ideal ETL tool should be designed. Please
keep in mind that these are just ideas, not recommendations. I would really
like to generate some good debate. Then, if we can agree on something I will
start working on an implementation.

 

 

How does an ETL process (or a Data Warehouse in general) fit into the MVC
pattern? Dimensions and Facts are obviously models. But what about the ctl
file that fills in a dimension table using a csv and a few transforms? It
looks like a controller to me, but maybe it is just a really complex
constructor method for a model (i.e. def MyDimension.new(csv); insert
complex code; end). What are the arguments against fitting a DW into the MVC
pattern?

 

 

Environments. I LOVE the different environments in rails. I especially love
how I don't have to think about them. They just work. Any ideas on how this
should/could be implemented?

 

 

Testing. This fits right in with environments. I have had several headaches
from trying to test my ETL processes. I have also recently fallen in love
with RSpec. It should be really easy to stub/mock the various parts of the
ETL code: in, out, sk lookups, etc. How do you test your code?

 

 

Readability. Any thoughts on how to make the ETL code more readable? I know
it is fairly readable know, but it isn't as readable as ruby/rails code can
be.

 

 

SK Lookups. These should be class methods of the dimension model (think
specialized finders). The id of the dimension is the surrogate key. This (as
well as the business key) are stored in the dimension table. Therefore, the
dimension model should handle this logic, right? Maybe we can get an
sk_lookup method in the ActiveWarehouse::Dimension class. Then we could do
this in our ETL: MyDimension.sk_lookup(bk) to get the sk. Or we could do
MyDimension.sk_lookup(bk, effective_date) for SCDs.

 

SCDs. This is another place where the main logic should be in the model. I
want to be able to hand the data to my dimension and it will insert and/or
update the table accordingly. The ETL process can handle getting the data
into the dimensional format because it may come from many different places,
but SCD handling will be the same for all data sources and can be handled by
the model.

 

Convention Over Configuration. I think there is a ton of room for
time-saving conventions, especially with regards to SCDs. 

 

start_date, end_date, and enabled columns should be used to track the
effective dates and which is the current row. I would argue that these
columns should be created automatically (like id).

bk - there should be a convention for naming the business key. I think the
column should be required-though the data type may vary, I can't think of a
dimension that doesn't have a business key.

All columns (except id, bk, start_date, end_date, and enabled) are assumed
to be type 1 SCDs. This could be overridden by something like `default_scd
:historic`.

However, columns prefixed by

-'current_' => type 1 - the current value

-'first_' => never change - the first value

-'historic_' => type 2 - the value valid between the start and end dates

-'previous_' => the value valid just before the current value

-'current_minus_number_' => the value valid 'number' before the current
value

-'first_plus_number_' => the value valid 'number' after the first value

 

The validate method of the dimension model should ensure that a particular
record does not violate the SCD rules (i.e. no overlapping
start_dates/end_dates, etc.)

 

There should also be some support for source systems that already have some
form of history tracking (like acts_as_versioned).

 

 

 

Ok, so those are my thoughts. Please remember, when I say 'should', I don't
mean that ActiveWarehouse ETL must be done this way. These are all ideas
that have come to me over the past while in working on data warehousing. I
think they might have some merit, but I want to discuss them with the
community before I start flushing them out and working on any of them.

 

I am really excited about helping to develop ActiveWarehouse. I look forward
to your comments (both positive and negative)

 

Paul

 

 

 

 

_______________________________________________
Activewarehouse-discuss mailing list
Activewarehouse-discuss@rubyforge.org
http://rubyforge.org/mailman/listinfo/activewarehouse-discuss

Reply via email to