Hi Bob,

> What's the best way to factor out and reuse portions of the ETL pipeline?

As Darin said, you can conditionally declare sources, transforms and
destinations when needed.

You can also factor out useful code to a 'common.rb', shared across all your
scripts:

def some_macro_transform(args...)
  transform :the_column, :default, :default_value => ''...
  transform :...
  transform ..
end

Or, as another example:

def parse_french_date(string)
  Date.strptime(string,'%d/%m/%Y')
end
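
As a rough sketch of how such a helper could be hardened for messy CSV input:
the parse_french_date_or_nil name and the choice to return nil on bad input
are assumptions made for illustration, not something from my actual common.rb.

```ruby
require 'date'

# Parses a day/month/year string such as '14/07/2009'.
def parse_french_date(string)
  Date.strptime(string, '%d/%m/%Y')
end

# Hypothetical variant: returns nil instead of raising when the
# value is nil, blank or not a valid date.
def parse_french_date_or_nil(string)
  return nil if string.nil? || string.strip.empty?
  parse_french_date(string.strip)
rescue ArgumentError
  # Date.strptime raises ArgumentError (or a subclass) on invalid input
  nil
end
```

Whether to return nil or let the error stop the run depends on how strict
you want the load to be.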

To go further, each of my ctl files starts roughly the same way (note the
reused helpers: output_file, bulk_file, preprocessed_file and
CsvHelper.get_fields_names):

table = 'the_table_for_this_ctl'
source_file = 'the_data.csv'
output_file = output_file(source_file)
bulk_file = bulk_file(source_file)

# converted from latin1 to utf-8
preprocessed_file = preprocessed_file(source_file)
available_fields = CsvHelper.get_fields_names(preprocessed_file)

These helper methods and classes are reusable, unit-tested assets.
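
To give an idea of what helpers like output_file, bulk_file and
preprocessed_file might look like, here is a minimal sketch. The 'output'
directory, the file suffixes and the derived_path helper are all assumptions
for illustration; the real implementations are not shown in this message.

```ruby
# Hypothetical shared helper: builds a path in `dir` from the source
# file's base name plus a suffix. Directory and suffixes are assumptions.
def derived_path(source_file, suffix, dir = 'output')
  base = File.basename(source_file, '.*') # 'the_data.csv' -> 'the_data'
  File.join(dir, "#{base}#{suffix}")
end

# File written by the ETL destination (assumed naming).
def output_file(source_file)
  derived_path(source_file, '.out.txt')
end

# File handed to the bulk loader (assumed naming).
def bulk_file(source_file)
  derived_path(source_file, '.bulk.txt')
end

# utf-8 copy of the latin1 source (assumed naming; conversion not shown).
def preprocessed_file(source_file)
  derived_path(source_file, '.utf8.csv')
end
```

Keeping the naming rules in one place means every ctl file derives its
paths the same way, and the rules can be unit-tested without running a
pipeline.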

To pass conditions to your scripts, one useful trick is to set a value in
ENV before launching them:

# be sure to pick a non-conflicting name
ENV['datawarehouse_mode'] = 'streaming'
raise "error" unless system("myctl.ctl")

(and myctl.ctl would then test ENV['datawarehouse_mode'] == 'streaming'
inside).
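
On the receiving side, the ENV lookup can itself live in the shared file, so
each ctl does not repeat the string comparison. This is only a sketch: the
helper names and the 'batch' default are assumptions made for illustration.

```ruby
# Hypothetical helpers for common.rb. `env` defaults to the real ENV but
# can be replaced by a plain Hash in unit tests.
def datawarehouse_mode(env = ENV)
  # 'batch' as the fallback mode is an assumption for this sketch
  env.fetch('datawarehouse_mode', 'batch')
end

def streaming_mode?(env = ENV)
  datawarehouse_mode(env) == 'streaming'
end
```

A ctl file could then wrap the declarations that only apply to one mode in
"if streaming_mode? ... end".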

cheers!

Thibaut Barrère
--
[freelance developer] http://www.logeek.fr
[blog] http://evolvingworker.com - tools for a better day
[blog] http://blog.logeek.fr - about writing software
_______________________________________________
Activewarehouse-discuss mailing list
Activewarehouse-discuss@rubyforge.org
http://rubyforge.org/mailman/listinfo/activewarehouse-discuss