Hi Bob,

> What's the best way to factor out and reuse portions of the ETL pipeline?
As Darin said, you can conditionally declare sources, transforms and destinations when needed. You can also factor out useful code into a 'common.rb', shared across all your scripts:

    def some_macro_transform(*args)
      transform :the_column, :default, :default_value => ''
      transform :...
      transform ...
    end

or again:

    def parse_french_date(string)
      Date.strptime(string, '%d/%m/%Y')
    end

To go further, each of my ctl files starts roughly the same way (note the reused logic, marked between asterisks):

    table = 'the_table_for_this_ctl'
    source_file = 'the_data.csv'
    output_file = *output_file*(source_file)
    bulk_file = *bulk_file*(source_file) # converted from latin1 to utf-8
    preprocessed_file = *preprocessed_file*(source_file)
    available_fields = *CsvHelper.get_fields_names*(preprocessed_file)

The methods and classes marked between asterisks are reusable, unit-tested assets.

To pass conditions to your scripts, one useful trick is that you can hand data to a script through ENV:

    ENV['datawarehouse_mode'] = 'streaming' # be sure to pick a non-conflicting name
    raise "error" unless system("myctl.ctl")

(and myctl.ctl would use `if ENV['datawarehouse_mode'] == 'streaming'` inside).

cheers!

Thibaut Barrère
--
[freelance developer] http://www.logeek.fr
[blog] http://evolvingworker.com - tools for a better day
[blog] http://blog.logeek.fr - about writing software
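To make the common.rb idea concrete, here is a minimal sketch of what such a file could contain. The `output_file` and `bulk_file` naming scheme here is an assumption for illustration (the original post elides those details); adapt it to your own layout:

```ruby
# common.rb -- sketch of shared, unit-testable helpers required from each .ctl
require 'date'

# Parse a French-formatted date (dd/mm/yyyy) into a Date object.
def parse_french_date(string)
  Date.strptime(string, '%d/%m/%Y')
end

# Derive the converted-output file name from the source file name.
# (Hypothetical naming convention, for illustration only.)
def output_file(source_file)
  source_file.sub(/\.csv\z/, '.output.csv')
end

# Derive the bulk-load file name from the source file name.
# (Hypothetical naming convention, for illustration only.)
def bulk_file(source_file)
  source_file.sub(/\.csv\z/, '.bulk.csv')
end
```

Each ctl script would then pull this in with something like `require File.join(File.dirname(__FILE__), 'common')` before using the helpers, so the logic lives (and is tested) in exactly one place.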
_______________________________________________
Activewarehouse-discuss mailing list
Activewarehouse-discuss@rubyforge.org
http://rubyforge.org/mailman/listinfo/activewarehouse-discuss