New reply on DataCleaner's online discussion forum (http://datacleaner.org/forum):
Kasper Sørensen replied to subject 'Best practice for build complex pipelines?'
-------------------
Hi Steve,

Thanks for the good question(s). I hope I can guide you in the right direction with these answers.

Regarding the country standardizer: the things you mention (capitalize, synonym lookup and convert to 3-char code) can _all_ be done by the single Country standardizer component, unless you have some very odd/specific synonyms that our built-in dictionary has never encountered. So I imagine you can decrease complexity a little here by replacing your 3 components with just 1.

Now to the bigger question around filtering and processing GBR and DRK records in different ways. If you want to do all this in one job, then you should be able to. But of course underlying your question is also the best-practice question about when to do what. Honestly I don't think there's a silver bullet, but at least keep an eye on making sure that your jobs are not needlessly complex.

To bring streams back together, Union will probably one day be able to do it, but in its shape today that component is really meant for unioning two source tables. Rather, you should take a look at the component called "Fuse / coalesce fields".

If you decide that you would rather split up the jobs to reduce complexity, then here's a suggestion:

* First make a job which only does country standardization, since this is the field you want to filter on. This job should just apply the Country standardizer and then use Update table to write the standardized country code back into your data.
* Then make a job for each (set of) country code(s) you want to process. Here you can use the Equals filter and apply it to the source data, which means that the query towards this table can be optimized by pushing down the equals condition to a WHERE clause.
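The push-down idea in the second step can be sketched outside DataCleaner. This is not DataCleaner code: it is a minimal Python/sqlite3 illustration, and the customers table, its columns and the two helper functions are hypothetical names chosen for the example. It compares reading every row and filtering afterwards with letting the database apply the WHERE clause.

```python
import sqlite3

# Hypothetical source table with an already standardized country code.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "GBR"), ("Bob", "DRK"), ("Carol", "GBR"), ("Dan", "USA")],
)


def read_all_then_filter(conn, code):
    """Fetch every row, then apply the equals condition in the client."""
    rows = conn.execute(
        "SELECT id, name, country FROM customers ORDER BY id"
    ).fetchall()
    matching = [row for row in rows if row[2] == code]
    return len(rows), matching


def read_with_pushdown(conn, code):
    """Let the database apply the equals condition as a WHERE clause."""
    rows = conn.execute(
        "SELECT id, name, country FROM customers WHERE country = ? ORDER BY id",
        (code,),
    ).fetchall()
    return len(rows), rows


fetched_naive, gbr_naive = read_all_then_filter(conn, "GBR")
fetched_pushed, gbr_pushed = read_with_pushdown(conn, "GBR")

print("rows fetched without push-down:", fetched_naive)
print("rows fetched with push-down:", fetched_pushed)
print("same result either way:", gbr_naive == gbr_pushed)
```

Both reads return the same GBR records, but the pushed-down query only transfers the matching rows from the table.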
That means you practically don't lose any performance by running two jobs compared to one, because each job only processes the relevant records.
-------------------
View the topic online to reply - go to http://datacleaner.org/topic/1098/Best-practice-for-build-complex-pipelines%3F
