[DataCleaner-notify] Re: [datacleaner.org] Best practice for build complex pipelines?

Kasper Sørensen Thu, 17 Dec 2015 22:50:41 -0800

New reply on DataCleaner's online discussion forum 
(http://datacleaner.org/forum):


Kasper Sørensen replied to subject 'Best practice for build complex pipelines?'

-------------------

Hi Steve,

Thanks for the good question(s). I hope I can guide you in the right direction 
with these answers.

Regarding the country standardizer - the things you mention (capitalize, 
synonym lookup and convert to 3-char code) can _all_ be done by the single 
Country standardizer component, unless you have some very odd/specific 
synonyms that our built-in dictionary has never seen. So I imagine you can 
decrease complexity a little here by replacing your 3 components with just 1.

Now to the bigger question around filtering and processing GBR and DRK records 
in different ways. If you want to do all this in one job, then you should be 
able to. But of course underlying your question is also the best practice 
question about when to do what. Honestly I don't think there's a silver bullet, 
but at the very least make sure that your jobs are not needlessly complex.

To bring the streams back together, Union will probably one day be able to do 
it, but in its current form that component is really meant for unioning two 
source tables. Instead, take a look at the component called "Fuse / coalesce 
fields".
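
To illustrate the general idea of coalescing (this is only a sketch in Python, 
not how the DataCleaner component is implemented): each branch of the job 
fills its own output field, and the fused field takes the first value that is 
not null. The record layout and the field names name_gbr and name_other are 
made up for the example.

```python
from typing import Optional


def coalesce(*values: Optional[str]) -> Optional[str]:
    """Return the first value that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical records: each one was handled by exactly one branch,
# so only one of the two branch-specific fields is populated.
rows = [
    {"name_gbr": "Alice Smith", "name_other": None},
    {"name_gbr": None, "name_other": "Bob Jensen"},
    {"name_gbr": None, "name_other": None},
]

# Fuse the two branch outputs back into a single field.
fused = [coalesce(row["name_gbr"], row["name_other"]) for row in rows]
```

After this, fused holds one value per record regardless of which branch 
produced it, and records that no branch touched stay null.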

If you decide that you would rather split up the jobs to make complexity a bit 
lower, then here's a suggestion:

 * First make a job which only does country standardization - since this is the 
field you want to filter on. This job should just apply the Country 
standardizer and then use Update table to write the standardized country code 
back into your data.
 * Then make a job for each country code (or set of codes) you want to process. 
Here you can use the Equals filter, applied directly to the source data - which 
means DataCleaner can optimize the query towards this table by pushing down the 
equals condition to a WHERE clause. That means you lose practically no 
performance by running two jobs compared to one, because each job only 
processes the relevant records.
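
The two-job split can be sketched outside DataCleaner like this, using Python 
with an in-memory SQLite table. Everything here is an assumption made for 
illustration: the customers table, its columns, and the tiny COUNTRY_CODES 
lookup that stands in for the Country standardizer. The point is only to show 
what "pushing down the equals condition to a WHERE clause" buys you: the 
database hands the second job just the matching rows.

```python
import sqlite3

# Hypothetical stand-in for the Country standardizer's built-in dictionary.
COUNTRY_CODES = {"uk": "GBR", "united kingdom": "GBR", "france": "FRA"}


def standardize(raw: str):
    """Map a raw country value to a 3-char code, or None if unknown."""
    return COUNTRY_CODES.get(raw.strip().lower())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, country TEXT, country_code TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Alice", "UK"), ("Bob", "France"), ("Carol", "united kingdom")],
)

# Job 1: standardize the country and write the code back to the table.
for row_id, raw in conn.execute("SELECT id, country FROM customers").fetchall():
    conn.execute(
        "UPDATE customers SET country_code = ? WHERE id = ?",
        (standardize(raw), row_id),
    )
conn.commit()

# Job 2: the equals condition is part of the query itself, so only
# GBR records are ever read by this job.
gbr_rows = conn.execute(
    "SELECT name FROM customers WHERE country_code = ? ORDER BY id",
    ("GBR",),
).fetchall()
gbr_names = [name for (name,) in gbr_rows]
```

A second job for another country code would run the same query with a 
different parameter, again reading only its own records.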

-------------------

View the topic online to reply - go to 
http://datacleaner.org/topic/1098/Best-practice-for-build-complex-pipelines%3F

