Alex,

At HackathonFive, you expressed interest in luigi for ETL.

As I mentioned, GROUSE ETL runs in parallel (12 to 24 luigi workers) with the 
200M patients broken into 200 groups. Tasks are then broken down by patient 
group:


class _BeneIdGrouped(luigi.WrapperTask):
    group_tasks = cast(List[Type[CMSRIFUpload]], [])  # abstract

    def requires(self) -> List[luigi.Task]:
        deps = []  # type: List[luigi.Task]
        for group_task in self.group_tasks:
            survey = BeneIdSurvey()
            deps += [survey]
            results = survey.results()
            if results:
                deps += [
                    group_task(
                        group_num=ntile.chunk_num,
                        group_qty=len(results),
                        bene_id_qty=ntile.bene_id_qty,
                        bene_id_first=ntile.bene_id_first,
                        bene_id_last=ntile.bene_id_last)
                    for ntile in results
                ]
        return deps

...

class InpatientStays(_BeneIdGrouped):
    group_tasks = [MEDPAR_Upload, MAXDATA_IP_Upload]



-- https://github.com/kumc-bmi/grouse/blob/master/etl_i2b2/cms_pd.py#L1554-L1573


where MEDPAR_Upload is a luigi task that handles one group of patients (one 
"upload" in the sense of the i2b2 upload_status table).
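
To make the grouping concrete, here is a minimal sketch of how a sorted list of 
patient ids could be cut into contiguous groups, each described by its first id, 
last id, and count. It is not the GROUSE code: the real survey lives in 
BeneIdSurvey and is not shown here. The names Ntile and survey, the in-memory 
list of ids, and the 1-based chunk numbering are assumptions for illustration 
only; the field names are borrowed from the task parameters above.

```python
from typing import List, NamedTuple


class Ntile(NamedTuple):
    """One patient group (hypothetical; fields mirror the task parameters)."""
    chunk_num: int
    bene_id_qty: int
    bene_id_first: str
    bene_id_last: str


def survey(bene_ids: List[str], group_qty: int) -> List[Ntile]:
    """Split ids into at most group_qty contiguous, near-equal groups."""
    ordered = sorted(bene_ids)
    base, extra = divmod(len(ordered), group_qty)
    results = []  # type: List[Ntile]
    start = 0
    for chunk_num in range(1, group_qty + 1):
        # The first `extra` groups take one additional id each.
        size = base + (1 if chunk_num <= extra else 0)
        if size == 0:
            continue  # fewer ids than groups: skip empty groups
        chunk = ordered[start:start + size]
        results.append(Ntile(chunk_num, len(chunk), chunk[0], chunk[-1]))
        start += size
    return results


if __name__ == "__main__":
    # Made-up ids, purely for demonstration.
    ids = ['B%04d' % n for n in range(1, 11)]
    for ntile in survey(ids, 3):
        print(ntile)
```

Each resulting record carries enough to parameterize one per-group task (a 
first/last id range plus a count), so the groups can be handed to separate 
workers without overlapping.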

cms_pd.py has reasonably complete module documentation; it was the basis of one 
or two design reviews, along with the 
README<https://github.com/kumc-bmi/grouse/blob/master/etl_i2b2/README.md> usage 
notes, the 
CONTRIBUTING<https://github.com/kumc-bmi/grouse/blob/master/etl_i2b2/CONTRIBUTING.md>
 design and maintenance notes, and the 
grouse_tables<https://github.com/kumc-bmi/grouse/blob/master/grouse_tables.csv> 
index / cheat-sheet.



--
Dan

_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
http://listserv.kumc.edu/mailman/listinfo/gpc-dev
