Run time is becoming very important nowadays. We have been able to run mice() to impute a dataset of 2 million records and 20 variables, but we needed to break the data up into subsets and then glue the results together at the end.
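A minimal sketch of that split-impute-combine approach (illustrative only: the data frame name, the number of chunks, and the mice() settings are placeholders, not what we actually used):

    # Impute a large data frame in row-wise chunks, then glue the completed
    # data back together for each of the m imputations.
    library(mice)

    chunk_id <- cut(seq_len(nrow(bigdata)), breaks = 10, labels = FALSE)
    chunks   <- split(bigdata, chunk_id)

    imps <- lapply(chunks, function(d) mice(d, m = 5, maxit = 5, printFlag = FALSE))

    # For imputation i, complete every chunk and bind the rows
    completed <- lapply(1:5, function(i) {
      do.call(rbind, lapply(imps, complete, action = i))
    })

Note that each chunk is imputed with a model fitted to that chunk only, so this is a pragmatic compromise rather than an exact equivalent of imputing the full dataset at once.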
Some relevant posts:

http://stackoverflow.com/questions/24040280/parallel-computation-of-multiple-imputation-by-using-mice-r-package
http://stackoverflow.com/questions/26154260/has-anyone-tried-to-parallelize-multiple-imputation-in-mice-package
https://github.com/griverorz/parmice
https://github.com/gerkovink/parlMICE/blob/master/parlMICE.R

The mice algorithm itself can be quite fast, but it cannot use the sweep operator (which would result in even faster imputations). The core of mice was written in 1999 and, of course, does not include modern-day niceties like dplyr, parallel, data.table or others. Anyone wishing to speed up the core is welcome to provide pull requests at https://github.com/stefvanbuuren/mice.
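A minimal sketch of the parallel approach described in those posts (illustrative only; it assumes the parallel package and mice's ibind(), and 'mydata' and the number of cores are placeholders):

    # Run the m imputations as separate single-imputation mice() calls on
    # several cores, then combine the resulting mids objects.
    library(mice)
    library(parallel)

    cl <- makeCluster(4)
    clusterEvalQ(cl, library(mice))
    clusterExport(cl, "mydata")

    imp_list <- parLapply(cl, 1:4, function(seed) {
      mice(mydata, m = 1, maxit = 5, seed = seed, printFlag = FALSE)
    })
    stopCluster(cl)

    imp <- Reduce(ibind, imp_list)  # one mids object holding m = 4 imputations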
Hope this helps,
Stef

From: Impute -- Imputations in Data Analysis <[email protected]> on behalf of "William E Winkler (CENSUS/CSRM FED)" <[email protected]>
Reply-To: Impute -- Imputations in Data Analysis <[email protected]>
Date: Monday, May 8, 2017 8:45 PM
To: "[email protected]" <[email protected]>
Subject: Re: Run time

Paul. I do not think that run-time optimization is an issue with most commercial software vendors. https://www.linkedin.com/profile/view?id=ADcAAAHBP9UB-04Ux69OvAdWLeOpWdMv1Su8BjY&authType=name&authToken=2nFb

A key aspect of the generalized systems is increasing the speed of certain components by factors of 100-1000. The methods are based on new mathematics and computational algorithms. Most methods of parallel computing will not yield speed-ups. These new algorithms involve likelihood computations and certain optimizations, as in integer programming, that are well known not to be suitable for parallel computation in general.
https://www.researchgate.net/publication/315769374_Examples_of_Computational_Speed_Improvements_for_Institutional_Use_of_Generalized_Software

This is a talk I gave at the IEEE International Conference on Data Mining and later at the Isaac Newton Institute: https://sites.google.com/site/dinaworkshop2015/invited-speakers The emphasis is on speed.

Regards.
Bill

________________________________
From: Impute -- Imputations in Data Analysis <[email protected]> on behalf of Paul von Hippel <[email protected]>
Sent: Monday, May 8, 2017 2:35 PM
To: [email protected]
Subject: Re: Run time

Thank you. There seems to be a disconnect between the run times discussed in the literature and the run times experienced by end users. For example, when imputing large datasets it is not unusual for users to report that Stata's mi impute commands take hours or even days. This is not because the users are doing anything wrong. The mi impute command doesn't scale well, and I'm not sure how much this is due to the algorithms (EM, MCMC, chained) or to Stata's habit of keeping more data than necessary in memory. Earlier versions of SAS PROC MI were often slow as well. But PROC MI seems to have improved, and on large data it's certainly a lot faster than Stata's mi impute. I'm not sure whether SAS has improved its code or simply benefited from increases in processor speed and memory. It's important to recognize run time as a legitimate concern of end users. Long run times discourage use of MI.

Best wishes,
Paul von Hippel
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
(512) 537-8112

On Mon, May 8, 2017 at 1:23 PM, William E Winkler (CENSUS/CSRM FED) <[email protected]> wrote:

Paul. This may be a somewhat appropriate comparison in the case of EM-based methods. In my edit/imputation course, I begin by running an EM algorithm for a 600,000-cell contingency table, with 200 iterations and epsilon 10^-12, in 45 seconds on a basic Core i7 PC. I challenge the students to write a comparable algorithm in R or SAS that converges in less than one week. If the time to produce the model is x and you want N copies of the output plus the additional processing for MI, then the total time is approximately x + y, where y is a relatively small amount of time needed to finalize the MI part. Drawing N copies from the model takes a fraction of the time needed to create the model. Another timing is given at the end.
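To make that kind of computation concrete, here is a toy EM in R for a two-way contingency table in which some records are missing one of the two variables (purely illustrative; the table here is tiny, the counts are made up, and this is not the DISCRETE system mentioned below):

    # EM for the cell probabilities of an I x J contingency table when some
    # records have only the row variable or only the column variable observed.
    # n_ab: I x J counts from fully observed records
    # n_a : length-I counts with only the row variable observed
    # n_b : length-J counts with only the column variable observed
    em_table <- function(n_ab, n_a, n_b, maxit = 200, eps = 1e-12) {
      I <- nrow(n_ab); J <- ncol(n_ab)
      n <- sum(n_ab) + sum(n_a) + sum(n_b)
      p <- matrix(1 / (I * J), I, J)          # start from the uniform table
      for (it in seq_len(maxit)) {
        # E-step: spread the partially observed counts over cells in
        # proportion to the current conditional probabilities
        exp_a <- n_a * p / rowSums(p)
        exp_b <- t(n_b * t(p) / colSums(p))
        # M-step: re-estimate the cell probabilities from the expected counts
        p_new <- (n_ab + exp_a + exp_b) / n
        if (max(abs(p_new - p)) < eps) { p <- p_new; break }
        p <- p_new
      }
      p
    }

    # Example with made-up counts
    n_ab <- matrix(c(40, 10, 5, 45), 2, 2)
    round(em_table(n_ab, n_a = c(8, 12), n_b = c(6, 14)), 3)

Once p has converged, drawing completed copies for MI is essentially multinomial sampling from the fitted table, which is far cheaper than the fitting itself.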
The EM-based methods are compared to the full Bayesian methods in the two JASA papers. The EM-type methods can be used to draw multiple copies of the data if necessary. The Bayesian methods are generally superior. The compute-intense part of the methods is the creation of the limiting distributions (models). Drawing extra copies from the models can be very fast.

Regards.
Bill

There have been recent major advances in edit/imputation that compare favorably with our edit methods (Winkler 1995, 1997ab, 2003, 2008, 2010). Our methods were already powerful and have been the fastest in the world.

Kim, H., Cox, L. H., Karr, A. F., Reiter, J. P., and Wang, Q. (2014), Simultaneous Edit-Imputation for Continuous Microdata, Journal of the American Statistical Association, 110, 987-999.

Manrique-Vallier, D. and Reiter, J. P. (2017), Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data, Journal of the American Statistical Association, online version available September 16, 2016.

The 2017 paper improves on the preservation of joint distributions over our methods but needs some additional enhancements. Their methods are 2000-8000 times as slow as our methods. Using the DISCRETE system (Winkler 1997, 2003, 2008, 2010) on a server with 20 CPUs, we can process the Decennial short form in less than twelve hours.

________________________________
From: Impute -- Imputations in Data Analysis <[email protected]> on behalf of Paul von Hippel <[email protected]>
Sent: Monday, May 8, 2017 12:41 PM
To: [email protected]
Subject: Run time

Does anyone know of work on the run time of different MI algorithms? Every MI user knows that some MI software can be slow on large datasets, but it's not something I've seen discussed in the MI literature.

Best wishes,
Paul von Hippel
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
(512) 537-8112
