Run time is becoming very important nowadays. We have been able to run mice() to impute a dataset of 2 million records and 20 variables, but we needed to break the data up into subsets and then glue the results together at the end.
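A minimal sketch of that split-impute-combine approach (illustrative only: the data frame name, the number of chunks, and the mice() settings are placeholders, not what we actually used):

    # Impute a large data frame in row-wise chunks, then glue the completed
    # data back together for each of the m imputations.
    library(mice)

    chunk_id <- cut(seq_len(nrow(bigdata)), breaks = 10, labels = FALSE)
    chunks   <- split(bigdata, chunk_id)

    imps <- lapply(chunks, function(d) mice(d, m = 5, maxit = 5, printFlag = FALSE))

    # For imputation i, complete every chunk and bind the rows
    completed <- lapply(1:5, function(i) {
      do.call(rbind, lapply(imps, complete, action = i))
    })

Note that each chunk is imputed with a model fitted to that chunk only, so this is a pragmatic compromise rather than an exact equivalent of imputing the full dataset at once.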
Some relevant posts:

http://stackoverflow.com/questions/24040280/parallel-computation-of-multiple-imputation-by-using-mice-r-package
http://stackoverflow.com/questions/26154260/has-anyone-tried-to-parallelize-multiple-imputation-in-mice-package
https://github.com/griverorz/parmice
https://github.com/gerkovink/parlMICE/blob/master/parlMICE.R

The mice algorithm itself can be quite fast, but it cannot use the sweep operator (which would result in even faster imputations). The core of mice was written in 1999 and, of course, does not include modern-day niceties like dplyr, parallel, data.table or others. Anyone wishing to speed up the core is welcome to provide pull requests at https://github.com/stefvanbuuren/mice.
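A minimal sketch of the parallel approach described in those posts (illustrative only; it assumes the parallel package and mice's ibind(), and 'mydata' and the number of cores are placeholders):

    # Run the m imputations as separate single-imputation mice() calls on
    # several cores, then combine the resulting mids objects.
    library(mice)
    library(parallel)

    cl <- makeCluster(4)
    clusterEvalQ(cl, library(mice))
    clusterExport(cl, "mydata")

    imp_list <- parLapply(cl, 1:4, function(seed) {
      mice(mydata, m = 1, maxit = 5, seed = seed, printFlag = FALSE)
    })
    stopCluster(cl)

    imp <- Reduce(ibind, imp_list)  # one mids object holding m = 4 imputations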
Hope this helps,
Stef

From: Impute -- Imputations in Data Analysis <[email protected]> on behalf of "William E Winkler (CENSUS/CSRM FED)" <[email protected]>
Reply-To: Impute -- Imputations in Data Analysis <[email protected]>
Date: Monday, May 8, 2017 8:45 PM
To: "[email protected]" <[email protected]>
Subject: Re: Run time

Paul. I do not think that run-time optimization is an issue with most commercial software vendors. https://www.linkedin.com/profile/view?id=ADcAAAHBP9UB-04Ux69OvAdWLeOpWdMv1Su8BjY&authType=name&authToken=2nFb

A key aspect of the generalized systems is increasing the speed of certain components by factors of 100-1000. The methods are based on new mathematics and computational algorithms. Most methods of parallel computing will not yield speed-ups. These new algorithms involve likelihood computations and certain optimizations, as in integer programming, that are well known not to be suitable for parallel computation in general.
https://www.researchgate.net/publication/315769374_Examples_of_Computational_Speed_Improvements_for_Institutional_Use_of_Generalized_Software

This is a talk I gave at the IEEE International Conference on Data Mining and later at the Isaac Newton Institute: https://sites.google.com/site/dinaworkshop2015/invited-speakers The emphasis is on speed.

Regards.
Bill

________________________________
From: Impute -- Imputations in Data Analysis <[email protected]> on behalf of Paul von Hippel <[email protected]>
Sent: Monday, May 8, 2017 2:35 PM
To: [email protected]
Subject: Re: Run time

Thank you. There seems to be a disconnect between the run times discussed in the literature and the run times experienced by end users. For example, when imputing large datasets it is not unusual for users to report that Stata's mi impute commands take hours or even days. This is not because the users are doing anything wrong. The mi impute command doesn't scale well, and I'm not sure how much this is due to the algorithms (EM, MCMC, chained) or to Stata's habit of keeping more data than necessary in memory. Earlier versions of SAS PROC MI were often slow as well. But PROC MI seems to have improved, and on large data it's certainly a lot faster than Stata's mi impute. I'm not sure whether SAS has improved its code or simply benefited from increases in processor speed and memory. It's important to recognize run time as a legitimate concern of end users. Long run times discourage use of MI.

Best wishes,
Paul von Hippel
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
(512) 537-8112

On Mon, May 8, 2017 at 1:23 PM, William E Winkler (CENSUS/CSRM FED) <[email protected]> wrote:

Paul. This may be a somewhat appropriate comparison in the case of EM-based methods. In my edit/imputation course, I begin by running an EM algorithm for a 600,000-cell contingency table, with 200 iterations and epsilon 10^-12, in 45 seconds on a basic Core i7 PC. I challenge the students to write a comparable algorithm in R or SAS that converges in less than one week. If the time to produce the model is x and you want N copies of the output plus the additional processing for MI, then the total time is approximately x + y, where y is a relatively small amount of time needed to finalize the MI part. Drawing N copies from the model takes a fraction of the time needed to create the model. Another timing is given at the end.
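To make that kind of computation concrete, here is a toy EM in R for a two-way contingency table in which some records are missing one of the two variables (purely illustrative; the table here is tiny, the counts are made up, and this is not the DISCRETE system mentioned below):

    # EM for the cell probabilities of an I x J contingency table when some
    # records have only the row variable or only the column variable observed.
    # n_ab: I x J counts from fully observed records
    # n_a : length-I counts with only the row variable observed
    # n_b : length-J counts with only the column variable observed
    em_table <- function(n_ab, n_a, n_b, maxit = 200, eps = 1e-12) {
      I <- nrow(n_ab); J <- ncol(n_ab)
      n <- sum(n_ab) + sum(n_a) + sum(n_b)
      p <- matrix(1 / (I * J), I, J)          # start from the uniform table
      for (it in seq_len(maxit)) {
        # E-step: spread the partially observed counts over cells in
        # proportion to the current conditional probabilities
        exp_a <- n_a * p / rowSums(p)
        exp_b <- t(n_b * t(p) / colSums(p))
        # M-step: re-estimate the cell probabilities from the expected counts
        p_new <- (n_ab + exp_a + exp_b) / n
        if (max(abs(p_new - p)) < eps) { p <- p_new; break }
        p <- p_new
      }
      p
    }

    # Example with made-up counts
    n_ab <- matrix(c(40, 10, 5, 45), 2, 2)
    round(em_table(n_ab, n_a = c(8, 12), n_b = c(6, 14)), 3)

Once p has converged, drawing completed copies for MI is essentially multinomial sampling from the fitted table, which is far cheaper than the fitting itself.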
The EM-based methods are compared to the full Bayesian methods in the two JASA papers. The EM-type methods can be used to draw multiple copies of the data if necessary. The Bayesian methods are generally superior. The compute-intense part of the methods is the creation of the limiting distributions (models). Drawing extra copies from the models can be very fast.

Regards.
Bill

There have been recent major advances in edit/imputation that compare favorably with our edit methods (Winkler 1995, 1997ab, 2003, 2008, 2010). Our methods were already powerful and have been the fastest in the world.

Kim, H., Cox, L. H., Karr, A. F., Reiter, J. P., and Wang, Q. (2014), Simultaneous Edit-Imputation for Continuous Microdata, Journal of the American Statistical Association, 110, 987-999.

Manrique-Vallier, D. and Reiter, J. P. (2017), Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data, Journal of the American Statistical Association, online version available September 16, 2016.

The 2017 paper improves on the preservation of joint distributions over our methods but needs some additional enhancements. Their methods are 2000-8000 times as slow as our methods. Using the DISCRETE system (Winkler 1997, 2003, 2008, 2010) on a server with 20 CPUs, we can process the Decennial short form in less than twelve hours.

________________________________
From: Impute -- Imputations in Data Analysis <[email protected]> on behalf of Paul von Hippel <[email protected]>
Sent: Monday, May 8, 2017 12:41 PM
To: [email protected]
Subject: Run time

Does anyone know of work on the run time of different MI algorithms? Every MI user knows that some MI software can be slow on large datasets, but it's not something I've seen discussed in the MI literature.

Best wishes,
Paul von Hippel
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
(512) 537-8112
