Thank you. There seems to be a disconnect between the runtimes discussed in the literature and the runtimes experienced by end users. For example, when imputing large datasets it is not unusual for users to report that Stata's *mi impute* commands take hours or even days. This is not because the users are doing anything wrong. The *mi impute* commands simply don't scale well, and I'm not sure how much of this is due to the algorithms (EM, MCMC, chained equations) or to Stata's habit of keeping more data than necessary in memory.
Earlier versions of SAS PROC MI were often slow as well. But PROC MI seems to have improved, and on large datasets it is certainly much faster than Stata's *mi impute*. I'm not sure whether SAS has improved its code or simply benefited from increases in processor speed and memory. It's important to recognize runtime as a legitimate concern of end users. Long runtimes discourage the use of MI.

Best wishes,
Paul von Hippel
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
(512) 537-8112

On Mon, May 8, 2017 at 1:23 PM, William E Winkler (CENSUS/CSRM FED) <[email protected]> wrote:

> Paul,
>
> This may be a somewhat appropriate comparison in the case of EM-based methods. In my edit/imputation course, I begin by running an EM algorithm for a 600,000-cell contingency table for 200 iterations with epsilon 10^-12 in 45 seconds on a basic Core i7 PC. I challenge the students to write a comparable algorithm in R or SAS that converges in less than one week. If the time to produce the model is x and you want N copies of the output plus the additional processing for MI, then the total time is approximately x + y, where y is a relatively small amount of time to finalize the MI part. Drawing N copies from the model takes a fraction of the time needed to create the model.
>
> Another timing is given at the end. The EM-based methods are compared to the full Bayesian methods in the two JASA papers. The EM-type methods can be used to draw multiple copies of the data if necessary. The Bayesian methods are generally superior. The compute-intensive part of the methods is the creation of the limiting distributions (models). Drawing extra copies from the models can be very fast.
>
> Regards, Bill
>
> There have been recent major advances in edit/imputation that compare favorably with our edit methods (Winkler 1995, 1997ab, 2003, 2008, 2010). Our methods have long been powerful and have been the fastest in the world.
>
> Kim, H., Cox, L. H., Karr, A. F., Reiter, J. P., and Wang, Q. (2014), Simultaneous Edit-Imputation for Continuous Microdata, JASA, 110, 987-999.
>
> Manrique-Vallier, D. and Reiter, J. (2017), Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data, *Journal of the American Statistical Association*, online version available September 16, 2016.
>
> The 2017 paper improves on the preservation of joint distributions over our methods but needs some additional enhancements. Their methods are 2,000-8,000 times slower than ours. Using the DISCRETE system (Winkler 1997, 2003, 2008, 2010) on a server with 20 CPUs, we can process the Decennial short form in less than twelve hours.
>
> ------------------------------
> *From:* Impute -- Imputations in Data Analysis <[email protected]> on behalf of Paul von Hippel <[email protected]>
> *Sent:* Monday, May 8, 2017 12:41 PM
> *To:* [email protected]
> *Subject:* Run time
>
> Does anyone know of work on the run time of different MI algorithms? Every MI user knows that some MI software can be slow in large datasets, but it's not something I've seen discussed in the MI literature.
>
> Best wishes,
> Paul von Hippel
> LBJ School of Public Affairs
> Sid Richardson Hall 3.251
> University of Texas, Austin
> 2315 Red River, Box Y
> Austin, TX 78712
> (512) 537-8112
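To make the x + y timing argument in Winkler's reply concrete, here is a minimal sketch in Python of the general pattern he describes: fit contingency-table cell probabilities by EM from categorical records with missing entries, then draw several completed copies from the fitted model. It assumes a toy saturated multinomial model with invented data; the array sizes, tolerance, and function names are illustrative assumptions only, and this is not the DISCRETE system or either of the JASA methods cited above.

```python
# Minimal sketch (illustrative assumptions throughout): EM for the cell
# probabilities of a toy saturated multinomial contingency-table model
# with missing categorical entries, then draws of completed datasets.
import time
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 categorical variables with 4, 5, and 6 levels (120 cells).
shape = (4, 5, 6)
n_records = 2_000
data = np.column_stack([rng.integers(0, k, n_records) for k in shape]).astype(float)
data[rng.random(data.shape) < 0.2] = np.nan      # ~20% of entries missing

def em_fit(data, shape, max_iter=200, eps=1e-12):
    """EM for cell probabilities under a saturated multinomial model."""
    probs = np.full(shape, 1.0 / np.prod(shape))
    for _ in range(max_iter):
        expected = np.zeros(shape)
        for row in data:
            observed = ~np.isnan(row)
            # E-step: spread the record over all cells consistent with its
            # observed values, in proportion to the current probabilities.
            idx = tuple(int(v) if o else slice(None) for o, v in zip(observed, row))
            block = probs[idx]
            expected[idx] += block / block.sum()
        new_probs = expected / expected.sum()        # M-step
        if np.max(np.abs(new_probs - probs)) < eps:  # convergence check
            return new_probs
        probs = new_probs
    return probs

def draw_completed_copy(data, probs, rng):
    """Draw one completed dataset from the fitted cell probabilities."""
    completed = data.copy()
    for row in completed:
        observed = ~np.isnan(row)
        if observed.all():
            continue
        idx = tuple(int(v) if o else slice(None) for o, v in zip(observed, row))
        block = probs[idx]
        flat = rng.choice(block.size, p=(block / block.sum()).ravel())
        row[~observed] = np.unravel_index(flat, block.shape)
    return completed.astype(int)

t0 = time.perf_counter()
probs = em_fit(data, shape)
x = time.perf_counter() - t0   # time to build the model

t0 = time.perf_counter()
copies = [draw_completed_copy(data, probs, rng) for _ in range(5)]
y = time.perf_counter() - t0   # time to draw 5 completed copies

print(f"model fit: {x:.2f}s, 5 draws: {y:.2f}s  (total ~ x + y, with y << x)")
```

Even in this toy version, the EM loop accounts for essentially all of the wall-clock time, while drawing five completed datasets from the fitted cell probabilities is nearly instantaneous, which is the x + y pattern described in the thread.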
