A few days ago I stumbled upon a topic "Formula wanted to estimate 
'minimize' run times 
<https://groups.google.com/forum/#!topic/metamath/sODHzKvNsgs>" by Norman 
where he suggested an interesting problem of finding an approximate formula 
for minimization run time based on proof length and label number. In theory, 
it would allow one to schedule minimization jobs more evenly, which should 
make a difference when performing full set.mm minimizations.

I decided to give this task a shot, but in the end results turned out to be 
more dull than I had hoped. I think these findings are still worth 
reporting, as maybe someone with better data analysis skills will get 
interested in the problem. So, my procedure was as follows.

First of all, I prepared the data as described in the original topic by 
Norman. Basically, I took run times from the minimization log by Dylan 
Houlihan and augmented them with label numbers and proof sizes, getting a 
total of 28438 items (a few had to be removed since, for instance, 
4exdistrOLD wasn't present in the set.mm version linked in the post). Label 
numbers in mathboxes were shifted up by 1070 (as suggested in the original 
post) to compensate for the fact that 'minimize' doesn't scan mathboxes by 
default.

Perhaps not surprisingly, the majority of theorems (~76%) take less than 15 
seconds to minimize. The mean minimization time is ~45s and the standard 
deviation is ~228s, so already the "formula" predicting 0s for every theorem 
gives us a "small" mean error :-) Intuitively it seems that the formula 
ought to look like c*label^a*size^b, where a, b and c are some constants. I 
tried various curve-fitting techniques available in Wolfram Mathematica; 
for instance, it suggested the following: 
0.0000811892*label^0.405776*size^1.33858, which has mean error ~35 and 
standard deviation ~155. A simpler hand-found formula, 
4.33e-9*label*size^1.7, produces a comparable mean error ~25 and standard 
deviation ~180.
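
For reference, such a power law can also be fitted without Mathematica. 
Below is a minimal Python sketch (assuming NumPy) that fits 
c*label^a*size^b by least squares in log space; note that this minimizes 
relative rather than absolute error, so the coefficients it finds will 
differ somewhat from the ones quoted above:

```python
import numpy as np

def fit_power_law(labels, sizes, times):
    """Fit time ~ c * label**a * size**b via least squares on
    log(time) = log(c) + a*log(label) + b*log(size)."""
    X = np.column_stack([np.ones(len(labels)), np.log(labels), np.log(sizes)])
    y = np.log(np.maximum(times, 0.5))  # clamp sub-second run times
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.exp(coef[0]), coef[1], coef[2]  # c, a, b
```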

These formulae have a small "global" coefficient (~1e-9 in the latter 
example), which suggests that perhaps one should differentiate between the 
small-run-time and large-run-time cases. I set the "small" class to be 
theorems taking less than 15s to minimize and trained a basic linear 
classifier to distinguish it from the "large" class. I then curve-fitted a 
formula of the form c*label^a*size^b separately in the small and large 
cases to obtain the following prediction formula:

If -0.0572153 + 2.68348e-6*label + 0.000206234*size < 0.5 then 
4.92e-6*label^0.68*size^1.03 else 0.000479*label^0.32*size^1.25,

which has mean error ~28 and standard deviation ~156. Clearly this is not a 
large improvement over the simpler formulae listed above.
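
In code, this prediction formula is just a few lines. Here is a Python 
transcription, using the fitted coefficients verbatim:

```python
def predict_time(label, size):
    """Predicted minimization time (seconds) from label number and proof
    size: a linear classifier picks the 'small' or 'large' run-time
    branch, then the corresponding power-law fit is applied."""
    if -0.0572153 + 2.68348e-6 * label + 0.000206234 * size < 0.5:
        return 4.92e-6 * label ** 0.68 * size ** 1.03   # small-time branch
    return 0.000479 * label ** 0.32 * size ** 1.25      # large-time branch
```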

That is more or less what I have for approximating minimization run times. 
Now, the intended application of such an approximation is scheduling 
minimization jobs evenly, so let's switch our attention to that.

I assumed that we have 16 CPUs (processes, threads, etc) available, and we 
schedule our theorems for minimization by the Longest Processing Time (LPT) 
algorithm (each task is assigned to the CPU with the smallest total load). 
During the scheduling phase we obviously estimate times by one of our 
formulae, but after the theorems have been distributed across CPUs, we can 
evaluate our procedure by computing the real run times. The relevant 
parameters are the average CPU load, which is always ~79495s, and the 
latency, which I define to be the difference between the maximum and 
minimum CPU time (the smaller the latency, the closer the CPU loads are). 
It is useful to compare our results 
with a straightforward scheduling approach where one randomly partitions 
theorems into 16 (almost) equal lists. In my experiments such an approach 
yields latency ~39397, so it can be taken as a baseline.
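
The experiment can be sketched in a few lines of Python. The heap-based 
greedy below is the standard LPT implementation (not necessarily what I 
actually ran); each job is assumed to be a (predicted_time, real_time) 
pair, so scheduling uses the formula's estimate while evaluation uses the 
real run time:

```python
import heapq

def lpt_schedule(jobs, n_cpus=16):
    """LPT: process jobs in decreasing order of predicted time, always
    assigning to the CPU with the smallest predicted load so far.
    Each job is a (predicted_time, real_time) pair; returns the
    resulting real load per CPU."""
    heap = [(0.0, cpu) for cpu in range(n_cpus)]  # (predicted load, cpu id)
    heapq.heapify(heap)
    real_loads = [0.0] * n_cpus
    for predicted, real in sorted(jobs, reverse=True):
        load, cpu = heapq.heappop(heap)
        real_loads[cpu] += real
        heapq.heappush(heap, (load + predicted, cpu))
    return real_loads

def latency(loads):
    """Difference between the most and least loaded CPU."""
    return max(loads) - min(loads)
```

Using proof size itself as the predictor then simply corresponds to jobs of 
the form (size, real_time).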

So, the first formula (together with LPT scheduling) produces the following 
list of CPU run times:

[76126,83042,81573,90121,72944,79387,74600,84677,78886,69388,86309,83830,69727,74918,81779,84567],

which has latency ~20733 and standard deviation ~6055. The second (hand-found) 
formula gives latency ~32342 and standard deviation ~8781, which is an 
arguably worse result (the latency is not much different from the random 
approach). Finally, the last (piecewise) formula gives latency ~40756 and 
standard deviation ~9684, which is worse than the random approach.

What is, in my opinion, surprising is that the simpler formula 'size' (i.e. 
minimization time is proportional to the proof size; notice that we don't 
need a multiplicative constant since we only sum and compare values in our 
scheduling algorithm) produces latency ~19218 and standard deviation ~6388, 
which is on par with the first formula found by curve-fitting. I have tried 
other simple expressions without great success, 
so this seems like the best result I can get.

That is how far I could get with my basic knowledge of statistics. It seems 
the end result is that one can simply estimate a theorem's minimization 
time by its proof size and then use well-known scheduling algorithms to 
assign jobs :-) I couldn't find any reports on the 
minimization run performed by Dylan Houlihan (for instance, I don't know 
the "latency" parameter for that run or even the number of threads used), 
so I don't know whether the approach I'm suggesting has any benefits over 
the method used by Norman and Dylan. Nevertheless, I hope this small report 
was interesting to read and maybe it will motivate other people to try to 
come up with a better formula and/or approach. Here's the dataset I used: 
https://gist.github.com/savask/900161e9d093a5b787b6473113dfde16. I'm also 
looking forward to any questions or criticism regarding approaches used 
here. I can't boast a good knowledge of statistics or data science, so my 
"analysis" should be taken with a grain of salt; I will be happy to hear 
other people's opinions.
