Second installment, again looking for suggestions and additions. The whole document is expected to be collected and uploaded to some wiki.
Part III will cover the analysis of the data. Note that for lifetimes, the example we are addressing here, we have a combination of known data (or narrow interval-censored data) and pure right-censored data, and that recent hosts contribute to the right-censored part in greater proportion than old hosts. Note also that, according to reliability theory, Weibull and gamma distributions are the most appropriate for this kind of data.

=================================================================================
BOINC STATISTICAL RESEARCH HOW-TO, part II
=================================================================================

II PREVIEWING/PLOTTING THE DATA

II.0 -- Preconditioning the data

In the typical session, the example datafile is all, or a random selection, of s...@home public data, with every row whose rpc_time or create_time equals 0 removed.

In bash, i.e. on both Linux and Mac, you can rely on awk or sed, as well as cat, nl, cut, paste, grep -o, grep -v, etc. In Windows? No idea :-(

II.1 --- Gnuplot Recipes

II.1.a -- A typical session

plot 'createrpcsin0.dat' using (int(($2-$1)/3600/24)):(1) smooth frequency
plot 'createrpcsin0.dat' using ($2-$1):(1):(3600*24) smooth kdensity
#gnuplot "histograms" do not support logscale. You need some external ad hoc utility :-(
#Let's get data straight from mysql...
plot '<mysql calculos -e "select truncate((rpc_time-create_time)/3600/24,0) as lf, count(*) as c from intervalos where rpc_time>create_time and create_time>0 group by lf" ' using 1:2
set logscale y
replot
wei(x) = N*(x**(alpha-1))*exp(-(x/delta)**alpha)
lwei(x)= logN + (alpha-1)* log(x) - (x/delta)**alpha
fit [1:1000] lwei(x) '<mysql mydatabase --skip-column-names -e "select truncate((rpc_time-create_time)/3600/24,0) as lf, count(*) as c from hosttable where rpc_time>create_time and create_time>0 group by lf" ' using 1:(log($2+0.01)) via logN,alpha,delta
replot exp(lwei(x))
plot 'createrpcsin0.dat' using 1:2 with dots
#question: is it possible to do, in gnuplot, a density map from 2D scatter data?

II.1.b -- Generic tricks

New versions of gnuplot have the terminal type "canvas", for HTML5-compatible output.

At http://pelican.rsvs.ulaval.ca/mediawiki/index.php/Making_density_maps_using_Gnuplot you can copy and paste a Python script to do density maps.

II.1.c -- Tricks when working with BOINC data

II.2 --- R Recipes

II.2.a -- A typical session

datos=read.table("createrpcsin0.dat", col.names=c("createtime", "rpctime"), strip.white=TRUE)
datos$lifetimes <- with(datos, rpctime-createtime)
summary(datos)
summary(datos$lifetimes/3600/24)
hist(datos$lifetimes)
plot(hist(datos$lifetimes)$counts, log="y", type="h")
plot(hist(datos$lifetimes,breaks=c(seq(0,max(datos$lifetimes)+3600*24*7,3600*24*7)))$counts,log="y", type="h")
plot(density(datos$lifetimes))
plot(density(datos$lifetimes),log="y")
plot(density(datos$lifetimes/3600/24/7),log="y")
ds <-density(datos$lifetimes)
plot(ds$x, ds$y/(max(ds$x)-ds$x), log="y", type="h")
truncated <- datos$lifetimes[ (datos$rpctime<max(datos$rpctime)-3600*24*30 ) ]
ds2 <- density(truncated)
plot(ds2$x, ds2$y/(max(ds$x)-ds2$x), log="y", type="h")
plot(hist(datos$createtime,breaks=c(seq(min(datos$createtime),max(datos$createtime)+3600*24*7,3600*24*7)))$counts, log="y", type="h")
plot(hist(datos$rpctime,breaks=c(seq(min(datos$rpctime),max(datos$rpctime)+3600*24*7,3600*24*7)))$counts, log="y", type="h")
#
# It is possible to smooth 3d plots by using two different techniques, sm.density from
# the library sm or kde2d from the library MASS. The former is not available in all the distributions,
# the latter needs a lot of memory
#
library(MASS) # alternative: sm.density, from library(sm)
crelifes.density <- kde2d(datos$createtime,datos$lifetimes)
#### Error: cannot allocate vector of size 521.0 Mb
crelifes.density <- kde2d(datos$createtime,datos$lifetimes,n=15)
contour(crelifes.density)
filled.contour(crelifes.density)
with(crelifes.density, contour(x,y,log(z)))
with(crelifes.density, contour(x,y,log(z),nlevels=150))
filled.contour(crelifes.density,nlevels=250)

II.2.b -- Generic tricks

Note that many results in R keep the points to be plotted in two vectors, x and y, inside the returned structure, and you can access them directly. So you have density(...)$x and density(...)$y, for instance. You can use summary(...) to get a view of such content.

II.2.c -- Tricks when working with BOINC data

II.3 -- Octave recipes ?
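
Going back to II.0: the preconditioning step (dropping rows where rpc_time or create_time is 0) could be sketched in bash with awk roughly as below. This is only a sketch under assumptions: the input is taken to be a whitespace-separated dump whose first two columns are create_time and rpc_time, in that order, and the function name drop_zero_rows and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: column 1 = create_time, column 2 = rpc_time,
# both Unix timestamps, separated by whitespace.

# Print only the rows where both timestamps are non-zero.
# (drop_zero_rows is a hypothetical helper name, not part of BOINC.)
drop_zero_rows() {
    awk '$1 != 0 && $2 != 0 { print $1, $2 }' "$1"
}

# Tiny made-up sample, just to exercise the filter.
sample=$(mktemp "${TMPDIR:-/tmp}/boinc_sample.XXXXXX")
printf '%s\n' \
    '1100000000 1200000000' \
    '0 1200000000' \
    '1150000000 0' \
    '1180000000 1190000000' > "$sample"

# Only the first and last sample rows survive the filter.
drop_zero_rows "$sample"
```

On real data the output would be redirected to a file, e.g. drop_zero_rows hosts_dump.txt > createrpcsin0.dat, where hosts_dump.txt is a hypothetical name for the raw two-column dump. The result has the column layout that the gnuplot and R sessions above read from createrpcsin0.dat.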
