quick update: here are some stats i scraped over the past week of ALL pull request builder projects and timeout failures. due to the large number of spark ghprb jobs, i don't have great records earlier than oct 7th. the data is current up until ~2:30pm today:

spark and new spark ghprb total builds vs git fetch timeouts:

$ for x in 10-{09..17}; do
    passed=$(grep $x SORTED.passed | grep -i spark | wc -l)
    failed=$(grep $x SORTED | grep -i spark | wc -l)
    let total=passed+failed
    fail_percent=$(echo "scale=2; $failed/$total" | bc | sed "s/^\.//g")
    line="$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%"
    echo -e $line
  done
10-09 -- total builds: 140    p/f: 92/48    fail%: 34%
10-10 -- total builds: 65     p/f: 59/6     fail%: 09%
10-11 -- total builds: 29     p/f: 29/0     fail%: 0%
10-12 -- total builds: 24     p/f: 21/3     fail%: 12%
10-13 -- total builds: 39     p/f: 35/4     fail%: 10%
10-14 -- total builds: 7      p/f: 5/2      fail%: 28%
10-15 -- total builds: 37     p/f: 34/3     fail%: 08%
10-16 -- total builds: 71     p/f: 59/12    fail%: 16%
10-17 -- total builds: 26     p/f: 20/6     fail%: 23%

all other ghprb builds vs git fetch timeouts:

$ for x in 10-{09..17}; do
    passed=$(grep $x SORTED.passed | grep -vi spark | wc -l)
    failed=$(grep $x SORTED | grep -vi spark | wc -l)
    let total=passed+failed
    fail_percent=$(echo "scale=2; $failed/$total" | bc | sed "s/^\.//g")
    line="$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%"
    echo -e $line
  done
10-09 -- total builds: 16     p/f: 16/0     fail%: 0%
10-10 -- total builds: 46     p/f: 40/6     fail%: 13%
10-11 -- total builds: 4      p/f: 4/0      fail%: 0%
10-12 -- total builds: 2      p/f: 2/0      fail%: 0%
10-13 -- total builds: 2      p/f: 2/0      fail%: 0%
10-14 -- total builds: 10     p/f: 10/0     fail%: 0%
10-15 -- total builds: 5      p/f: 5/0      fail%: 0%
10-16 -- total builds: 5      p/f: 5/0      fail%: 0%
10-17 -- total builds: 0      p/f: 0/0      fail%: 0%

note: the 15th was the day i rolled back to the earlier version of the git plugin. it doesn't seem to have helped much, so i'll probably bring us back up to the latest version soon.

also note: rocking some floating point math on the CLI! ;)

i also compared the distribution of git timeout failures vs time of day, and there appears to be no correlation.
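
side note on the per-day percentages above: the same number can be had with plain shell integer arithmetic, which also avoids a divide-by-zero on a day with no builds (like the 10-17 non-spark row). this is only a sketch; `fail_pct` is a hypothetical helper name and was not part of the commands that were actually run.

```shell
# hypothetical helper (not from the original commands): integer failure
# percentage from a passed count and a failed count.
fail_pct() {
  local passed=$1
  local failed=$2
  local total=$((passed + failed))
  if [ "$total" -eq 0 ]; then
    # no builds that day: report 0 rather than dividing by zero
    echo 0
  else
    # integer division truncates, matching bc with scale=2
    echo $(( failed * 100 / total ))
  fi
}

fail_pct 92 48   # 10-09 spark numbers, prints 34
fail_pct 0 0     # 10-17 non-spark numbers, prints 0
```

single-digit results come out as `9` rather than `09`, since there is no leading-dot stripping involved.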
the failures are pretty evenly distributed over each hour of the day.

we could be hitting the rate limit due to the ghprb hitting github a couple of times for each build, but we're averaging ~10-20 builds per hour (a build hits github 2-4 times, from what i can tell, so that's roughly 20-80 requests per hour). i'll have to look more into this on monday, but suffice to say we may need to move from unauthenticated https fetches to authenticated requests. this means retrofitting all of our jobs. yay! fun! :)

another option is to have local mirrors of all of the repos. the problem w/this is that there might be a window where changes haven't made it to the local mirror yet and tests run against the stale copy. more fun stuff to think about...

now that i have some stats, and a list of all of the times/dates of the failures, i will be drafting my email to github and firing that off later today or first thing monday.

have a great weekend everyone!

shane, who spent way too much time on the CLI and is ready for some beer.

On Thu, Oct 16, 2014 at 1:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> On Thu, Oct 16, 2014 at 3:55 PM, shane knapp <skn...@berkeley.edu> wrote:
>
>> i really, truly hate non-deterministic failures.
>
> Amen bruddah.