> It begs the question why the bounds check is inside the loop. > Can't you check against shape?
There is no alternative. In g{x where g is a vector and ** it doesn't know that g is the grade of x **, it has to check every element of g for index error. > I'm aware of branch mis predict costs. Then you should be able to solve the minimum of a vector puzzle. :-) On Wed, Mar 5, 2014 at 3:04 PM, Yike Lu <yikelu.h...@gmail.com> wrote: > I'm aware of branch mis predict costs. > > It begs the question why the bounds check is inside the loop. Can't you > check against shape? > On Mar 5, 2014 4:57 PM, "Roger Hui" <rogerhui.can...@gmail.com> wrote: > > > Compare the two: > > > > Sorting (fixed typos!): > > > > I4 count[M]; > > memset(count,0x00,sizeof(count)); > > for(i=0;i<n;++i)count[*x++]++; > > for(i=0;i<M;++i){m=count[i]; for(j=0;j<m;++j)*z++=i;} > > > > > > Indexing: > > > > #define ASSERT(p,stmt) {if(!(p)){stmt;} > > for(i=0;i<n;++i){j=*g++; ASSERT(0<=j&&j<n,INDEX_ERROR); *z++=x[j];} > > > > > > Branching in modern CPUs is relatively > > expensive<https://en.wikipedia.org/wiki/Branch_predictor>. > > Apparently the rest of the world is catching up to APL/J. See also > > Knuth's Two Notes on > > Notation<http://arxiv.org/PS_cache/math/pdf/9205/9205211v1.pdf> (he > > calls it Iverson's convention). > > > > In the sort program there isn't a branch in sight, other than possibly in > > the looping, and the C compiler is pretty good at that. In the indexing, > > you have an if statement, in which (in this case) 100% of the time the > > branch is not taken. A branch which is 100% (or 0%? I forget) not taken > is > > particularly slow. > > > > > > > > > > > > On Wed, Mar 5, 2014 at 2:41 PM, Yike Lu <yikelu.h...@gmail.com> wrote: > > > > > Roger, why is index lookup so slow in the g2 case (see my message > > earlier)? > > > Are you really saying that bounds checking makes it so that the lookup > is > > > only as fast as the sort? > > > > > > > > > On Wed, Mar 5, 2014 at 4:35 PM, Roger Hui <rogerhui.can...@gmail.com> > > > wrote: > > > > > > > > I4 count[M]; > > > > > memset(count,0x00,sizeof(count)); > > > > > for(i=0;i<n;++i)count[*x++]=1; > > > > > > > > More typos: > > > > > > > > for(i=0;i<n;++i)count[*x++]++; > > > > > > > > > > > > > > > > > > > > On Wed, Mar 5, 2014 at 2:19 PM, Roger Hui <rogerhui.can...@gmail.com > > > > > > wrote: > > > > > > > > > Good answers. For /:~x vs. g{x, the explanations are: > > > > > > > > > > - Indexing must check for index error. Sorting does not. > > > > > - Indexing uses random read access over a large chunk of memory > > > (i.e. > > > > > x). Sort does not. > > > > > > > > > > A more detailed explanation: To sort over a small known universe > > (and > > > > > characters definitely qualify), you basically compute #/.~x (the > > > ordering > > > > > is wrong, but you get the idea). In C: > > > > > > > > > > I4 count[M]; > > > > > memset(count,0x00,sizeof(count)); > > > > > for(i=0;i<n;++i)count[*x++]=1; > > > > > > > > > > > > > > > This is lightning fast on modern CPUs: sequential read access and > no > > > > > branch prediction fails. (And the ordering is correct.) Once > having > > > the > > > > > counts, as Henry said, you can do count#a. or in C: > > > > > > > > > > for(i=0;i<M;++i){m=count[j]; for(j=0;j<m;++j)*z++=i;} > > > > > > > > > > > > > > > Also lightning fast with very localized reads. > > > > > > > > > > It's ironic that in school sorting is an application with heavy > > > emphasis > > > > > on comparisons, counting # of comparisons, etc. In the method > above, > > > > there > > > > > is not a single comparison involving x. I once told someone that I > > can > > > > > sort 4-byte integers and 8-byte IEEE floats in linear time. He > > looked > > > at > > > > > me like I was crazy, presumably remembering from school that > sorting > > > was > > > > > PROVEN to take n log n comparisons. > > > > > > > > > > As for why sorting is faster than grading, see > > > > > http://www.jsoftware.com/jwiki/Essays/Sorting_versus_Grading > > > > > > > > > > Now, for those of you who know C (or other scalar language), is > > there a > > > > > faster way to find the minimum of a vector of small integers x > > (2-byte > > > > > integers, say) than the following: > > > > > > > > > > min=-32768; > > > > > for(i=0;i<n;++i){if(min>*x)min=*x; ++x;} > > > > > > > > > > > > > > > I know an alternative which is 70% faster. No fancy SSE > > instructions. > > > > No > > > > > multicore. No loop unrolling. > > > > > > > > > > > > > > > > > > > > ---------------------------------------------------------------------- > > > > For information about J forums see > http://www.jsoftware.com/forums.htm > > > > > > > ---------------------------------------------------------------------- > > > For information about J forums see http://www.jsoftware.com/forums.htm > > > > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > > > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm