Hi Brian --

Note that there's a 'make clobber' in addition to 'make clean'. In principle, 'make clean' is meant to remove unnecessary intermediate files while leaving things in a working condition, whereas 'make clobber' is meant to return things close(r) to what's in the repo.

I've been stuck in meetings the past few days, so I haven't even had a chance to read this thread yet, though it does sound, on the surface, like there may be a bug. Maybe in 'make clean', but perhaps more likely in how dependences are calculated.
-Brad

On Thu, 14 May 2015, Brian Guarraci wrote:

> OK, so I started from a clean distribution and CHPL_TARGET_ARCH=native is now running without the CHPL_MEM error. I think there must be a bug somewhere in 'make clean' (if I were using a git repo, I would have used 'git clean -fdx' and probably avoided this problem). I'm glad it's a build issue.
>
> Now looking into perf issues with the binary compiled with --fast.
>
> On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci <[email protected]> wrote:
>
>> Yesterday I did multiple full clean builds (for various combos), suspecting what you suggest. I think there are some bugs, and I need to dig in to provide better clues for this list. I'm using the official 1.11.0 build.
>>
>> One strange symptom was that using -nl 1 showed this error, but -nl 16 just hung indefinitely, with no CPU or network activity. I tried this with no optimizations as well, with the same result. It could also be partly related to the weird GASNet issue I mentioned.
>>
>> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]> wrote:
>>>
>>> Hi Brian -
>>>
>>> I've seen errors like that if I don't run 'make' again in the compiler directory after setting the CHPL_TARGET_ARCH environment variable. You need to run 'make' again since the runtime builds with the target architecture setting.
>>>
>>> Of course, this and the CHPL_MEM errors could be bugs...
>>>
>>> Hope that helps,
>>>
>>> -michael
>>>
>>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>>>
>>>> Quick, partial follow-up: I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the following error:
>>>>
>>>>   /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19: /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak: No such file or directory
>>>>   make: *** No rule to make target `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'. Stop.
>>>>   error: compiling generated source
>>>>
>>>> I tweaked the gasnet folder names but never got it to work. Subsequent attempts to run the compiled program resulted in "set CHPL_MEM to a more appropriate mem type". I tried setting the mem type to a few different choices, but that didn't help. Needs more investigation.
>>>>
>>>> I did manage to run some local --fast tests, though, and the --local version ran in 5s while the --no-local version ran in about 12s.
>>>>
>>>> Additionally, I played around with the 'local' keyword, and that also seems to make some difference, at least locally. I need to try this on the distributed version when I stand it back up.
>>>>
>>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>>>>>
>>>>> For well-written programs, most of the --local vs. --no-local differences show up as CPU overhead rather than network overhead. I.e., we tend not to do unnecessary communications; we simply execute extra scalar code to determine that communication is unnecessary, and the presence of this code hinders the back-end compiler's ability to optimize the per-node computation.
>>>>>
>>>>> Here's an example: a given array access like A[i] may not know whether the access is local or remote, so the compiler will introduce communication-related code to disambiguate. Even if that code doesn't generate communication, it can be ugly enough to throw the back-end C compiler off.
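To make the A[i] example above concrete, here is a small, hand-written sketch (Chapel 1.11-era syntax; the Block distribution, array name, and size are invented for illustration, not taken from Brian's code) that asks the same local-or-remote question the compiler has to answer for every access under --no-local:

    use BlockDist;

    config const n = 8;

    // A Block-distributed array: element i lives on whichever locale owns index i.
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] int;

    // Count how many elements of A are stored remotely from the current locale.
    // This is the per-access question that --no-local code must settle at run
    // time whenever the compiler can't prove the answer statically.
    var numRemote = 0;
    for i in D do
      if A[i].locale != here then
        numRemote += 1;
    writeln(numRemote, " of ", n, " elements are remote from locale ", here.id);

Run on one locale, the count is 0; run with -nl 2 or more and some of the elements become remote, which is why the generated code can't simply emit a plain load for A[i].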
>>>>> Some workarounds to deal with this:
>>>>>
>>>>> * the 'local' block (documented in doc/release/technotes/README.local) -- this is a big hammer and likely to be replaced with more data-centric capabilities going forward, but it can be helpful in the meantime if you can get it working in a chunk of code.
>>>>>
>>>>> * I don't know how fully fleshed out these features are, but there are at least draft capabilities for .localAccess() and .localSlice() methods on some array types to reduce overheads like the ones in my simple example above. I.e., if I know that A[i] is local for a given distributed array, A.localAccess(i) is likely to give better performance.
>>>>>
>>>>> But maybe I should start with a higher-level question: what kinds of data structures does your code use, and what types of idioms do you use to get multi-locale executions going? (E.g., distributed arrays + foralls? Or more manually-distributed data structures + task parallelism + on-clauses?)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Brad
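A rough sketch of what those two workarounds might look like in practice. The Block distribution, array name, and use of localSubdomain() are illustrative assumptions rather than anything from Brian's code, and, as Brad notes, 'local' blocks can be finicky and .localAccess() is a draft capability that may not exist on every array type:

    use BlockDist;

    config const n = 1000;

    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] int;

    // Workaround 1: the 'local' block.  Everything inside is asserted to touch
    // only data on the current locale, so the compiler can drop the wide-pointer
    // and communication checks for that chunk of code (when checks are enabled,
    // the program halts at run time if the assertion turns out to be false).
    coforall loc in Locales do on loc {
      const myInds = A.localSubdomain();  // indices owned by this locale
      local {
        for i in myInds do
          A[i] = i;
      }
    }

    // Workaround 2: a per-access assertion via .localAccess(), skipping the
    // local-vs-remote disambiguation for an access known to be local.  Each
    // iteration of a forall over a Block domain runs on the locale that owns
    // the index, so the assertion holds here.
    forall i in D do
      A.localAccess(i) = 2 * A.localAccess(i);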
>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>
>>>>>> I was aware of the ongoing progress in optimizing the comm, but I'll take a look at your docs. I'll also give the --local vs. --no-local experiment a try.
>>>>>>
>>>>>> I tested the network layer and saw my nodes were operating near peak network capacity, so it wasn't a transport issue. Yes, the CPUs (of which there are 4 per node) were nearly fully pegged. Considering the level of complexity of the code, I suspect it was mostly overhead. I was even looking for a way to pin the execution to a single proc, as I wonder if there was some kind of thrashing going on between procs. The funny thing was the more I tried to optimize the program to do less network traffic, the slower it got.
>>>>>>
>>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Brian --
>>>>>>>
>>>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get you better results as long as you're not cross-compiling. Alternatively, you can set it to 'none', which will squash the warning you're getting. In any case, I wouldn't expect the lack of --specialize optimizations to be the problem here (but if you're throwing components of --fast manually, you'd want to be sure to add -O in addition to --no-checks).
>>>>>>>
>>>>>>> Generally speaking, Chapel programs compiled for --no-local (multi-locale execution) tend to generate much worse per-node code than those compiled for --local (single-locale execution), and this is an area of active optimization effort. See the "Performance Optimizations and Generated Code Improvements" release note slides at:
>>>>>>>
>>>>>>>   http://chapel.cray.com/download.html#releaseNotes
>>>>>>>   http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>>>
>>>>>>> and particularly the section entitled "the 'local field' pragma" for more details on this effort (it starts at slide 34).
>>>>>>>
>>>>>>> In a nutshell, the Chapel compiler conservatively assumes that things are remote rather than local when in doubt (to emphasize correctness over fast but incorrect programs), and then gets into doubt far more often than it should. We're currently working on tightening up this gap.
>>>>>>>
>>>>>>> This could explain the full difference in performance that you're seeing, or something else may be happening. One way to check into this might be to run a --local vs. --no-local execution with CHPL_COMM=none to see how much overhead is added. The fact that all CPUs are pegged is a good indication that you don't have a problem with load balance or with distributing data/computation across nodes, I'd guess?
>>>>>>>
>>>>>>> -Brad
>>>>>>>
>>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>>>
>>>>>>>> I should add that I did supply --no-checks and that helped about 10%.
>>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> It says:
>>>>>>>>>
>>>>>>>>>   warning: --specialize was set, but CHPL_TARGET_ARCH is 'unknown'. If you want any specialization to occur please set CHPL_TARGET_ARCH to a proper value.
>>>>>>>>>
>>>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>>>
>>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Brian --
>>>>>>>>>>
>>>>>>>>>> Getting --fast working should definitely be the first priority. What about it fails to work?
>>>>>>>>>>
>>>>>>>>>> -Brad
>>>>>>>>>>
>>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I've been testing my search index on my 16-node ARM system and have been running into some strange behavior. The cool part is that the locale partitioning concept seems to work well; the downside is that the system is very slow. I've rewritten the approach a few different ways and haven't made a dent, so I wanted to ask a few questions.
>>>>>>>>>>>
>>>>>>>>>>> On the ARM processors, I can only use FIFO tasks and can't optimize (--fast doesn't work). Is this going to significantly affect cross-locale performance?
>>>>>>>>>>>
>>>>>>>>>>> I've looked at the generated C code and tried to minimize the _comm_ operations in core methods, but it doesn't seem to help. Network usage is still quite low (100K/s) while the CPUs are pegged. Are there any profiling tools I can use to understand what might be going on here?
>>>>>>>>>>>
>>>>>>>>>>> Generally, on my laptop or a single node, I can index about 1.1MM records in under 10s. With 16 nodes, it takes 10min to do 100k records.
>>>>>>>>>>>
>>>>>>>>>>> I'm wondering if there's some systemic issue at play here and how I can further investigate.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Brian
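For reference, the two idiom families Brad asks about above (distributed arrays + foralls vs. manually-distributed data + task parallelism + on-clauses) look roughly like this. The sketch is illustrative only: the Block distribution, the array and variable names, and the process() stand-in for the indexer's per-record work are all invented, not taken from Brian's code:

    use BlockDist;

    config const numRecords = 1000;

    // Hypothetical stand-in for whatever per-record work the indexer does.
    proc process(r: int) { return r * r; }

    // Idiom 1: distributed arrays + foralls.  Data placement and iteration
    // placement both follow the Block domain map; communication is implicit.
    const D = {1..numRecords} dmapped Block(boundingBox={1..numRecords});
    var records, results: [D] int;
    forall i in D do
      results[i] = process(records[i]);

    // Idiom 2: manually-distributed data + task parallelism + on-clauses.
    // Each locale holds a plain (non-distributed) local array and works on it
    // with its own tasks; any cross-locale data movement is written explicitly.
    coforall loc in Locales do on loc {
      const myCount = numRecords / numLocales;  // ignoring any remainder for simplicity
      var myRecords, myResults: [1..myCount] int;
      forall i in 1..myCount do
        myResults[i] = process(myRecords[i]);
    }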
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
