A problem with 'make clean' and different configurations isn't too surprising. By the way, you can always 'make clobber' to do even more cleaning.
-michael

On 5/14/15, 12:03 PM, "Brian Guarraci" <[email protected]> wrote:

>OK, so I started from a clean distribution, and with
>CHPL_TARGET_ARCH=native it's now running without the CHPL_MEM error. I
>think there must be a bug somewhere in 'make clean' (if I were using a
>git repo, I would have used 'git clean -fdx' and probably avoided this
>problem). I'm glad it's a build issue.
>
>Now looking into perf issues with a binary compiled with --fast.
>
>On Tue, May 12, 2015 at 8:12 PM, Brian Guarraci <[email protected]> wrote:
>
>Yesterday, I did multiple full clean builds (for various combos),
>suspecting what you suggest. I think there are some bugs, and I need to
>dig in to provide better clues for this list. I'm using the official
>1.11.0 build.
>
>One strange symptom was that using -nl 1 showed this error, but -nl 16
>just hung indefinitely, with no CPU or network activity. I tried this
>with no optimizations as well, with the same result. It could also be
>partly related to the weird GASNet issue I mentioned.
>
>> On May 12, 2015, at 7:40 PM, Michael Ferguson <[email protected]> wrote:
>>
>> Hi Brian -
>>
>> I've seen errors like that if I don't run 'make' again in the
>> compiler directory after setting the CHPL_TARGET_ARCH environment
>> variable. You need to run 'make' again since the runtime builds
>> with the target architecture setting.
>>
>> Of course, this and the CHPL_MEM errors could be bugs...
>>
>> Hope that helps,
>>
>> -michael
>>
>>> On 5/12/15, 10:03 AM, "Brian Guarraci" <[email protected]> wrote:
>>>
>>> A quick, partial follow-up:
>>>
>>> I recompiled Chapel with CHPL_TARGET_ARCH=native and hit the
>>> following error:
>>>
>>> /home/ubuntu/src/chapel-1.11.0/runtime/etc/Makefile.comm-gasnet:19:
>>> /home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak:
>>> No such file or directory
>>> make: *** No rule to make target
>>> `/home/ubuntu/src/chapel-1.11.0/third-party/gasnet/install/linux32-gnu-native/seg-everything/nodbg/include/udp-conduit/udp-par.mak'.
>>> Stop.
>>> error: compiling generated source
>>>
>>> I tweaked the gasnet folder names but haven't gotten it to work yet.
>>> Subsequent attempts to run the compiled program resulted in "set
>>> CHPL_MEM to a more appropriate mem type". I tried setting the mem
>>> type to a few different choices, but that didn't help. This needs
>>> more investigation.
>>>
>>> I did manage to run some local --fast tests, though: the --local
>>> version ran in 5s, while the --no-local version ran in about 12s.
>>>
>>> Additionally, I played around with the 'local' keyword, and that
>>> also seems to make some difference, at least locally. I need to try
>>> this on the distributed version when I stand it back up.
>>>
>>>> On Mon, May 11, 2015 at 1:05 PM, Brad Chamberlain <[email protected]> wrote:
>>>>
>>>> For well-written programs, most of the --local vs. --no-local
>>>> differences show up as CPU overhead rather than network overhead.
>>>> I.e., we tend not to do unnecessary communication; we simply
>>>> execute extra scalar code to determine that communication is
>>>> unnecessary, and the presence of this code hinders the back-end
>>>> compiler's ability to optimize the per-node computation.
>>>>
>>>> Here's an example: a given array access like A[i] may not know
>>>> whether the access is local or remote, so it will introduce
>>>> communication-related code to disambiguate. Even if that code
>>>> doesn't generate communication, it can be ugly enough to throw the
>>>> back-end C compiler off.
>>>>
>>>> Some workarounds to deal with this:
>>>>
>>>> * the 'local' block (documented in
>>>>   doc/release/technotes/README.local) -- this is a big hammer and
>>>>   likely to be replaced with more data-centric capabilities going
>>>>   forward, but it can be helpful in the meantime if you can get it
>>>>   working in a chunk of code.
>>>>
>>>> * I don't know how fully fleshed-out these features are, but there
>>>>   are at least draft capabilities for .localAccess() and
>>>>   .localSlice() methods on some array types to reduce overheads
>>>>   like the ones in my simple example above. I.e., if I know that
>>>>   A[i] is local for a given distributed array, A.localAccess(i) is
>>>>   likely to give better performance.
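[Illustration, not from the thread: a minimal Chapel sketch of the
idioms described above. The Block-distribution setup is an assumption,
and localSubdomain()/localAccess() are the draft interfaces mentioned,
so their availability may vary by Chapel version and array type.]

    use BlockDist;

    config const n = 1000;
    const Space = {1..n};
    const D = Space dmapped Block(boundingBox=Space);
    var A: [D] int;

    // Plain distributed access: each iteration's A[i] happens to be
    // local here, but the compiler may not prove that, so it can emit
    // locality checks and communication code.
    forall i in D do
      A[i] = i;

    // 'local' block: asserts that no communication occurs inside,
    // letting the back-end C compiler see plain scalar code.
    coforall loc in Locales do on loc {
      const myInds = A.localSubdomain();
      local {
        for i in myInds do
          A[i] += 1;
      }
    }

    // Draft per-access workaround: when A[i] is known to be local,
    // localAccess skips the locality disambiguation.
    coforall loc in Locales do on loc {
      forall i in A.localSubdomain() do
        A.localAccess(i) += 1;
    }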
>>>> But maybe I should start with a higher-level question: What kinds
>>>> of data structures does your code use, and what types of idioms do
>>>> you use to get multi-locale executions going? (E.g., distributed
>>>> arrays + foralls? Or more manually-distributed data structures +
>>>> task parallelism + on-clauses?)
>>>>
>>>> Thanks,
>>>>
>>>> -Brad
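[Illustration, not from the thread: a hedged sketch of the two
multi-locale idioms named in the question above; the array sizes and
contents are made-up placeholders.]

    use BlockDist;

    config const n = 100;

    // Idiom 1: distributed array + forall. Iterations execute on the
    // locale that owns each element; communication is implicit.
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A: [D] real;
    forall a in A do
      a = here.id;

    // Idiom 2: manually-distributed data + task parallelism +
    // on-clauses. One task per locale, each owning a private chunk.
    coforall loc in Locales do on loc {
      var myChunk: [1..n/numLocales] real;  // locale-private data
      for x in myChunk do
        x = here.id;
    }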
>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>
>>>>> I was aware of the ongoing progress in optimizing the comm layer,
>>>>> but I'll take a look at your docs. I'll also give the --local vs.
>>>>> --no-local experiment a try.
>>>>>
>>>>> I tested the network layer and saw that my nodes were operating
>>>>> near peak network capacity, so it wasn't a transport issue. Yes,
>>>>> the CPUs (of which there are 4 per node) were nearly fully pegged.
>>>>> Considering the level of complexity of the code, I suspect it was
>>>>> mostly overhead. I was even looking for a way to pin the execution
>>>>> to a single proc, as I wondered whether there was some kind of
>>>>> thrashing going on between procs. The funny thing was, the more I
>>>>> tried to optimize the program to do less network traffic, the
>>>>> slower it got.
>>>>>
>>>>> On Mon, May 11, 2015 at 10:21 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>
>>>>>> Hi Brian --
>>>>>>
>>>>>> I believe that setting CHPL_TARGET_ARCH to 'native' should get
>>>>>> you better results as long as you're not cross-compiling.
>>>>>> Alternatively, you can set it to 'none', which will squash the
>>>>>> warning you're getting. In any case, I wouldn't expect the lack
>>>>>> of --specialize optimizations to be the problem here (but if
>>>>>> you're passing components of --fast manually, you'd want to be
>>>>>> sure to add -O in addition to --no-checks).
>>>>>>
>>>>>> Generally speaking, Chapel programs compiled for --no-local
>>>>>> (multi-locale execution) tend to generate much worse per-node
>>>>>> code than those compiled for --local (single-locale execution),
>>>>>> and this is an area of active optimization effort. See the
>>>>>> "Performance Optimizations and Generated Code Improvements"
>>>>>> release note slides at:
>>>>>>
>>>>>> http://chapel.cray.com/download.html#releaseNotes
>>>>>> http://chapel.cray.com/releaseNotes/1.11/06-PerfGenCode.pdf
>>>>>>
>>>>>> and particularly the section entitled "the 'local field' pragma"
>>>>>> for more details on this effort (it starts at slide 34).
>>>>>>
>>>>>> In a nutshell, the Chapel compiler conservatively assumes that
>>>>>> things are remote rather than local when in doubt (to emphasize
>>>>>> correctness over fast-but-incorrect programs), and then gets into
>>>>>> doubt far more often than it should. We're currently working on
>>>>>> tightening up this gap.
>>>>>>
>>>>>> This could explain the full difference in performance that you're
>>>>>> seeing, or something else may be happening. One way to check into
>>>>>> this might be to run a --local vs. --no-local execution with
>>>>>> CHPL_COMM=none to see how much overhead is added. The fact that
>>>>>> all CPUs are pegged is a good indication that you don't have a
>>>>>> problem with load balance or with distributing data/computation
>>>>>> across nodes, I'd guess?
>>>>>>
>>>>>> -Brad
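[Illustration, not from the thread: a sketch of the overhead experiment
suggested above. The kernel is a made-up placeholder, and the Time
module's Timer interface is the 1.11-era API, which has changed in
later Chapel versions.]

    // With CHPL_COMM=none set in the environment, compile twice and
    // compare the reported times:
    //   chpl --fast --local    overhead.chpl -o overhead-local
    //   chpl --fast --no-local overhead.chpl -o overhead-nolocal
    use Time;

    config const n = 10000000;

    var A: [1..n] real;
    var t: Timer;
    t.start();
    forall i in 1..n do
      A[i] = i * 2.0;
    t.stop();
    writeln("elapsed: ", t.elapsed(), " s");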
>>>>>> On Mon, 11 May 2015, Brian Guarraci wrote:
>>>>>>
>>>>>>> I should add that I did supply --no-checks, and that helped by
>>>>>>> about 10%.
>>>>>>>
>>>>>>> On Mon, May 11, 2015 at 10:04 AM, Brian Guarraci <[email protected]> wrote:
>>>>>>>
>>>>>>>> It says:
>>>>>>>>
>>>>>>>>   warning: --specialize was set, but CHPL_TARGET_ARCH is
>>>>>>>>   'unknown'. If you want any specialization to occur please
>>>>>>>>   set CHPL_TARGET_ARCH to a proper value.
>>>>>>>>
>>>>>>>> It's unclear which target arch is appropriate.
>>>>>>>>
>>>>>>>> On Mon, May 11, 2015 at 9:55 AM, Brad Chamberlain <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Brian --
>>>>>>>>>
>>>>>>>>> Getting --fast working should definitely be the first
>>>>>>>>> priority. What about it fails to work?
>>>>>>>>>
>>>>>>>>> -Brad
>>>>>>>>>
>>>>>>>>> On Sun, 10 May 2015, Brian Guarraci wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've been testing my search index on my 16-node ARM system
>>>>>>>>>> and have been running into some strange behavior. The cool
>>>>>>>>>> part is that the locale partitioning concept seems to work
>>>>>>>>>> well; the downside is that the system is very slow. I've
>>>>>>>>>> rewritten the approach a few different ways and haven't made
>>>>>>>>>> a dent, so I wanted to ask a few questions.
>>>>>>>>>>
>>>>>>>>>> On the ARM processors, I can only use FIFO tasking and can't
>>>>>>>>>> optimize (--fast doesn't work). Is this going to
>>>>>>>>>> significantly affect cross-locale performance?
>>>>>>>>>>
>>>>>>>>>> I've looked at the generated C code and tried to minimize
>>>>>>>>>> the _comm_ operations in core methods, but it doesn't seem
>>>>>>>>>> to help. Network usage is still quite low (100K/s) while the
>>>>>>>>>> CPUs are pegged. Are there any profiling tools I can use to
>>>>>>>>>> understand what might be going on here?
>>>>>>>>>>
>>>>>>>>>> Generally, on my laptop or a single node, I can index about
>>>>>>>>>> 1.1MM records in under 10s. With 16 nodes, it takes 10min to
>>>>>>>>>> do 100k records.
>>>>>>>>>>
>>>>>>>>>> I'm wondering whether there's some systemic issue at play
>>>>>>>>>> here and how I can investigate further.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Brian
