On Apr 25, 2012, at 13:59 , Alex Margolin wrote:

> I guess you are right.
>
> I started looking into the communication passing between processes, and I
> may have found a problem with the way I handle the "reserved" data requested
> at prepare_src(). I've tried to write pretty much the same code as the TCP
> BTL (the relevant part is around
> "if(opal_convertor_need_buffers(convertor))"), and when I copy the buffered
> data to (frag+1) the program works. When I try to optimize the code by
> letting the segment point to the original location, I get MPI_ERR_TRUNCATE.
> I've printed out the data sent and received while running osu_latency ("[]"
> for sent, "<>" for received); the output is appended below.
>
> Question is: where is the code that is responsible for writing the reserved
> data?
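For reference, the TCP BTL pattern Alex is mirroring looks roughly like the
sketch below. This is a condensed, hedged reading of the 2012-era
mca_btl_tcp_prepare_src(), not a verbatim copy; the frag layout and field
names follow the TCP BTL sources. The key detail is that segments[0] always
starts at (frag + 1) and covers the "reserve" bytes, even in the zero-copy
branch, where the user payload is described as a second segment rather than
replacing the first:

    /* Reserve headroom for the upper-layer (PML) headers at the start of
     * the fragment's buffer, which begins right after the frag struct. */
    frag->segments[0].seg_addr.pval = (frag + 1);
    frag->segments[0].seg_len = reserve;
    frag->base.des_src_cnt = 1;

    if (opal_convertor_need_buffers(convertor)) {
        /* Non-contiguous data: pack it into the fragment immediately
         * after the reserved bytes. */
        if (max_data + reserve > frag->size)
            max_data = frag->size - reserve;
        iov.iov_len = max_data;
        iov.iov_base = (IOVBASE_TYPE*)
            ((unsigned char*)frag->segments[0].seg_addr.pval + reserve);
        rc = opal_convertor_pack(convertor, &iov, &iov_count, &max_data);
        frag->segments[0].seg_len += max_data;  /* reserve + payload */
    } else {
        /* Contiguous data: let the convertor return a pointer into the
         * user buffer, and expose it as a *second* segment. Segment 0
         * still carries only the reserved header space. */
        iov.iov_len = max_data;
        iov.iov_base = NULL;
        rc = opal_convertor_pack(convertor, &iov, &iov_count, &max_data);
        frag->segments[1].seg_addr.pval = iov.iov_base;
        frag->segments[1].seg_len = max_data;
        frag->base.des_src_cnt = 2;
    }
    frag->base.des_src = frag->segments;
    *size = max_data;

If the zero-copy path instead returns a single segment pointing directly at
the user buffer, there is no writable headroom in front of the payload for
the PML headers, which is consistent with the truncated frames shown below.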
It is the PML headers. Based on the error you reported, OMPI is complaining
about truncated data on an MPI_Barrier … that's quite bad, as the barrier is
one of the few operations that do not manipulate any data. I guess the PML
headers are not located at the expected displacement in the fragment, so the
PML is using wrong values.

  george.

> Thanks,
> Alex
>
>
> Always assume opal_convertor_need_buffers - works (97 is the application
> data, preceded by 14 reserved bytes):
>
> ...
> [65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,]
> <65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
> <65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
> <65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
> <65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
> ...
>
> Detect when not opal_convertor_need_buffers - fails:
>
> ...
> [65,0,0,0,0,0,0,0,1,0,0,0,-15,85,]
> <65,0,0,0,0,0,0,0,1,0,0,0,-15,85,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,-15,85,]
> <65,0,0,0,1,0,0,0,1,0,0,0,-15,85,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,-14,85,]
> <65,0,0,0,0,0,0,0,1,0,0,0,-14,85,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,-14,85,]
> <65,0,0,0,1,0,0,0,1,0,0,0,-14,85,97,>
> [65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,]
> 1                       453.26
> [65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,]
> <65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,97,>
> <65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,97,>
> [singularity:13509] *** An error occurred in MPI_Barrier
> [singularity:13509] *** reported by process [2239889409,140733193388033]
> [singularity:13509] *** on communicator MPI_COMM_WORLD
> [singularity:13509] *** MPI_ERR_TRUNCATE: message truncated
> [singularity:13509] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [singularity:13509] *** and potentially your MPI job)
> [singularity:13507] 1 more process has sent help message help-mpi-errors.txt
> / mpi_errors_are_fatal
> [singularity:13507] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> alex@singularity:~/huji/benchmarks/mpi/osu-micro-benchmarks-3.5.2$
>
> On 04/25/2012 04:35 PM, George Bosilca wrote:
>> Alex,
>>
>> You got the banner of the FT benchmark, so I guess at least rank 0
>> successfully completed the MPI_Init call. This is a hint that you should
>> investigate more into the point-to-point logic of your mosix BTL.
>>
>>   george.
>>
>> On Apr 25, 2012, at 09:30 , Alex Margolin wrote:
>>
>>> NAS Parallel Benchmarks 3.3 -- FT Benchmark
>>>
>>> No input file inputft.data. Using compiled defaults
>>> Size                : 64x 64x 64
>>> Iterations          : 6
>>> Number of processes : 4
>>> Processor array     : 1x 4
>>> Layout type         : 1D
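As for where those reserved bytes get written: in the ob1 PML, the send path
casts the first source segment of the descriptor returned by the BTL's
alloc()/prepare_src() to its header type and fills it in place. A hedged
sketch of that pattern (abridged; field names follow the 2012-era ob1
sources, e.g. ompi/mca/pml/ob1/pml_ob1_sendreq.c):

    /* The PML writes its match header into the reserve space at the
     * start of the first source segment of the BTL descriptor. */
    mca_pml_ob1_hdr_t *hdr = (mca_pml_ob1_hdr_t*)des->des_src->seg_addr.pval;
    hdr->hdr_common.hdr_flags = 0;
    hdr->hdr_common.hdr_type  = MCA_PML_OB1_HDR_TYPE_MATCH;
    hdr->hdr_match.hdr_ctx    = sendreq->req_send.req_base.req_comm->c_contextid;
    hdr->hdr_match.hdr_src    = sendreq->req_send.req_base.req_comm->c_my_rank;
    hdr->hdr_match.hdr_tag    = sendreq->req_send.req_base.req_tag;
    hdr->hdr_match.hdr_seq    = sendreq->req_send.req_base.req_sequence;

If seg_addr.pval of the first segment does not point at writable space at the
expected displacement, the receiver decodes garbage as context/tag/length,
which can surface as MPI_ERR_TRUNCATE even on an operation that carries no
user data at all, such as the MPI_Barrier above.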