cchen added a comment. In D79972#2069435 <https://reviews.llvm.org/D79972#2069435>, @ABataev wrote:
> In D79972#2069366 <https://reviews.llvm.org/D79972#2069366>, @cchen wrote:
>
> > In D79972#2069358 <https://reviews.llvm.org/D79972#2069358>, @ABataev wrote:
> >
> > > In D79972#2069322 <https://reviews.llvm.org/D79972#2069322>, @cchen wrote:
> > >
> > > > In D79972#2068976 <https://reviews.llvm.org/D79972#2068976>, @ABataev wrote:
> > > >
> > > > > Still: did you think about implementing it in the compiler instead of the runtime?
> > > >
> > > > I'm not sure I understand your question. Which part of the code are you asking about?
> > > > The main work the compiler needs to do is to send the {offset, count, stride} struct to the runtime.
> > >
> > > I mean, did you think about calling the `__tgt_target_data_update` function in a loop in the compiler-generated code instead of putting it into the runtime?
> >
> > Oh, I would prefer to call `__tgt_target_data_update` once in the compiler, and that is what I'm doing now.
>
> I was not quite correct. What I mean is to generate the array with the array section as a VLA in the compiler, and fill it in a loop generated by the compiler for non-contiguous sections, rather than in the runtime.
> Say we have the code:
>
>     int arr[3][3];
>     ...
>     #pragma omp target update to(arr[1:2][1:2])
>
> In this case, we're going to transfer the following elements:
>
>     000
>     0xx
>     0xx
>
> In the compiler-generated code we emit something like this:
>
>     void *bptr[<n>];
>     void *ptr[<n>];
>     int64 sizes[<n>];
>     int64 maptypes[<n>];
>     for (int i = 0; i < <n>; ++i) {
>       bptr[i] = &arr[1+i][1];
>       ptr[i] = &arr[1+i][1];
>       sizes[i] = ...;
>       maptypes[i] = ...;
>     }
>     call void @__tgt_target_data_update(i64 -1, i32 <n>, bptr, ptr, sizes, maptypes);
>
> With this solution, you won't need to modify the runtime and add a new mapping flag.

We discussed my current implementation in the bi-weekly meeting several weeks ago, and there was a general consensus that it was an acceptable approach.
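To make the trade-off in this thread concrete, here is a minimal C sketch under stated assumptions. `dim_desc`, `chunk_entries`, and `update_2d` are hypothetical names, not libomptarget APIs; the layout of `dim_desc` only loosely follows the {offset, count, stride} struct mentioned above, and `memcpy` into a same-shaped buffer stands in for the real host-to-device transfer. Note that the runtime walk still performs one copy per contiguous chunk; what the descriptor shrinks is the argument list handed from compiler-generated code to the runtime, which grows with the number of dimensions rather than the number of chunks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-dimension entry, loosely modeled on the
   {offset, count, stride} struct from the thread; the layout is an assumption. */
typedef struct {
  int64_t offset; /* first selected index in this dimension */
  int64_t count;  /* number of selected indices in this dimension */
  int64_t stride; /* bytes between consecutive indices of this dimension */
} dim_desc;

/* Number of list entries a compiler-side expansion would pass: one per
   contiguous chunk, i.e. the product of the counts of every dimension
   except the innermost (assuming the innermost selection is contiguous). */
int64_t chunk_entries(const dim_desc *d, int ndims) {
  int64_t n = 1;
  for (int k = 0; k < ndims - 1; ++k)
    n *= d[k].count;
  return n;
}

/* Hypothetical runtime-side walk of a 2-D descriptor: copies each
   contiguous row chunk from host to a same-shaped "device" buffer and
   returns the number of chunks copied. memcpy is a stand-in for the
   actual data transfer. */
int64_t update_2d(void *dev, const void *host, const dim_desc d[2],
                  int64_t elem_size) {
  /* One copy per row only works if the innermost dimension is contiguous. */
  assert(d[1].stride == elem_size);
  int64_t chunks = 0;
  for (int64_t i = 0; i < d[0].count; ++i) {
    int64_t off = (d[0].offset + i) * d[0].stride + d[1].offset * d[1].stride;
    memcpy((char *)dev + off, (const char *)host + off,
           (size_t)(d[1].count * elem_size));
    ++chunks;
  }
  return chunks;
}
```

For `arr[1:2][1:2]` on `int arr[3][3]`, the descriptor is `{{1, 2, 3 * sizeof(int)}, {1, 2, sizeof(int)}}` and the walk copies two row chunks, matching the `0xx` pattern above. For `a[0:N][0:1]` on `int a[N][2]` with N = 10000, `chunk_entries` reports 10000 entries for the expanded list, while the descriptor itself stays at 2 entries.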
The major advantage of sending a descriptor to the runtime can be illustrated with the following example:

    #define N 10000
    int a[N][2];
    …
    #pragma omp target update to(a[0:N][0:1])

With compiler-side splitting, this would require passing O(N) entries, or 10000 entries, to the `__tgt_target_data_update` call. The current implementation only requires a descriptor with 2 entries.

I think this could be a real concern: splitting out the transfers in compiler-generated code results in a list containing one entry per non-contiguous chunk (easily hitting scaling issues), while the descriptor approach is bounded by the number of dimensions. That seems like a pretty compelling reason to use the descriptor, since it's much more space-efficient.

Also, the descriptor idea is very similar to how Cray has supported Fortran dope vectors for years: we send in a pointer to a dope vector rather than a pointer to the data, along with a flag indicating that it's a dope vector, and the runtime library handles it as a dope vector. I don't think the runtime library changes will be very extensive or difficult at all, and we're very willing to implement the runtime support for non-contiguous updates.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D79972

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits