So I think the distinction here is between "participating" for synchronization purposes and actually holding a nonzero slice of the data locally. I believe (correct me if I'm wrong) that all ranks have to call H5Dwrite() even if they have called H5Sselect_none() on the filespace. That causes them to send metadata describing their zero-sized contributions to shared chunks to the rank 0 coordinator. They won't be chosen as the new owner of any chunk, but their metadata is still included in the chunk_entry list that rank 0 sends to the new owner, which means they are expected to send their (zero-sized) chunk data to the new owner. The crash happens when these zero-sized chunks are decoded by the filter plugin: even if I stop the plugin itself from crashing, it has to return a size of 0 to H5Z_pipeline(), which interprets that as filter failure and errors out at H5Z.c line 1256.
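For context, here is a minimal sketch of the H5Z_func_t contract that creates the bind (my_filter is a hypothetical placeholder, not the actual plugin): success is reported by returning the number of valid output bytes, so a chunk whose decoded size is genuinely zero cannot be distinguished from a filter failure by H5Z_pipeline().

#include "hdf5.h"

/* Hypothetical filter callback illustrating the H5Z_func_t contract.
 * A real plugin would register something like this via H5Z_class2_t
 * and H5Zregister(). */
static size_t
my_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
          size_t nbytes, size_t *buf_size, void **buf)
{
    (void)cd_nelmts;
    (void)cd_values;
    (void)buf_size;
    (void)buf;

    if (nbytes == 0) {
        /* Nothing to process, but 0 is also the failure code, so there is
         * no clean way to report "success, zero bytes" from here. */
        return 0;
    }

    if (flags & H5Z_FLAG_REVERSE) {
        /* Decompress *buf (reallocating if needed), update *buf_size,
         * and return the decompressed size. */
    } else {
        /* Compress *buf, update *buf_size, and return the compressed size. */
    }

    return nbytes; /* placeholder; a real filter returns the new data size */
}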
That's something I can probably work around, but before I go too far down that road, I'd love it if you could correct any misapprehensions in the above. Is it the case that all ranks have to call H5Dwrite()? And is there a way to know what the uncompressed data size will be, so that the zero-sized chunk_entry units can be skipped somewhere higher up the stack? (A sketch of the collective-write pattern I'm describing follows the quoted reply below.)

On Thu, Nov 9, 2017 at 12:21 PM, Jordan Henderson <jhender...@hdfgroup.org> wrote:

> In the H5D__link_chunk_filtered_collective_io() function, all ranks (after some initialization work) should first hit H5D__construct_filtered_io_info_list(). Inside that function, at line 2741, each rank counts the number of chunks it has selected. Only if a rank has any selected should it then proceed with building its local list of chunks. At that point, all the ranks which aren't participating should skip this and wait for the other ranks to get done before everyone participates in the chunk redistribution. Then the non-participating ranks shouldn't have any chunks assigned to them, since they could not be considered among the crowd of ranks writing the most to any of the chunks. They should then return from the function back to H5D__link_chunk_filtered_collective_io(), with chunk_list_num_entries telling them that they have no chunks to work on. At that point they should skip the loop at 1471-1474 and wait for the others.
>
> The only case I can currently imagine where the chunk redistribution could get confused would be where no one at all is writing to anything. Multi-chunk I/O specifically handles this, but I'm not sure whether Link-chunk I/O handles that case as well as Multi-chunk does.
>
> This all assumes, of course, that I understand what you mean by the zero-sized chunks, which I believe I do, given that your file space for the chunks is positive in size.
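For reference, here is a minimal sketch of the collective-write pattern I'm describing; the handles (dset_id, memspace, filespace, dxpl_id) and the have_data flag are stand-ins for things set up elsewhere, not the actual application code:

#include "hdf5.h"

/* Every rank makes the (collective) H5Dwrite() call; ranks with nothing
 * to contribute first empty their selections with H5Sselect_none(). */
static herr_t
write_my_part(hid_t dset_id, hid_t memspace, hid_t filespace, hid_t dxpl_id,
              const double *buf, int have_data)
{
    if (!have_data) {
        /* Zero-sized contribution: select nothing in both dataspaces. */
        if (H5Sselect_none(memspace) < 0 || H5Sselect_none(filespace) < 0)
            return -1;
    }

    /* dxpl_id is assumed to carry collective transfer mode, set via
     * H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE). */
    return H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, memspace, filespace,
                    dxpl_id, buf);
}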