Hi Quincey, [got the 'e' this time]
The problem cannot be reproduced in a C example program because the
problem is upstream misuse of vectors with a -= operation. The failure in
this area did not show up except in this MPI application, even though the
same library and code are used in the single-process application.
...............
scores[qbChemicalShiftTask->getNMRAtomIndices()[index]]-=qbChemicalShiftTask->getNMRTrace()[index];
.............
where scores and the index containers are std::vector<double> and std::vector<int> typedefs.
In certain runs the indices were incorrect, so this code was badly
constructed; it needed to be rewritten to keep the indices aligned with
the earlier construction of the scores vector and to add better checking.
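For illustration, a minimal sketch of the kind of bounds checking that was added; the function and parameter names are hypothetical stand-ins for the real qbChemicalShiftTask accessors:
...............
#include <cassert>
#include <vector>

// Hypothetical helper; the real accessors live on qbChemicalShiftTask.
void subtractTrace(std::vector<double>& scores,
                   const std::vector<int>& atomIndices,   // getNMRAtomIndices()
                   const std::vector<double>& trace)      // getNMRTrace()
{
    assert(atomIndices.size() == trace.size());
    for (std::size_t index = 0; index < atomIndices.size(); ++index) {
        // at() throws std::out_of_range instead of silently corrupting
        // neighboring heap blocks (such as HDF5's H5SL skip list nodes).
        scores.at(atomIndices[index]) -= trace[index];
    }
}
...............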
These vectors were not used as input to any HDF5 interfaces; HDF5 looks
clean and stable in the file closing. The problem was entirely in non-HDF5
code, but it resulted in stepping on the H5SL skip list in my build
and run of the integrated system.
Thank you for being ready to look into it if it could be duplicated and
shown to be in the H5SL remove area. The problem wasn't in HDF5 code.
On 12/08/2010 10:12 AM, Roger Martin wrote:
Hi Quincy,
I'll be pulling pieces out of the large C++ project into a small test
C program to see if the segfault can be duplicated in a manageable
example, and if that works, I will send it to you.
MemoryScape and gdb (NetBeans) don't show any memory issues in our
library code or HDF5. MemoryScape doesn't expand through the
H5SL_REMOVE macro, so in another working copy I'm trying to treat a
copy of it as a function.
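As a generic illustration of that technique (this is not the real H5SL_REMOVE body, which is much larger), copying a macro's logic into an ordinary function lets the debugger set breakpoints and inspect locals inside it:
...............
// Hypothetical macro: the debugger treats the whole expansion as one line.
#define CLAMP_TO_ZERO(v) ((v) < 0 ? 0 : (v))

// The same logic as a function: MemoryScape/gdb can step into it,
// show the parameter, and break inside it.
static int clamp_to_zero(int v)
{
    return v < 0 ? 0 : v;
}
...............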
On 12/07/2010 05:20 PM, Quincey Koziol wrote:
Hi Roger,
On Dec 7, 2010, at 2:06 PM, Roger Martin wrote:
Further:
Debugging with MemoryScape:
Reveals a segfault in H5SL.c (1.8.5) at line 1068
...1068....
H5SL_REMOVE(SCALAR, slist, x, const haddr_t, key, -)
//H5SL_TYPE_HADDR case
....
The stack trace is:
H5SL_remove 1068
H5C_flush_single_entry 7993
H5C_flush_cache 1395
H5AC_flush 941
H5F_flush 1673
H5F_dest 996
H5F_try_close 1900
H5F_close 1750
H5I_dec_ref 1490
H5F_close 1951
I'll be adding printouts to see what variable/pointer is causing
the segfault. The MemoryScape Frame shows:
..............
Stack Frame
Function "H5SL_remove":
slist: 0x0b790fc0 (Allocated) -> (H5SL_t)
key: 0x0b9853f8 (Allocated Interior) ->
0x000000000001affc (110588)
Block "$b8":
_last: 0x0b772270 (Allocated) -> (H5SL_node_t)
_llast: 0x0001affc -> (H5SL_node_t)
_next: 0x0b9855c0 (Allocated) -> (H5SL_node_t)
_drop: 0x0b772270 (Allocated) -> (H5SL_node_t)
_ldrop: 0x0b772270 (Allocated) -> (H5SL_node_t)
_count: 0x00000000 (0)
_i:<Bad address: 0x00000000>
Local variables:
x:<Bad address: 0x00000000>
hashval:<Bad address: 0x00000000>
ret_value:<Bad address: 0x00000000>
FUNC: "H5SL_remove"
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
................
There are bad addresses on some of the variables, such as x, which was
set by "x = slist->header;" from the skip list.
These appear to be internal API functions, and I'm wondering how I
could be offending them from high-level API calls and file
interfaces. What could be in the H5C cache when
H5Fget_obj_count(fileID, H5F_OBJ_ALL) = 1
and H5Fget_obj_count(fileID, H5F_OBJ_DATASET | H5F_OBJ_GROUP |
H5F_OBJ_DATATYPE | H5F_OBJ_ATTR) = 0
for the file the code is trying to close?
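For reference, a minimal sketch of the check being described above, assuming fileID is a valid hid_t returned by H5Fcreate:
...............
#include <cstdio>
#include "hdf5.h"

// Report what is still open on a file before H5Fclose is attempted.
void reportOpenObjects(hid_t fileID)
{
    ssize_t all     = H5Fget_obj_count(fileID, H5F_OBJ_ALL);
    ssize_t nonFile = H5Fget_obj_count(fileID, H5F_OBJ_DATASET | H5F_OBJ_GROUP |
                                               H5F_OBJ_DATATYPE | H5F_OBJ_ATTR);
    // Expected in this situation: all == 1 (the file itself) and nonFile == 0,
    // so the close should only need to flush the metadata cache.
    std::printf("open objects: all=%ld, non-file=%ld\n", (long)all, (long)nonFile);
}
...............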
Yes, you are correct, that shouldn't happen. :-/ Do you have a
simple C program you can send to show this failure?
Quincey
On 12/03/2010 11:33 AM, Roger Martin wrote:
Hi,
Using HDF5 1.8.5 and 1.8.6-pre2; OpenMPI 1.4.3 on Linux RHEL4 and RHEL5.
In a case where the HDF5 operations aren't using MPI but build an
.h5 file exclusive to each individual MPI job/process:
The create:
currentFileID = H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC,
H5P_DEFAULT, H5P_DEFAULT);
and many file operations using the high-level (HL) APIs, including packet
tables, tables, and datasets, perform successfully.
Then, near the end of each individual process,
H5Fclose(currentFileID);
is called but doesn't return. A check for open objects says only the
one file object is open and no other objects (group, dataset, etc.).
No other software or process is acting on this .h5 file; it is named
exclusively for the one job it is associated with.
This isn't an attempt at parallel HDF5 in MPI. In another scenario,
parallel HDF5 is working the collective way just fine. This
current issue is for people who don't have or don't want a parallel file
system, so I made a coarse-grained MPI setup to run independent jobs for
these folks. Each job has its own .h5 file opened with
H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
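A minimal sketch of that coarse-grained pattern; the rank-based file name here is an assumption standing in for however filePath is actually built:
...............
#include <string>
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each rank writes its own, exclusively named file; no MPI-IO/parallel
    // HDF5 driver is involved, only the default serial file driver.
    std::string filePath = "job_" + std::to_string(rank) + ".h5";
    hid_t currentFileID = H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC,
                                    H5P_DEFAULT, H5P_DEFAULT);

    // ... high-level writes (packet tables, tables, datasets) go here ...

    H5Fclose(currentFileID);   // the call that fails to return in the report
    MPI_Finalize();
    return 0;
}
...............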
Where should I look?
I'll try to make a small example test case for show and tell.
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org