Basically what is needed is a convention, such as an attribute, that
identifies the permutation order in which a dataset is stored...
As they say in
https://www.hdfgroup.org/HDF5/doc/fortran/index.html
"When a C application reads data stored from a Fortran program, the data
will appear to be transposed due to the difference in the C and Fortran
storage orders. For example, if Fortran writes a 4x6 two-dimensional
dataset to the file, a C program will read it as a 6x4 two-dimensional
dataset into memory. The HDF5 C utilities h5dump and h5ls will also
display transposed data, if data is written from a Fortran program."
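The transposition the documentation describes can be sketched with NumPy (used here purely to illustrate the two storage orders; this is not HDF5 code):

```python
import numpy as np

# A 4x6 "Fortran" dataset: column-major bytes on disk.
a_f = np.asfortranarray(np.arange(24).reshape(4, 6))

# A C reader interprets the same byte stream as a row-major buffer.
raw = a_f.tobytes(order='A')                             # bytes as stored
a_c = np.frombuffer(raw, dtype=a_f.dtype).reshape(6, 4)  # C reader sees 6x4

# The C view is exactly the transpose of the Fortran writer's array.
assert np.array_equal(a_c, a_f.T)
```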
But there is no way to find out whether data was stored by a C or a
Fortran program. A simple agreement on an attribute would do; even
better would be shared dataspaces that can hold such an attribute.
All the index permutation or data transposition (if really required)
could live in an add-on library on top of HDF5 (similar to what F5
does, though F5 does more than just that).
Werner
On 09.06.2015 11:00, Jason Newton wrote:
I was hoping more commentary would have happened, but I also had some
timing issues getting back to this; my apologies.
Werner, thank you for your reply, but your case is exactly the proof
that this is an issue that should be dealt with at the specification
and library level, as I've been saying. Permuting indices on every
data access is a large burden to put on user code, especially
considering how many different bindings one might use to access the
data. It leads to repetitive, intrusive handling that the user should
not have to deal with. It is tricky, automatable, isolatable (to the
library), difficult outside of C (at least in Python), and not the
kind of task users should be spending their time on when using
advanced software like HDF5.
If we look at the examples of Eigen and NumPy, we can see they have
flags for dealing with column/row-major storage [
http://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html ]
and C/Fortran order [ see the order argument:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html &
http://docs.scipy.org/doc/numpy/reference/c-api.array.html ]. This
shows that at least some numerical processing code deemed the issue
important enough not only to deal with it, but usually to provide
seamless usage or conversion to the user's desired layout.
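NumPy's order flag mentioned above, for instance, lets the same logical array live in either layout and converts between them seamlessly (a small sketch):

```python
import numpy as np

a_c = np.array([[1, 2, 3], [4, 5, 6]], order='C')  # row-major
a_f = np.array(a_c, order='F')                      # column-major copy

# Logically identical arrays, physically different byte layouts:
assert np.array_equal(a_c, a_f)
assert a_c.flags['C_CONTIGUOUS'] and a_f.flags['F_CONTIGUOUS']
assert a_c.tobytes(order='A') != a_f.tobytes(order='A')
```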
I think defaults can be set so that current behaviour does not change,
but datasets and arrays could now be marked with a flag such as
NumPy's. When reading or writing, an optional flag would state the
memory space's requested interpretation (defaulting to C or Fortran by
language context). We could potentially put this in the dataset
properties and type properties so we wouldn't have to change the API.
Ideally, with the permutation handled in C, the library would permute
the storage for you as it performs the I/O, at negligible cost, since
I/O is likely the limiting factor.
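A rough sketch of what such a read path might do, with hypothetical names (this is not HDF5 API; NumPy stands in for the C-level permutation that would happen while the data is in flight):

```python
import numpy as np

def read_with_order(stored: np.ndarray, stored_order: str,
                    requested_order: str) -> np.ndarray:
    """Hypothetical read path: honor the dataset's stored-order flag and
    the caller's requested memory order, permuting only when they differ."""
    if stored_order == requested_order:
        return stored  # layouts already agree; no work needed
    # One transposing pass as the data comes in: a copy in the
    # requested layout, logical content unchanged.
    return np.asarray(stored, order=requested_order)

stored = np.asfortranarray(np.arange(20).reshape(4, 5))  # flagged 'F'
out = read_with_order(stored, 'F', 'C')                  # caller wants C
assert np.array_equal(out, stored) and out.flags['C_CONTIGUOUS']
```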
I brought this up because I'm writing a generalized HDF5 C++ library,
and when trying to support something like Eigen (and more!), which
allows both C and F orders in the same runtime, it gets confusing how
to do I/O to and from HDF5 files: the current approach relies on
language-level wrappers to decide what the right thing to do is, and
weakly at that. But the user may genuinely want to read or write a
Fortran- or C-ordered dataset/array to or from a C/Fortran
dataset/array in any combination that makes sense to them, and this
doesn't really work. I can be left with baffling scenarios like the
following failing unless all data written to HDF5 files is in C order:
    Eigen::Matrix<double, 4, 5, RowMajor> A_c;
    A_c.setZero();
    A_c.row(1).setConstant(5);
    Eigen::Matrix<double, 4, 5, ColMajor> A_f;
    hdf.write("A", A_c);
    hdf.read("A", A_f);
    assert(A_c == A_f);
If in this scenario "A" had already been written by a Fortran program,
then the code that makes the above test case work would apply a
conversion where none is needed, making this test case's assertion
fail:
    Eigen::Matrix<double, 4, 5, RowMajor> A_c;
    A_c.setZero();
    A_c.row(1).setConstant(5);
    Eigen::Matrix<double, 4, 5, ColMajor> A_f;
    hdf.read("A", A_f);
    assert(A_c == A_f);
And that's why flags need to be saved in the file... the content needs
to specify its storage layout. Guessing based on language cannot cover
all cases, and user-made attributes are not the way, because that
would be a standard nobody knows about or will use.
-Jason
On Tue, May 12, 2015 at 12:16 AM, Werner Benger <[email protected]> wrote:
Hi Jason,
I was facing the same issues, as pretty much all use cases I know of
and have in my visualization software use and require "fortran" order
of indexing, including OpenGL graphics. It's not really an issue with
HDF5, as the only thing required is to permute the indices when
accessing the HDF5 API. The HDF5 tools will of course then display the
data transposed. This index permutation is supported in the F5 library
via a generic permutation vector that is stored with a group of
datasets sharing the same properties (the F5 library is a C library on
top of HDF5 guiding towards a specific data model for various classes
of data types occurring particularly in scientific visualization):
http://www.fiberbundle.net/doc/structChartDomain__IDs.html
So via the F5 API one would see the Fortran-like indexing convention,
whereas when accessing data with the lower-level HDF5 API it is the
C-like convention (whereby the permutation vector gives the option of
arbitrary permutations).
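A generic permutation vector of the kind F5 stores can be applied with a plain axis permutation; a minimal NumPy sketch (the vector (1, 0) recovers ordinary C/Fortran transposition, while higher ranks allow arbitrary reorderings):

```python
import numpy as np

def apply_permutation(data: np.ndarray, perm: tuple) -> np.ndarray:
    """View a dataset through a stored index-permutation vector."""
    return np.transpose(data, perm)

data = np.arange(24).reshape(2, 3, 4)  # as the raw HDF5 API reports it
perm = (2, 1, 0)                       # permutation stored with the group
view = apply_permutation(data, perm)

assert view.shape == (4, 3, 2)
assert view[3, 2, 1] == data[1, 2, 3]  # indices permuted, same element
```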
I remember there had been plans by the HDF Group to introduce "named
dataspaces", similar to "named datatypes", that could then be stored
in the file as their own entity. These would be a good place to store
properties of a dataspace as attributes, and to have such properties
shared among datasets. They would be a natural place to store a
permutation vector, which could also be reduced to a simple flag that
just distinguishes the C and Fortran indexing conventions. Of course,
all the related tools would then also need to honor such an attribute.
Until then, one could use an attribute on each dataset and implement
index permutation similar to how the F5 library does it. It may be
safer to use new API functions anyway, so as not to break old code
that always expects C-order indexing.
Werner
On 12.05.2015 06:48, Jason Newton wrote:
Hi -
I've been an evangelist for HDF5 for a few years now; it is a noble
and amazing library that solves data storage issues in scientific and
other applications - e.g. it can save many developers time and money
so they can spend it on solving more original problems. But you knew
that already. I think there's been a mistake, though: the lack of
first-class column- vs row-major storage. In a world where we are
split down the middle on which layout we use, depending on the
application, library, and language we work in, it is an ongoing
reality that there will never be one true standard to follow. HDF5
sought to support only row-major, and I can back that up -
standardizing is a good thing. But as time has shown, that really
didn't work for a lot of folks, such as those in Matlab and Fortran:
when they read our data, it looks transposed to them! When the HDF5
utilities or our code see their data, it looks transposed to us!
These are arguably the users you least want to face these
difficulties, as it is downright embarrassing at times and hard to
work around within those languages (ahem, Matlab again is painful to
work with). Not only that, but it doesn't really scale: it will
always take some manual fixing, and there is no standardized mark for
whether a dataset is one of these column-major-masquerading datasets.
So let me assure you this is quite ugly to deal with in Matlab and
elsewhere, and it doesn't seem to be a path many people take - it can
require skills or understanding that many people don't have.
But then, why did we allow saving column-major data in a row-major
standard in the first place? Well, the answer seems to be performance.
Surely it can't take that long to convert the datasets - most of the
time, at least - although there would certainly be some memory-based
limitations on transposing just as HDF5 performs the I/O. But alas,
the current state of the library indicates otherwise, and thus it is
the user's job to correctly transform the data back and forth between
applications and parties. But wait - wasn't this kind of activity
exactly what HDF5 was built to alleviate in the first place?
So how do we rectify the situation? Speaking as a developer using
HDF5 extensively and writing libraries for it, it looks to me like
this should be in the core library, as it is exceedingly messy to
handle on the user side each time. I think the interpretation of the
dataset and its dimensions should be based on dataset creation
properties. This would allow an official marking of what kind of
interpretation the raw storage of the data (and dimensions?) has.
However, this is only half of the battle. We'd need something like
the type conversion system to permute the order in all the right
places when the user needs to do I/O in an opposing storage layout.
And it should be fast and light on memory - perhaps it would merely
operate in place, as a new utility subroutine taking the mem_type and
the user's memory. However, I can still think of one problem this
does not address: compound types using a mixture of philosophies,
with fields laid out opposite to the dataset layout. That case has me
completely stumped, as it indicates the flag should exist at the type
level as well. The compound part is a sticky situation, but I'd still
motion that the dataset creation property works for most things that
occur in practice.
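The in-place utility subroutine proposed above might look something like this (a hypothetical sketch only, with a NumPy array standing in for the mem_type/user-memory pair; a real version would live in HDF5's type-conversion machinery):

```python
import numpy as np

def h5_permute_order(buf: np.ndarray) -> None:
    """Hypothetical helper: flip a buffer's byte layout between C and
    Fortran order in place, leaving the logical element values intact
    for a consumer that indexes in the opposite convention."""
    target = 'F' if buf.flags['C_CONTIGUOUS'] else 'C'
    permuted = np.asarray(buf, order=target)  # one transposing pass
    # Overwrite the caller's memory with the permuted byte layout.
    buf.ravel(order='K')[:] = permuted.ravel(order='K')

data = np.arange(6, dtype=np.int64).reshape(2, 3)   # C-ordered user buffer
expected = np.asfortranarray(data).tobytes(order='A')
h5_permute_order(data)
assert data.tobytes(order='A') == expected  # bytes now in Fortran layout
```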
So... has the HDF5 group tried to deal with this wart yet? Let
me know if anything is on the drawing board.
-Jason
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected] <mailto:[email protected]>
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809   Fax.: +1 225 578-5362