Re: [Pytables-users] [hdf-forum] Fwd: Re: Reading Fortran arrays with correct array indexing

Francesc Alted Mon, 02 Jun 2008 02:49:32 -0700

Elena,

Thanks for your detailed explanation.  After reading it, my opinion is
that it is not completely true that HDF5 implements the C/Fortran
ordering meta-information (even in a more abstract way, as you said),
because, as things stand, there is always an ambiguity about how to
interpret the ordering of the data on disk.


I think the key point of your explanation can be summarized in the
following sentence:

"""
Therefore HDF5 Fortran library instructs C library to store  
K,M,N values in the dataspace object header instead of N,M,K, since N  
is the size of the fastest changing dimension.
"""

So, what HDF5 actually ensures is consistency between the order of the
dimensions in the dataspace and the *fastest changing dimension*
ordering in memory of the user's datasets, but not the absolute
*C/Fortran* ordering.  This is what leads to the reported ambiguity in
the dimension ordering of datasets when you try to read an HDF5 file
that was written in Fortran from a C-based program (or vice versa).
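
To make the ambiguity concrete, here is a minimal sketch in plain Python.
It is not HDF5 code: the helper names fortran_offset and c_offset and the
sizes N, M, K are made up for illustration.  It only checks that element
A(i,j,k) of a column-major array of shape (N,M,K) sits at the same flat
offset as element B[k][j][i] of a row-major array of shape (K,M,N), which
is why storing the reversed dimensions describes the same bytes.

```python
from itertools import product


def fortran_offset(index, shape):
    """Flat offset in column-major (Fortran) order: first index varies fastest."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset


def c_offset(index, shape):
    """Flat offset in row-major (C) order: last index varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


# Illustrative sizes only (assumption, not taken from any real file).
N, M, K = 2, 3, 4
fortran_shape = (N, M, K)          # what the Fortran program declares
stored_dims = fortran_shape[::-1]  # (K, M, N): what ends up in the dataspace

# A(i, j, k) in Fortran and B[k][j][i] in C address the same element.
same_layout = all(
    fortran_offset((i, j, k), fortran_shape) == c_offset((k, j, i), stored_dims)
    for i, j, k in product(range(N), range(M), range(K))
)
print(stored_dims, same_layout)
```

Nothing in the stored dimensions says whether the writer thought of the
array as (N,M,K) or as (K,M,N); a reader can only guess from knowing which
language produced the file.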

At first sight, I would have preferred that, given that HDF5 has a C
ordering convention, HDF5 itself transposed the *data* to be saved when
someone uses the Fortran wrappers (instead of just "transposing" the
*dimension ordering*), so that the interpretation of both data and
dimension ordering would be completely unambiguous.  However, I guess
that you have chosen not to do that in order not to penalize the
performance of Fortran users (transposing the data is quite a costly
operation).  In a way, you have sacrificed data portability between
C and Fortran users for the sake of performance, and I agree that this
is a sensible approach for a library that aims to be as efficient as
HDF5.

Having said that, and although HDF5 already does terrific work in terms
of cross-platform data portability by supporting metadata for
platform-independent data types (including endianness), failing to
support specific metadata about C/Fortran ordering is, IMHO, a serious
design fault in terms of portability.  It could easily be solved by
adding C/Fortran metadata, so that users can identify the original
*intended* data ordering and have a chance to interpret it correctly.
That way, they would be able to choose whether to transpose the *data*
at load time, in order to deal with it efficiently in memory, or just
add some metainfo to their data containers (NumPy, for example,
supports this) stating that the in-memory ordering is different from
the native one for the reading platform.
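
As a rough illustration of those two choices on the reading side, here is
a small NumPy sketch.  It assumes the reader already knows, by some
out-of-band means, that the data was written in Fortran order; the array
as_read is a stand-in built with arange, not something loaded from a real
HDF5 file.

```python
import numpy as np

# Illustrative sizes only (assumption).
N, M, K = 2, 3, 4

# Stand-in for what a C-based reader gets for a Fortran-written A(N, M, K):
# a C-ordered array with the reversed shape (K, M, N).
as_read = np.arange(N * M * K, dtype=np.float64).reshape(K, M, N)

# Choice 1: only relabel.  The transpose is a view over the same buffer;
# NumPy records the Fortran ordering in the array's strides and flags.
view = as_read.T
print(view.shape, view.flags.f_contiguous, np.shares_memory(view, as_read))

# Choice 2: physically reorder the data into the reader's native C layout.
copied = np.ascontiguousarray(view)
print(copied.flags.c_contiguous, np.shares_memory(copied, as_read))
```

The view costs nothing at load time but leaves the data in non-native
order in memory, while the copy pays the transposition cost once.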

Moreover, providing this C/Fortran ordering metadata would be completely
backward compatible, so my vote is +1 for HDF5 supporting it in the
future.

Thanks,
  Francesc

On Sunday 01 June 2008, Elena Pourmal wrote:
> Hi Francesc and All,
>
> If you only knew how many times this question has been asked and how
> many attempts have been made to explain :-)  Here is another one. It is
> a little bit lengthy, please forgive me :-)
> But I hope it will shed some light on why HDF5 doesn't support
> meta-information for Fortran/C ordering in datasets (the short answer
> is: it actually does, but in a more abstract way).
>
>
> HDF5 is a "self-describing" format, which means that HDF5 metadata
> stored in a dataset object header allows the HDF5 C library, and any
> non-C applications built on top of it, to retrieve raw data
> (i.e. elements of a multidimensional array) in the correct order.
>
> (Let's for a second forget about HDF5, C and Fortran, Python and
> Matlab :-) )
>
> If we have a matrix A(N,M,K), we usually count dimensions from left
> to right saying that the first dimension has size N, the second
> dimension has size M, the third dimension has size K, and so on.
>
> (Now let's talk about HDF5 but without referring to any language.)
>
> When we describe a matrix using an HDF5 dataspace object, we use the
> same convention (i.e. specifying dimensions from left to right): the
> first dimension has size N, the second dimension has size M, the
> third dimension has size K. (Aside: please notice that this
> description is valid for both C and Fortran HDF5 applications, i.e. the
> C and Fortran dims array needed by H5Screate_simple
> (h5screate_simple_f) will have the values dims[] = {N,M,K}).
>
> The question is: how does HDF5 know how to interpret a blob of {N x
> M x K x sizeof(datatype)} bytes of dataset raw data stored in the
> file? Was A(N,M,K) stored? Or was A(K,N,M) stored? Or any other
> permutation of (K,N,M)?
>
> An HDF5 file has no clue about matrices and their dimensions, or the
> languages they were written from. It is the application's
> responsibility to interpret the data correctly and pass the correct
> interpretation to the HDF5 C library to store in a file.
>
> As mentioned above, the dimensions of the matrix are described
> using an HDF5 dataspace object and are stored in the file.  d integers
> P1, ..., Pd, where d is the rank of the matrix, are stored in the
> dataspace object header according to the following convention: the
> last value, Pd, is the size of the FASTEST changing dimension of the
> matrix, i.e. the HDF5 file spec and the HDF5 C library follow the C
> storage convention (no wonder, it is a C library :-). Therefore there
> is no ambiguity in interpreting {N x M x K x sizeof(datatype)} bytes,
> and an HDF5 file has enough information for the data to be interpreted
> correctly by any "row-major" or "column-major" application (including
> one bypassing the HDF5 C library and reading directly from the HDF5
> file!)
>
> Here is what happens when the HDF5 Fortran library is used:
>
> Suppose we want to write A(N,M,K) matrix to the HDF5 file.  HDF5
> Fortran API describes dataspace with the first dimension being N, the
> second dimension being M, the third dimension being K (as we would do
> it in C and any other language).  But HDF5 Fortran API also knows
> that the fastest changing dimension has size N (i.e. we have
> column-major order). Therefore HDF5 Fortran library instructs C
> library to store K,M,N values in the dataspace object header instead
> of N,M,K, since N is the size of the fastest changing dimension.
>
> So, if we read matrix A(N,M,K) (i.e. the N x M x K x sizeof(datatype)
> blob) written from Fortran with a C application, we will read it into
> the matrix B(K,M,N) (the C API that requests the sizes of the first,
> second and third dimensions will return the values K,M,N stored in the
> dataspace header).
>
> If we read matrix A(N,M,K) written from Fortran with a Fortran
> application, we will read it once again into B(N,M,K) (the Fortran API
> that requests the sizes of the first, second and third dimensions will
> flip the array K,M,N stored in the file and return N,M,K).
>
> In other words: the HDF5 library stores information about how to
> interpret the data. The interpretation follows the C storage
> convention: the last dimension specified for the dataspace object is
> the fastest changing one. It is the responsibility of the application
> (in this case the Fortran HDF5 library) to interpret the order of
> dimensions correctly and pass it to/from the HDF5 C library.
>
> Please notice that there is no need to transpose the data itself: one
> only has to pass a correct interpretation of the data to the HDF5 C
> library and make sure it is done according to the HDF5 C library
> convention (the first value stored in the dataspace header
> corresponds to the slowest changing dimension, ..., the last value
> stored in the dataspace header corresponds to the fastest changing
> dimension).
>
> Please let me know if my explanation made things worse.  Frankly
> speaking I think it did ;-) but I tried.....
>
> Elena
>
> On May 31, 2008, at 4:54 AM, Francesc Alted wrote:
> > Hi,
> >
> > An HDF5/PyTables user asked whether HDF5 supports meta-information
> > for keeping Fortran/C ordering in datasets.  By reading the docs,
> > it seems to me that HDF5 doesn't support this yet.  Are there plans
> > to support this feature?
> >
> > Thanks,
> >
> > ----------  Forwarded message  ----------
> >
> > Subject: Re: [Pytables-users] Reading Fortran arrays with correct
> > array indexing
> > Date: Saturday 31 May 2008
> > From: "Milos Ilak" <[EMAIL PROTECTED]>
> > To: "Francesc Alted" <[EMAIL PROTECTED]>
> >
> > Hi Francesc,
> >
> > thanks a lot! I didn't know MATLAB used Fortran order too. My
> > Python code needs to read in files written in both orders, so I
> > just added an attribute in my Fortran output routine which the
> > Python code looks for and if it is there, it transposes the data
> > after loading.
> >
> > I would have thought that the meta-information about the order
> > would be stored somewhere in the file. Do you know if the future
> > versions of HDF5 will support this? Thanks again,
> >
> > Milos
> >
> >
> > On Fri, May 30, 2008 at 8:22 AM, Francesc Alted <[EMAIL PROTECTED]>
> > wrote:
> >> On Thursday 29 May 2008, Milos Ilak wrote:
> >>> Hi all,
> >>>
> >>> I apologize if this has been discussed, but I could not find any
> >>> information in the archives. I am creating HDF5 files with 3-D
> >>> arrays in Fortran 90, and I need to read them in both Python and
> >>> MATLAB. While MATLAB recognizes the correct dimensions of the
> >>> arrays, PyTables gets them backwards (i.e. (x,y,z) in Fortran
> >>> becomes (z,y,x) when PyTables reads it). I know that this is due to
> >>> the fact that the order in which Fortran stores arrays is different
> >>> than that of Python, C or MATLAB, and I couldn't determine how
> >>> exactly MATLAB 'knows' that Fortran arrays are being read.
> >>
> >> Well, it is easy: because MATLAB writes and reads arrays in
> >> *Fortran* order.  So, if you write your arrays with Fortran, then
> >> you are not going to have any problem reading them in the correct
> >> order from MATLAB.  However, as PyTables uses a C API to access
> >> HDF5 files, and as C follows a different order for matrices in
> >> memory, you will get inverted dimensions for your Fortran-created
> >> files (as is the case).
> >
> >>> I have tried using the 'isfortran' command in numpy, but I get the
> >>> following error:
> >>>
> >>> >>> hh5f.root
> >>> / (RootGroup) ''
> >>>  children := ['eta' (Array), 'u' (Array), 'w' (Array), 'v'
> >>> (Array), 'y' (Array), 'x' (Array), 'z' (Array)]
> >>>
> >>> >>> hh5f.root.v
> >>> /v (Array(16L, 33L, 32L)) ''
> >>>  atom := Float64Atom(shape=(), dflt=0.0)
> >>>  maindim := 0
> >>>  flavor := 'numpy'
> >>>  byteorder := 'little'
> >>>  chunkshape := None
> >>>
> >>> >>> numpy.isfortran(hh5f.root.v)
> >>> Traceback (most recent call last):
> >>>  File "<stdin>", line 1, in <module>
> >>>  File "/sw/lib/python2.5/site-packages/numpy/core/numeric.py",
> >>> line 184, in isfortran
> >>>    return a.flags.fnc
> >>> AttributeError: 'Array' object has no attribute 'flags'
> >>>
> >>> It seems like there is perhaps some kind of flag I should add
> >>> when writing in Fortran to indicate that the array is in Fortran
> >>> order, but MATLAB somehow seems to know that anyway. Any advice
> >>> would be greatly appreciated.
> >>
> >> You are applying the numpy isfortran() function to a pytables
> >> Array and not a numpy object.  The correct call would be:
> >>
> >> >>> numpy.isfortran(hh5f.root.v[:])
> >>
> >> because the result of reading a pytables Array is a numpy object.
> >>
> >> However, this won't tell you anything about the actual order
> >> (Fortran or C) in which the array was written because this
> >> meta-information is not saved anywhere in the file (apparently HDF5
> >> does not support this yet).  So, unless you want to provide this
> >> info yourself by using, say, an HDF5 attribute, your best bet is to
> >> *deduce* the ordering by knowing that the file comes from a Fortran
> >> or a C program and *transpose* manually your arrays after reading
> >> them (if you need to).
> >>
> >> Hope this helps,
> >>
> >> --
> >> Francesc Alted
> >> Freelance developer
> >> Tel +34-964-282-249
> >>
> >> _______________________________________________
> >> Pytables-users mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
> > -------------------------------------------------------
> >
> > --
> > Francesc Alted
> > Freelance developer
> > Tel +34-964-282-249
> >
> > ----------------------------------------------------------------------
> > This mailing list is for HDF software users discussion.
> > To subscribe to this list, send a message to
> > [EMAIL PROTECTED] .
> > To unsubscribe, send a message to
> > [EMAIL PROTECTED]
>



-- 
Francesc Alted
Freelance developer
Tel +34-964-282-249
