Hi all,

I was playing today with the different compression options
for our data, trying to find optimal settings. Our data are
16-bit images whose dynamic range is limited most of the
time, so I would expect the shuffle filter to give us some
improvement over plain zlib compression. To my surprise,
enabling shuffle did not change the compression factor at
all. Looking at the shuffle filter code, it seems the reason
is the structure of our data. The dataset which contains the
images is a 1-dimensional dataspace with each element
containing another 2- or 3-dimensional image stack:

DATASET "..." {
   DATATYPE  H5T_ARRAY { [32][185][388] H5T_STD_I16LE }
   DATASPACE  SIMPLE { ( 2132 ) / ( H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 1 )
      SIZE 6608778714 (1.482:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 1 }
   }
}

The image arrays are quite big, so most of the time a chunk
holds just one single array.
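For reference, here is a tiny Python model of the shuffle transform as I understand it (my own sketch, not the HDF5 code): it groups byte 0 of every element, then byte 1, and so on. With one element per chunk, the element size equals the whole buffer length and the transform degenerates to the identity:

```python
def shuffle(buf: bytes, elem_size: int) -> bytes:
    """Byte-shuffle: emit byte 0 of every element, then byte 1, etc."""
    n = len(buf) // elem_size
    return bytes(buf[e * elem_size + b]
                 for b in range(elem_size) for e in range(n))

# Two 16-bit little-endian words: low bytes are grouped, then high bytes.
assert shuffle(b"\x01\x02\x03\x04", 2) == b"\x01\x03\x02\x04"

# One element per chunk (elem_size == len(buf)): nothing moves at all.
chunk = b"\x01\x02\x03\x04"
assert shuffle(chunk, len(chunk)) == chunk
```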

My understanding is that the shuffle algorithm re-orders
bytes from the multiple elements in a chunk, but because
there is just one element (the whole array) in this case,
it does nothing at all. What I would like shuffle to do here
is shuffle the 16-bit words from the array, i.e. not treat
the array as a single opaque element but look inside it.
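To illustrate the effect I am after, here is a stand-alone sketch (plain Python plus zlib, not HDF5; the sample data are made up) showing why per-16-bit-word shuffling helps limited-dynamic-range data: all the nearly constant high bytes end up in one contiguous run, which deflate handles very well:

```python
import random
import struct
import zlib

def shuffle(buf: bytes, elem_size: int) -> bytes:
    """Byte-shuffle: emit byte 0 of every element, then byte 1, etc."""
    n = len(buf) // elem_size
    return bytes(buf[e * elem_size + b]
                 for b in range(elem_size) for e in range(n))

# Fake "image": 16-bit samples confined to a 64-count dynamic range,
# so the high byte of every little-endian word is identical (0x04).
random.seed(0)
samples = [1024 + random.randrange(64) for _ in range(4096)]
raw = struct.pack("<%dh" % len(samples), *samples)

plain = len(zlib.compress(raw, 1))              # deflate level 1, as in the dump
shuffled = len(zlib.compress(shuffle(raw, 2), 1))
print(plain, shuffled)  # the shuffled buffer compresses noticeably smaller
```

The exact ratio of course depends on the data, but the direction of the effect is what the real image data shows as well.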

I did some experimenting, and with a small change to the
code I managed to convince it to shuffle things correctly.
The diff is below this message. It does indeed improve
compression of the image data, and the data can be read back
correctly after de-shuffling with the standard code (h5dump
shows identical results).
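That the standard de-shuffle recovers the data is expected in general: shuffle is a pure byte transpose, so shuffling N elements of size S and then "shuffling back" with the roles of N and S swapped is the identity. A minimal sketch (my notation, not the library code):

```python
def shuffle(buf: bytes, elem_size: int) -> bytes:
    """Byte-shuffle: emit byte 0 of every element, then byte 1, etc."""
    n = len(buf) // elem_size
    return bytes(buf[e * elem_size + b]
                 for b in range(elem_size) for e in range(n))

def unshuffle(buf: bytes, elem_size: int) -> bytes:
    """Inverse: transposing the S-by-n layout back to n-by-S."""
    return shuffle(buf, len(buf) // elem_size)

data = bytes(range(16))
for elem_size in (1, 2, 4, 8, 16):
    assert unshuffle(shuffle(data, elem_size), elem_size) == data
```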

It would be really helpful for us if something like this
could be added to the HDF5 library. I do not particularly
care about other types such as compounds, but for datasets
whose elements are plain arrays it can probably be done
without breaking compatibility. OTOH, if more options were
added to shuffle to control the shuffling of arrays and other
types of data, it could become even more useful.

Thanks,
Andy

----------------------------------------------------------------------
This is the change applied to 1.8.6 code:

*** H5Zshuffle.c.orig   2011-02-14 08:23:19.000000000 -0800
--- H5Zshuffle.c        2011-09-06 17:18:13.022259993 -0700
***************
*** 88,93 ****
--- 88,98 ----
      if(H5P_get_filter_by_id(dcpl_plist, H5Z_FILTER_SHUFFLE, &flags, &cd_nelmts, cd_values, (size_t)0, NULL, NULL) < 0)
        HGOTO_ERROR(H5E_PLINE, H5E_CANTGET, FAIL, "can't get shuffle parameters")
  
+     /* If object is an array use its base type */
+     while (H5T_get_class(type, FALSE) == H5T_ARRAY) {
+         type = H5T_get_super(type);
+     }
+ 
      /* Set "local" parameter for this dataset */
      if((cd_values[H5Z_SHUFFLE_PARM_SIZE] = (unsigned)H5T_get_size(type)) == 0)
        HGOTO_ERROR(H5E_PLINE, H5E_BADTYPE, FAIL, "bad datatype size")


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org