On Monday 12 April 2010 18:40:33, Stamminger, Johannes wrote:
> Fine to have some more feedback! :-)
>
> Did the attached spreadsheet make its way to the forum? Or was it
> filtered?

Yes, it made it into the list. It is a small file and OpenOffice can open it
easily, so I suppose it is fine if you send more of these (although if you
can come up with a PDF file, that would be better).

> > That could be a consequence of the disk cache subsystem of the OS it is
> > working on. If you want better reproducibility in your results, try to
> > flush the OS cache (sync on UNIX-like OSes) before taking time
> > measurements. Of course, you may not be interested in measuring your
> > disk I/O, but only the disk cache subsystem throughput, but this is
> > always tricky to do.
>
> Maybe. Though I never before noticed such a variance (and never thought
> of any explicit sync'ing) ...
>
> Btw: I'm running a 64-bit Linux (latest Ubuntu) with a RAID0 filesystem.
> But I use the 32-bit version of the HDF library.
>
> And additionally please note that I run the tests from a Java unit test!

Please don't take my words as if they were absolute truth. I'm just talking
about my own experience, and when doing benchmarks, you should be aware that
there is a huge difference between when your dataset fits in cache and when
it doesn't. That *may be* affecting you.
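In case it helps, this is roughly what I mean by flushing the OS cache
between runs (a minimal sketch for Linux only; sync() just flushes dirty
buffers, and dropping the clean page cache needs root, so take it as an
illustration rather than something definitive):

#include <stdio.h>
#include <unistd.h>

/* Illustrative helper: flush dirty buffers and (when running as root)
 * drop the Linux page cache, so each benchmark run starts cold. */
static void flush_os_caches(void)
{
    sync();                          /* write dirty pages to disk */

    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f != NULL) {                 /* needs root; skip otherwise */
        fputs("3\n", f);             /* 3 = page cache + dentries/inodes */
        fclose(f);
    }
}

Calling something like this right before each timed run should make your
numbers quite a bit more repeatable.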
> > Don't know about this one, but it is certainly strange to see this
> > dramatic loss in performance when passing from 64 KB to 128 KB
> > chunksize. It would be nice if you could build a small benchmark
> > showing this problem in performance and send it to the HDF Group for
> > further analysis.
>
> I may extract this test with some small effort. But it is Java then
> wrapping the native shared libraries. And it is *not* hdf-java, as this
> does not support H5PT, but uses JNA for that purpose.
>
> Still interested?

I suppose so, but you should ask the THG helpdesk just to be sure ;-)

> I'm still measuring - but I was surprised again by the findings. E.g.
> with arrays of size 16384 it seems best to use chunksize 32, compression
> level 4 and to write as many arrays as possible (maybe there is an upper
> limit that I did not reach yet) with a single call to H5PTappend. With
> that I get the data written in 217 s to a file of size 160 MB.
>
> The data is the same as I used for writing the strings. But now without
> conversion to hex strings, 468 MB in sum. With the overhead of the
> fixed-length arrays, the total data written to the file is 16.2 GB (the
> overhead bytes are zero'ed). With the latter in mind, the resulting file
> size of 160 MB is quite imaginable. But compared with writing the same
> data to a zip with on-the-fly compression it is not, as that leads to
> 50 MB in 65 s (with no performance tuning like writing data in blocks
> etc.) ...

Well, 160 MB wrt 468 MB is quite fine. Indeed, zip is compressing better for
a series of reasons. The first one is that zip is probably using larger
block sizes here. In addition, HDF5 is designed to be able to look at each
chunk directly, not sequentially, so it has to add some overhead (in the
form of a B-tree) to quickly locate chunks; you cannot (as far as I know) do
the same with zip. Finally, keep in mind that you are actually compressing
16.2 GB instead of 468 MB. And although most of the 16.2 GB are zeros, the
compressor still has to walk through, chew and encode them. So you can
never expect the same speed/compression ratio as zip in this scenario.

But I'm curious when you say that you were converting data to hex strings.
Why were you doing so? If your data are typically ints or floats, you may
want to use the shuffle filter in combination with zlib. In many
circumstances, shuffle may buy you a significant additional compression
ratio. This is something that zip cannot do (it can only compress streams
of bytes, as it has no notion of ints/floats).
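For illustration, here is a minimal sketch (plain HDF5 C API, with names and
sizes made up) of how shuffle + zlib can be enabled on a chunked dataset via
a dataset creation property list; as far as I know the simple H5PTcreate_fl()
call only takes a gzip level and does not expose the shuffle filter, so you
may need to go through the regular dataset API for this:

#include "hdf5.h"

/* Sketch: create a chunked, extendible 1-D dataset of ints with the
 * shuffle filter followed by zlib (deflate) compression. */
hid_t make_compressed_dataset(hid_t file_id)
{
    hsize_t dims[1]    = {16384};          /* current size           */
    hsize_t maxdims[1] = {H5S_UNLIMITED};  /* allow appending        */
    hsize_t chunk[1]   = {4096};           /* chunk size in elements */

    hid_t space = H5Screate_simple(1, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_shuffle(dcpl);                  /* byte-shuffle before compressing */
    H5Pset_deflate(dcpl, 4);               /* zlib level 4                    */

    hid_t dset = H5Dcreate2(file_id, "/data", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

The idea of shuffle is that it groups bytes of the same significance
together before zlib sees them, which usually helps quite a lot on arrays
of ints or floats.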
> With big chunksizes both performance and file size degrade by a large
> factor. The worst example was 819K, leading to a file of 513 MB (50
> arrays with 16384 bytes each, compression 0, chunk size 32K).

Uh, you lost me. What is 819 KB, the chunksize?

> > When sending strings to HDF5 containers, be sure to zero the memory
> > buffer area after the end of the string: this could improve the
> > compression ratio quite a lot.
>
> I'm using H5PT - I do not see any method for doing such a thing there.
> What method did you think of?

I was talking about zero'ing the overhead bytes on each string, but I see
that you are doing this already.

-- 
Francesc Alted
