Hi Thadeus,

HDF5 does not guarantee that the data is contiguous on disk between blocks.
That is, there may be empty space in your file.  Furthermore, compression
really messes with HDF5's ability to predict how large blocks will end up
being.  To avoid accidental data loss, HDF5 tends to over-predict the empty
buffer space needed.
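
For reference, the access pattern you describe boils down to something like
this minimal sketch (file name, table path, and row values are hypothetical
stand-ins, using PyTables 3.x spellings; it assumes /mytable already exists):

    import tables as tb

    # Reopen the file, append a few rows, close it again -- many times a
    # minute, as described in the quoted message below.
    for i in range(500):
        with tb.open_file("data.h5", mode="a") as h5:
            # assumes /mytable already exists with two Float64 columns
            h5.root.mytable.append([(float(i), 42.0)])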

Thus my guess is that this tight loop around open/append/close keeps
accidentally triggering that extra buffer-space allocation.  You basically
have two options (sketched below):

1. Turn off compression.  Size prediction is exact without it.
2. Periodically run ptrepack (every 10, 100, or 1000 cycles? at the end of the day?).
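
Minimal sketches of both, under the same hypothetical names as above (again
PyTables 3.x spellings; only tb.Filters(complevel=0) and the ptrepack flags
are the actual point):

    import tables as tb

    # Option 1: create the table with compression disabled, so chunk-size
    # prediction is exact.
    class Record(tb.IsDescription):
        timestamp = tb.Float64Col()
        value = tb.Float64Col()

    with tb.open_file("data.h5", mode="a") as h5:
        if "/mytable" not in h5:
            h5.create_table("/", "mytable", Record,
                            filters=tb.Filters(complevel=0),  # no compression
                            expectedrows=66000)
        h5.root.mytable.append([(1362697560.0, 42.0)])

And for option 2, compacting periodically (shown here via subprocess;
running ptrepack from cron or a shell works just as well):

    import os
    import subprocess

    # Option 2: every N cycles, or at the end of the day, rewrite the file
    # with ptrepack to reclaim the over-allocated space, then swap it in.
    subprocess.check_call([
        "ptrepack", "--keep-source-filters", "--chunkshape=keep",
        "data.h5:/", "compacted.h5:/",
    ])
    os.replace("compacted.h5", "data.h5")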

Hope this helps
Be Well
Anthony


On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <thade...@thadeusb.com> wrote:

> I have a PyTables file that receives many appends to a Table throughout
> the day, the file is opened, a small bit of data is appended, and the file
> is closed. The open/append/close can happen many times in a minute.
> Anywhere from 1-500 rows are appended at any given time. By the end of the
> day, this file is expected to have roughly 66000 rows. Chunkshape is set to
> 1500 for no particular reason (it doesn't seem to make a difference, and some
> other files can reach 5 million rows/day). BLOSC with level 9 compression is
> used on the table. Data is never deleted from the table. There are roughly 12
> columns on the Table.
>
> The problem is that at the end of the day this file is 1GB in size. I
> don't understand why the file is growing so big. The tbl.size_on_disk shows
> a meager 20MB.
>
> I have used ptrepack with --keep-source-filters and --chunkshape=keep. The
> new file is only 30MB in size, which is reasonable.
> I have also used ptrepack with --chunkshape=auto, and although it set the
> chunkshape to around 388, there was no significant change in file size from
> a chunkshape of 1500.
>
> Is PyTables not re-using chunks on new appends? When 50 rows are appended,
> is it still writing a chunk sized for 1500 rows? When the next append comes
> along, does it write a brand new chunk instead of opening the old chunk and
> appending the data?
>
> Should my chunksize really be "expected rows to append each time" instead
> of "expected total rows"?
>
> --
> Thadeus
>
