Thank you for the information. I will run a few more tests over the next
couple of days: one day with no compression, and one day with a chunksize
similar to what will be appended each cycle. Hopefully I will get a chance
to report back.
A ptrepack into a file with no compression produces a file half the size
of its appended, compressed counterpart, which carries lots of unused
space. The reason for using compression is to reduce the IO required from
the network-backed storage, not necessarily to reduce disk space, although
that is a plus.
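
For the periodic-repack option, a rough sketch of how it could be wired
into the append loop follows. The function names, the "every 100 cycles"
default, and the ".repacked" temporary suffix are all made up for
illustration; only the ptrepack flags (--keep-source-filters and
--chunkshape) come from the runs described in this thread. It assumes
ptrepack is on the PATH and that nothing else has the file open while it
is being repacked.

```python
import os
import subprocess


def should_repack(cycle, every=100):
    """Return True on every `every`-th append cycle (hypothetical policy)."""
    return cycle > 0 and cycle % every == 0


def build_ptrepack_command(src, dst, keep_filters=True, chunkshape="keep"):
    """Build the ptrepack argument list that copies src's root group to dst."""
    cmd = ["ptrepack"]
    if keep_filters:
        # Keep the source's compression settings (e.g. BLOSC level 9).
        cmd.append("--keep-source-filters")
    cmd.append("--chunkshape={}".format(chunkshape))
    # ptrepack takes "file:group" pairs; ":/" selects the root group.
    cmd.extend(["{}:/".format(src), "{}:/".format(dst)])
    return cmd


def repack_in_place(path):
    """Repack `path` into a temporary file, then swap it over the original."""
    tmp = path + ".repacked"  # hypothetical temporary name
    if os.path.exists(tmp):
        os.remove(tmp)  # clear any leftover from an earlier failed run
    subprocess.run(build_ptrepack_command(path, tmp), check=True)
    os.replace(tmp, path)  # only reached if ptrepack succeeded
```

In the append loop this would be called as something like
"if should_repack(cycle): repack_in_place(filename)" right after the file
is closed. Whether every 10, 100, or 1000 cycles is the right interval
would have to be measured against the IO cost of the repack itself.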
--
Thadeus
On Thu, Mar 7, 2013 at 5:40 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> Hi Thadeus,
>
> HDF5 does not guarantee that the data is contiguous on disk between
> blocks. That is, there may be empty space in your file. Furthermore,
> compression really messes with HDF5's ability to predict how large blocks
> will end up being. To avoid accidental data loss, HDF5 tends to
> overpredict the empty buffer space needed.
>
> Thus my guess is that by having this tight loop around open/append/close,
> you keep accidentally triggering extraneous buffer space. You basically
> have two options:
>
> 1. Turn off compression. Size prediction is exact without it.
> 2. Periodically run ptrepack (every 10, 100, 1000 cycles? At the end of
> the day?).
>
> Hope this helps
> Be Well
> Anthony
>
>
> On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <thade...@thadeusb.com> wrote:
>
>> I have a PyTables file that receives many appends to a Table throughout
>> the day. Each time, the file is opened, a small bit of data is appended,
>> and the file is closed. This open/append/close cycle can happen many
>> times in a minute, with anywhere from 1 to 500 rows appended at a time.
>> By the end of the day, this file is expected to have roughly 66000 rows.
>> Chunkshape is set to 1500 for no particular reason (it doesn't seem to
>> make a difference, and some other files can be 5 million/day). BLOSC
>> with level 9 compression is used on the table. Data is never deleted
>> from the table. There are roughly 12 columns on the Table.
>>
>> The problem is that at the end of the day this file is 1GB in size. I
>> don't understand why the file is growing so big. The tbl.size_on_disk shows
>> a meager 20MB.
>>
>> I have used ptrepack with --keep-source-filters and --chunkshape=keep.
>> The new file is only 30MB in size which is reasonable.
>> I have also used ptrepack with --chunkshape=auto and although it set the
>> chunkshape to around 388, there was no significant change in filesize from
>> chunkshape of 1500.
>>
>> Is PyTables not re-using chunks on new appends? When 50 rows are
>> appended, is it still writing a chunk sized for 1500 rows? When the next
>> append comes along, does it write a brand new chunk instead of opening
>> the old chunk and appending the data?
>>
>> Should my chunksize really be "expected rows to append each time" instead
>> of "expected total rows"?
>>
>> --
>> Thadeus
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
>> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
>> endpoint security space. For insight on selecting the right partner to
>> tackle endpoint security challenges, access the full report.
>> http://p.sf.net/sfu/symantec-dev2dev
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>>
>>
>
>