Re: [Pytables-users] Append only file is growing in size like crazy

Thadeus Burgess Thu, 07 Mar 2013 17:14:47 -0800

Thank you for the information. I will run a few more tests over the next
couple of days: one day with no compression, and one day with a chunksize
similar to what will be appended each cycle. Hopefully I will get a chance
to report back.


A ptrepack into a file with no compression produces a file half the size
of its appended, compressed counterpart with all its unused space. The
reason for using compression is to reduce the IO required from the
network-backed storage, not necessarily to reduce disk space, although
that is a plus.
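
A rough sketch of how the periodic repack could be decided and invoked,
using only the standard library. The helper names (slack_ratio,
should_repack, ptrepack_command) and the max_ratio threshold are made up
for illustration; payload_bytes is assumed to be the value that
tbl.size_on_disk reports, and the two ptrepack options are the ones
mentioned elsewhere in this thread.

```python
import os


def slack_ratio(file_bytes, payload_bytes):
    """Return how many times larger the file is than the table data in it."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return file_bytes / payload_bytes


def should_repack(path, payload_bytes, max_ratio=5.0):
    """Decide whether the file at `path` has grown enough to repack.

    max_ratio is an arbitrary, hypothetical threshold; tune it to taste.
    """
    return slack_ratio(os.path.getsize(path), payload_bytes) > max_ratio


def ptrepack_command(src, dst):
    """Build an argument list that copies everything under / of src into dst.

    The list can be handed to subprocess.run() once the writer has closed
    the source file.
    """
    return [
        "ptrepack",
        "--keep-source-filters",
        "--chunkshape=keep",
        f"{src}:/",
        f"{dst}:/",
    ]
```

With the numbers from this thread (a 1GB file holding a table that reports
20MB), slack_ratio(1024**3, 20 * 1024**2) comes out at about 51, so almost
any sensible threshold would trigger a repack.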


--
Thadeus



On Thu, Mar 7, 2013 at 5:40 PM, Anthony Scopatz <[email protected]> wrote:

> Hi Thadeus,
>
> HDF5 does not guarantee that the data is contiguous on disk between
> blocks.  That is, there may be empty space in your file.  Furthermore,
> compression really messes with HDF5's ability to predict how large blocks
> will end up being.  To avoid accidental data loss, HDF5 tends to
> overpredict the empty buffer space needed.
>
> Thus my guess is that by having this tight loop around open/append/close,
> you keep accidentally triggering the allocation of extraneous buffer
> space.  You basically have two options:
>
> 1. Turn off compression.  Size prediction is exact without it.
> 2. Periodically run ptrepack (every 10, 100, 1000 cycles? end of the
> day?).
>
> Hope this helps
> Be Well
> Anthony
>
>
> On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <[email protected]> wrote:
>
>> I have a PyTables file that receives many appends to a Table throughout
>> the day: the file is opened, a small bit of data is appended, and the file
>> is closed. The open/append/close can happen many times in a minute.
>> Anywhere from 1-500 rows are appended at any given time. By the end of the
>> day, this file is expected to have roughly 66000 rows. Chunkshape is set to
>> 1500 for no particular reason (doesn't seem to make a difference, and some
>> other files can be 5 million/day). BLOSC with level 9 compression is used
>> on the table. Data is never deleted from the table. There are roughly 12
>> columns on the Table.
>>
>> The problem is that at the end of the day this file is 1GB in size. I
>> don't understand why the file is growing so big. The tbl.size_on_disk shows
>> a meager 20MB.
>>
>> I have used ptrepack with --keep-source-filters and --chunkshape=keep.
>> The new file is only 30MB in size which is reasonable.
>> I have also used ptrepack with --chunkshape=auto and although it set the
>> chunkshape to around 388, there was no significant change in filesize from
>> chunkshape of 1500.
>>
>> Is PyTables not re-using chunks on new appends? When 50 rows are
>> appended, is it still writing a chunk sized for 1500 rows? When the next
>> append comes along, does it write a brand new chunk instead of opening
>> the old chunk and appending the data?
>>
>> Should my chunksize really be "expected rows to append each time" instead
>> of "expected total rows"?
>>
>> --
>> Thadeus
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Symantec Endpoint Protection 12 positioned as A LEADER in The Forrester
>> Wave(TM): Endpoint Security, Q1 2013 and "remains a good choice" in the
>> endpoint security space. For insight on selecting the right partner to
>> tackle endpoint security challenges, access the full report.
>> http://p.sf.net/sfu/symantec-dev2dev
>> _______________________________________________
>> Pytables-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>>
>>
>
>
