Re: [Pytables-users] Optimizing pytables for reading entire columns at a time

Alvaro Tejero Cantero Fri, 21 Sep 2012 02:50:57 -0700

Hi!

You may want to have a look at, reuse, or combine your approach with the
one implemented in pandas (pandas.io.pytables.HDFStore):


https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py

(see _write_array method)

pandas is fairly liberal with dtypes (partly because of the missing-data
problem), so VLArrays often get created, and those might not be the most
performant solution. But if the types of the columns in the data frames
are guessed right, then CArrays embedded in groups will be used, as far
as I understand (as suggested earlier in this thread).
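
As a rough illustration of the "CArrays embedded in groups" layout, here is a
minimal sketch using the current PyTables 3.x function names. The helper
names (write_columns, read_column), the group name "data", and the sample
columns are made up for this example; this is not what pandas does
internally, only the general idea of one array per column.

```python
import os
import tempfile

import numpy as np
import tables


def write_columns(path, columns, group_name="data"):
    """Store each column as its own CArray inside one group (hypothetical helper)."""
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", group_name)
        for name, values in columns.items():
            # obj= lets PyTables infer the atom and shape from the array.
            h5.create_carray(group, name, obj=np.asarray(values))


def read_column(path, name, group_name="data"):
    """Read one column; the other columns' datasets are not read."""
    with tables.open_file(path, mode="r") as h5:
        return h5.get_node("/" + group_name, name).read()


# Made-up sample data mirroring the column types mentioned in the thread.
columns = {
    "label": np.array([b"a", b"b", b"c"], dtype="S12"),
    "x": np.array([1.0, 2.0, 3.0], dtype=np.float64),
    "y": np.array([0.5, 1.5, 2.5], dtype=np.float32),
}

path = os.path.join(tempfile.mkdtemp(), "columns.h5")
write_columns(path, columns)
x = read_column(path, "x")
```

Each column is a separate HDF5 dataset, so reading "x" only touches that
dataset. The price is that keeping the columns the same length is now up to
the caller.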

Best,

 -á.



On 21 September 2012 01:14, Anthony Scopatz <scop...@gmail.com> wrote:
> Luke,
>
> I'd also like to mention that if you don't want to wait for us to implement
> this, we will gladly take contributions ;).  If you need help getting started
> or at any point in the process, we are happy to provide that too.  Please
> sign up for PyTables Dev (pytables-...@googlegroups.com) so we can move
> implementation discussions away from the users list.  Clearly, people would
> benefit from you taking this upon yourself, should you choose to accept this
> mission!
>
> Be Well
> Anthony
>
> On Thu, Sep 20, 2012 at 3:26 PM, Josh Ayers <josh.ay...@gmail.com> wrote:
>>
>> Depending on your use case, you may be able to get around this by storing
>> each column in its own table.  That will effectively store the data in
>> column-first order.  Instead of creating a table, you would create a group,
>> which then contains a separate table for each column.
>>
>> If you want, you can wrap all the functionality you need in a single
>> object that hides the complexity and makes it act just like a single table.
>> I did something similar to this recently and it's worked well.  However, I
>> wasn't too concerned with exactly matching the Table API or implementing all
>> of its features.
>>
>> Creating a more general version that does duplicate the Table class
>> interface and can be included in PyTables is definitely possible and is
>> something I'd like to do, but I've never had the necessary time to dedicate
>> to it.
>>
>> Hope that helps,
>> Josh
>>
>>
>>
>> On Wed, Sep 19, 2012 at 10:56 AM, Francesc Alted <fal...@pytables.org>
>> wrote:
>>>
>>> On 9/19/12 3:37 PM, Luke Lee wrote:
>>> > Hi all,
>>> >
>>> > I'm attempting to optimize my HDF5/pytables application for reading
>>> > entire columns at a time.  I was wondering what the best way to go
>>> > about this is.
>>> >
>>> > My HDF5 has the following properties:
>>> >
>>> > - 400,000+ rows
>>> > - 25 columns
>>> > - 147 MB in total size
>>> > - 1 string column of size 12
>>> > - 1 column of type 'Float'
>>> > - 23 columns of type 'Float64'
>>> >
>>> > My access pattern for this data is generally to read an entire column
>>> > out at a time.  So, I want to minimize the number of disk accesses
>>> > this takes and store data contiguously by column.
>>>
>>> To start with, you must be aware that the Table object stores data in
>>> row-order, not column order.  In practice, that means that whenever you
>>> want to access a single column, you will need to traverse the *entire*
>>> table.
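
To put rough numbers on the row-order point, here is a back-of-the-envelope
sketch. It assumes exactly 400,000 rows and a 4-byte 'Float' column (neither
is stated precisely in the original post), and it ignores compression,
chunk caching, and HDF5 overhead.

```python
ROWS = 400_000          # lower bound from the original post ("400,000+")
STRING_BYTES = 12       # the one string column
FLOAT_BYTES = 4         # assumption: 'Float' taken as 32-bit
FLOAT64_BYTES = 8
N_FLOAT64 = 23

# Bytes per row in the compound (row-order) layout.
row_bytes = STRING_BYTES + FLOAT_BYTES + N_FLOAT64 * FLOAT64_BYTES

# Row order: every row has to be read to pull out one Float64 column.
row_order_read = ROWS * row_bytes

# Column order: only the wanted column is read.
column_order_read = ROWS * FLOAT64_BYTES

print(row_bytes)                            # 200
print(row_order_read)                       # 80000000
print(column_order_read)                    # 3200000
print(row_order_read // column_order_read)  # 25
```

Under these assumptions a single Float64 column is about 1/25 of the data
that a row-order scan has to read.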
>>>
>>> I always wanted to implement a column-order table in PyTables, but that
>>> did not happen in the end.
>>>
>>> >
>>> > I think the proper way to do this via HDF5 is to use 'chunking.'  I'm
>>> > creating my HDF5 files via Pytables so I guess using the 'chunkshape'
>>> > parameter during creation is the correct way to do this?
>>>
>>> Yes, it is.
>>>
>>> >
>>> > All of the HDF5 documentation I read discusses 'chunksize' in terms of
>>> > rows and columns.  However, the Pytables 'chunkshape' parameter only
>>> > takes a single number.  I looked through the source and see that I can
>>> > in fact pass a tuple, which I assume is (row, column) as the HDF5
>>> > documentation would suggest.
>>>
>>> Not quite.  The Table object is actually a one-dimensional beast, but
>>> with a 'compound' datatype (which in some ways can be regarded as another
>>> dimension, although it is not a 'true' dimension).
>>>
>>> >
>>> > Is it best to use the 'expectedrows' parameter instead of the
>>> > 'chunkshape' or use both?
>>>
>>> You can try both.  The `expectedrows` parameter was introduced to ease
>>> the life of users, and it 'optimizes' the `chunkshape`, but only for
>>> 'normal' usage.  For specific requirements, setting the `chunkshape`
>>> directly normally gives better results.
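
For reference, a minimal sketch of the two options with the current
PyTables 3.x names. The three-column dtype, the table names, and the row
counts are made up for illustration; the values are not tuning
recommendations.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical three-column layout standing in for the 25-column table.
row_dtype = np.dtype([("label", "S12"), ("x", np.float64), ("y", np.float64)])

path = os.path.join(tempfile.mkdtemp(), "chunks.h5")
with tables.open_file(path, mode="w") as h5:
    # Option 1: let PyTables derive the chunkshape from the expected size.
    auto = h5.create_table("/", "auto", description=row_dtype,
                           expectedrows=400_000)

    # Option 2: set it by hand. For a Table this is a 1-tuple:
    # the number of rows per chunk, not (rows, columns).
    manual = h5.create_table("/", "manual", description=row_dtype,
                             chunkshape=(1000,))

    auto_chunkshape = auto.chunkshape
    manual_chunkshape = manual.chunkshape

    manual.append(np.zeros(5000, dtype=row_dtype))
    manual.flush()

    # Returns a single column, but the rows are still stored together,
    # so every chunk of the table is read to produce it.
    x = manual.read(field="x")

print(auto_chunkshape, manual_chunkshape)
```

Timing column reads for a few chunkshape values on the real dataset is
probably the most reliable way to choose one.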
>>>
>>> >
>>> > I have done some debugging/profiling and discovered that my default
>>> > chunkshape is 321 for this dataset.  I have increased this to 1000 and
>>> > see quite a bit better speeds.  I'm sure I could keep changing these
>>> > numbers and find what is best for this particular dataset.  However,
>>> > I'm seeking a bit more knowledge on how Pytables uses each of these
>>> > parameters, how they relate to the HDF5 'chunking' concept and
>>> > best-practices.  This will help me to understand how to optimize in
>>> > the future instead of just for this particular dataset.  Is there any
>>> > documentation on best practices for using the 'expectedrows' and
>>> > 'chunkshape' parameters?
>>>
>>> Well, there is:
>>>
>>> http://pytables.github.com/usersguide/optimization.html
>>>
>>> but I'm sure you already know this.
>>>
>>> Frankly, if you want to speed up column retrieval, you are
>>> going to need an object that is stored in column order.  In this sense,
>>> you may want to experiment with the ctable object in the carray package
>>> (https://github.com/FrancescAlted/carray).  It supports roughly the same
>>> capabilities as the Table object, but column order is implemented
>>> properly, so a ctable will probably buy you a nice speed-up.
>>>
>>> >
>>> > Thank you for your time,
>>>
>>> Hope this helps,
>>>
>>> --
>>> Francesc Alted
>>>
>>>
>>>
>>> _______________________________________________
>>> Pytables-users mailing list
>>> Pytables-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/pytables-users

