Re: [Pytables-users] Large (to very large) datasets...

Francesc Alted Wed, 31 Oct 2012 13:05:58 -0700

On 10/31/12 4:02 PM, Francesc Alted wrote:
> On 10/31/12 10:12 AM, Andrea Gavana wrote:
>> Hi Francesc & All,
>>
>> On 31 October 2012 14:13, Francesc Alted wrote:
>>> On 10/31/12 4:30 AM, Andrea Gavana wrote:
>>>> Thank you for all your suggestions. I managed to slightly modify the
>>>> script you attached and I am also experimenting with compression.
>>>> However, in the newly attached script the underlying table is not
>>>> modified, i.e., this assignment:
>>>>
>>>> for p in table:
>>>>       p['results'][:NUM_SIM, :, :] = numpy.random.random(
>>>>           size=(NUM_SIM, len(ALL_DATES), 7))
>>>>       table.flush()
>>> For modifying row values you need to assign a complete row object.
>>> Something like:
>>>
>>> for i in range(len(table)):
>>>       myrow = table[i]
>>>       myrow['results'][:NUM_SIM, :, :] = numpy.random.random(
>>>           size=(NUM_SIM, len(ALL_DATES), 7))
>>>       table[i] = myrow
>>>
>>> You may also use Table.modifyColumn() for better efficiency. Look at
>>> the different modification methods here:
>>>
>>> http://pytables.github.com/usersguide/libref/structured_storage.html#table-methods-writing
>>>
>>> and experiment with them.
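
As a minimal sketch of the advice above: when iterating over a table, PyTables does not, as far as I understand, write a row back unless the field is assigned as a whole and Row.update() is called. The example uses the PyTables 3.x method names (open_file, create_table, iterrows), which differ from the camelCase names current when this thread was written; the ObjectResults description, the file name and the tiny sizes are hypothetical stand-ins for the attached script, which is not shown here.

```python
import os
import tempfile

import numpy as np
import tables

# Tiny, hypothetical sizes so the sketch runs quickly.
NUM_SIM, TSTEPS, NUM_OBJECTS = 2, 4, 3


class ObjectResults(tables.IsDescription):
    """Hypothetical row layout: an object name plus its results block."""
    name = tables.StringCol(16)
    results = tables.Float64Col(shape=(NUM_SIM, TSTEPS, 7))


path = os.path.join(tempfile.mkdtemp(), "row_update_demo.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "objects", ObjectResults)

    # Create the rows; 'results' is left at the column default (zeros).
    new_row = table.row
    for i in range(NUM_OBJECTS):
        new_row["name"] = ("object_%d" % i).encode("ascii")
        new_row.append()
    table.flush()

    # Assign the whole field, then call update() to write the row back.
    for row in table.iterrows():
        row["results"] = np.random.random((NUM_SIM, TSTEPS, 7))
        row.update()
    table.flush()

    # Read everything back to check that the changes reached the file.
    stored = table.read()["results"]
```

Calling flush() once after the loop, rather than once per row, should also avoid some unnecessary I/O.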
>> Thank you, I have tried different approaches and they all seem to run
>> more or less at the same speed (see below). I had to slightly modify
>> your code from:
>>
>> table[i] = myrow
>>
>> to
>>
>> table[i] = [myrow]
>>
>> To avoid exceptions.
>>
>> In the newly attached file, I switched to blosc for compression (but
>> with compression level 1) and ran a few sensitivity tests. By calling
>> the attached script as:
>>
>> python pytables_test.py NUM_SIM
>>
>> where "NUM_SIM" is an integer, I get the following timings and file 
>> sizes:
>>
>> C:\MyProjects\Phaser\tests>python pytables_test.py 10
>> Number of simulations   : 10
>> H5 file creation time   : 0.879s
>> Saving results for table: 6.413s
>> H5 file size (MB)       : 193
>>
>>
>> C:\MyProjects\Phaser\tests>python pytables_test.py 100
>> Number of simulations   : 100
>> H5 file creation time   : 4.155s
>> Saving results for table: 86.326s
>> H5 file size (MB)       : 1935
>>
>>
>> I don't think I will try the 1,000 simulations case :-) . I believe I
>> still don't understand what the best strategy would be for my problem.
>> I basically need to save all the simulation results for all the 1,200
>> "objects", each of which has a timeseries matrix of 600x7 size. In the
>> GUI I have, these 1,200 "objects" are grouped into multiple
>> categories, and multiple categories can reference the same "object",
>> i.e.:
>>
>> Category_1: object_1, object_23, object_543, etc...
>> Category_2: object_23, object_100, object_543, etc...
>>
>> So my idea was to save all the "objects" results to disk and, upon the
>> user's choice, build the categories results "on the fly", i.e. by
>> seeking the H5 file on disk for the "objects" belonging to that
>> specific category and summing up all their results (over time, i.e.
>> the 600 time-steps). Maybe I would be better off with a 4D array
>> (NUM_OBJECTS, NUM_SIM, TSTEPS, 7) as a table, but then I will lose the
>> ability to reference the "objects" by their names...
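
A rough sketch of the 4D-array idea described above, assuming a plain name-to-index dictionary is enough to keep referencing "objects" by name. It uses an in-memory NumPy array with tiny, hypothetical sizes; in practice the array would live on disk. The category contents and the choice to sum over the objects while keeping the (NUM_SIM, TSTEPS, 7) shape are assumptions about what the category results should look like.

```python
import numpy as np

# Hypothetical small sizes; the real case is about 1,200 objects x 600 steps.
NUM_OBJECTS, NUM_SIM, TSTEPS = 5, 2, 4

# A side lookup table keeps the name -> first-axis index mapping.
names = ["object_%d" % i for i in range(NUM_OBJECTS)]
index_of = {name: i for i, name in enumerate(names)}

rng = np.random.default_rng(0)
results = rng.random((NUM_OBJECTS, NUM_SIM, TSTEPS, 7))

# Several categories may reference the same object.
categories = {
    "Category_1": ["object_1", "object_3"],
    "Category_2": ["object_3", "object_4"],
}


def category_total(category):
    """Sum the results of every object that belongs to the category."""
    rows = [index_of[name] for name in categories[category]]
    return results[rows].sum(axis=0)


total = category_total("Category_1")
```

The dictionary could be stored next to the array (for example as a separate column of names), so the mapping survives between sessions.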
>
> You should keep experimenting with different approaches until you
> find the one that works best for you.  Regarding using the 4D
> array as a table, I might be misunderstanding your problem, but you
> can still reference objects by name by using:
>
> row = table.where("name == '%s'" % my_name)
> table[row.nrow] = ...


Uh, I rather meant:

row = table.readWhere("name == '%s'" % my_name)
table[row.nrow] = ...

but you probably got the idea already.
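
For the record, a self-contained sketch of the lookup-by-name idea, written against the PyTables 3.x names (where, read_where) rather than the 2.x ones used above. It iterates over the matching rows, so that row.nrow is available, and writes each one back with Row.update(). The ObjectResults layout, the file name and the sizes are hypothetical, and the name is passed through condvars as bytes, which I believe is what string columns expect under Python 3.

```python
import os
import tempfile

import numpy as np
import tables

NUM_SIM, TSTEPS, NUM_OBJECTS = 2, 4, 5  # hypothetical small sizes


class ObjectResults(tables.IsDescription):
    """Hypothetical row layout: an object name plus its results block."""
    name = tables.StringCol(16)
    results = tables.Float64Col(shape=(NUM_SIM, TSTEPS, 7))


path = os.path.join(tempfile.mkdtemp(), "lookup_demo.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "objects", ObjectResults)
    new_row = table.row
    for i in range(NUM_OBJECTS):
        new_row["name"] = ("object_%d" % i).encode("ascii")
        new_row.append()
    table.flush()

    # The value is supplied through condvars instead of string formatting.
    condition = "name == my_name"
    condvars = {"my_name": b"object_3"}

    matched = []
    for row in table.where(condition, condvars=condvars):
        matched.append(int(row.nrow))  # row number of the match
        row["results"] = np.ones((NUM_SIM, TSTEPS, 7))
        row.update()
    table.flush()

    # Only the matching row should have changed.
    updated = table.read_where(condition, condvars=condvars)["results"]
    untouched = table[0]["results"]
```

If lookups by name turn out to be frequent, indexing the name column may be worth trying, though I have not measured it for this case.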

-- 
Francesc Alted

