On Mon, Jun 11, 2012 at 2:00 PM, Aquil H. Abdullah <aquil.abdul...@gmail.com> wrote:

> Hello All,
>
> I've recently started using PyTables and I am very excited about its
> speed and ease of use for large datasets. However, I have a problem with
> user-defined table attributes that I have not been able to solve.
>
> I have a table that contains observations of entities that can be
> classified as different types. The timestamp of the last observation may
> differ from entity to entity. To process this table, I would like to
> be able to determine the timestamp of the last observation for each of
> these entities. The problem is easy as long as I know the entity types.
> For example:
>
> import tables
> h5file = tables.openFile('data.h5',mode='r+')
> tbl = h5file.getNode('/series','data1')
> last_obs = max(x['timestamp'] for x in tbl.where("""entity_type=='e1'"""))
>
> However, my problem is that as I read from my source I may not always
> know the entity type beforehand. I was going to add a last_observation
> attribute to my table; however, I found the issue at
> https://github.com/PyTables/PyTables/issues/145, which says that
> attributes aren't persistent.
>

Hello Aquil,

That issue applies only to instance attributes set directly on the
in-memory Python object, not to HDF5 attributes stored in the file.


> So I have two questions:
>
> 1. Are there any user-defined attributes that are persistent?
>

Yes, these are the HDF5 attributes of a node. You have to access them
through the "attrs" namespace. To use your example above:

tbl.attrs.last_obs = 42.0

See
http://pytables.github.com/usersguide/libref.html?highlight=attrs#the-attributeset-class
for more info.
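
To make the persistence concrete, here is a minimal sketch of a round trip: set the attribute, close the file, reopen it, and read the value back. It assumes the snake_case names from PyTables 3.x (open_file, create_table, get_node), which correspond to the openFile/getNode calls used earlier in this thread. The Observation description, the helper function names, and the temporary file path are made up for illustration.

```python
import os
import tempfile

import tables


class Observation(tables.IsDescription):
    # Hypothetical row layout, only here so the table can be created.
    entity_type = tables.StringCol(8)
    timestamp = tables.Float64Col()


def write_last_obs(path, value):
    """Create /series/data1 and store `value` as an HDF5 attribute on it."""
    with tables.open_file(path, mode="w") as h5file:
        tbl = h5file.create_table("/series", "data1", Observation,
                                  createparents=True)
        # Anything set through .attrs is written into the file itself.
        tbl.attrs.last_obs = value


def read_last_obs(path):
    """Reopen the file and read the attribute back from disk."""
    with tables.open_file(path, mode="r") as h5file:
        tbl = h5file.get_node("/series", "data1")
        return tbl.attrs.last_obs


path = os.path.join(tempfile.mkdtemp(), "data.h5")
write_last_obs(path, 42.0)
print(read_last_obs(path))
```

Because the file is closed between the write and the read, the value can only have come from disk, unlike a plain instance attribute on the table object.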


> 2. Does anyone have any other suggestions? Besides separating the entities
> into separate tables where I could then just do a max on the timestamp
> field/col?
>

You could also use numpy.unique() to figure out the entity values and
then itertools.groupby() to separate the data out. (groupby might not
be the fastest thing to do here.) Or just use the where() method from
above for each entity. The point is that you only need the unique values
of the entity type column:

import numpy as np
entity_types = np.unique(tbl.cols.entity_type[:])
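
The groupby() route could look roughly like the sketch below. It is plain Python and uses a list of (entity_type, timestamp) pairs as a stand-in for rows read from the table; the function name and the sample data are made up for illustration. Note that groupby() only merges adjacent items with equal keys, so the rows are sorted by entity type first.

```python
from itertools import groupby
from operator import itemgetter


def last_obs_by_entity(rows):
    """Return {entity_type: latest timestamp} for (entity_type, timestamp) pairs."""
    # groupby() only groups consecutive equal keys, so sort by entity type first.
    ordered = sorted(rows, key=itemgetter(0))
    return {
        entity: max(ts for _, ts in group)
        for entity, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical sample rows standing in for table data.
rows = [("e1", 1.0), ("e2", 5.0), ("e1", 3.0), ("e2", 2.0)]
last_obs = last_obs_by_entity(rows)
print(last_obs)
```

The sort makes this O(n log n) and pulls every row into Python objects, which is why it may not be the fastest option on a large table.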

Also, if the timestamps are roughly chronological and the entities are
evenly dispersed, you could probably get away with reading only the end
of the table, which would be faster:

entity_types = np.unique(tbl.cols.entity_type[-100:])
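
Once a slice of rows is in memory as a NumPy structured array, the last timestamp per entity can be computed without knowing the entity types in advance. The following is a sketch only: the function name, the field names, and the sample array are assumptions, and the sample uses unicode strings for readability, whereas a PyTables string column would normally come back as bytes.

```python
import numpy as np


def last_obs_per_entity(rows):
    """Return {entity_type: max timestamp} from a structured array."""
    # `inverse` maps each row to the index of its entity in `entity_types`.
    entity_types, inverse = np.unique(rows["entity_type"], return_inverse=True)
    last = np.full(len(entity_types), -np.inf)
    # Unbuffered in-place maximum: each slot ends up holding the largest
    # timestamp seen for that entity.
    np.maximum.at(last, inverse, rows["timestamp"])
    return dict(zip(entity_types.tolist(), last.tolist()))


# Hypothetical stand-in for a tail slice of the table.
tail = np.array(
    [("e1", 1.0), ("e2", 5.0), ("e1", 3.0)],
    dtype=[("entity_type", "U2"), ("timestamp", "f8")],
)
print(last_obs_per_entity(tail))
```

If an entity never appears in the slice that was read, it will simply be missing from the result, so the tail-only shortcut depends on the "evenly dispersed" assumption holding.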

I hope this helps!
Be Well
Anthony


> --
> Aquil H. Abdullah
> aquil.abdul...@gmail.com
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>