[Pytables-users] advice on data representation

benjamin.bertrand Tue, 31 Jul 2012 02:15:22 -0700

Hello,

I've started to write a parser to convert ASTERIX data to HDF5, but I have 
some problems representing all the data.

I use table objects. I've defined a class for each category record (a record is 
made of different data items).
See below for an example for category 30.

1. Some data items are optional. Is there a good way to mark a column as valid?
For enum, I can easily add an "uninitialized" default value.
For (U)Int, most of the time there is no good default value that could tell me 
if the data is valid or not.
I thought about using np.nan, but that's only for float.
Or I could add a bool variable (valid) to each column.
Is there another way?

2. Some data items have a variable length (in fact some fields can be repeated).
In the class I030_050_DESC for example, if FX is set to 1, then the fields are 
repeated.
I know that fields cannot be of a variable length in a table object.
I could try to use a shape (max_length,) for those columns but the max length 
would be a bit arbitrary as there is no theoretical limit (even if in practice, 
it is often quite low).
Or should I try to represent the data using a VLArray?
I found it quite natural to represent my data as a table and I don't really see 
how I could do the same with an array.
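
To make the two options above concrete, here is a minimal sketch, not a recommendation. It pairs a table that carries an explicit bool "valid" column with a companion VLArray that holds the repeated values, one VLArray row per table row. The names TrackVelocity, velocity and track_number_extents, the sample values, and the file name are all made up for illustration, and the calls use the PyTables 3.x spelling (open_file, create_table, create_vlarray).

```python
import numpy as np
import tables


class TrackVelocity(tables.IsDescription):
    """Hypothetical record with an explicit validity flag."""
    SPEED = tables.UInt16Col(pos=0)
    HEADING = tables.UInt16Col(pos=1)
    valid = tables.BoolCol(dflt=False, pos=2)  # True only if the item was present


# In-memory HDF5 file so nothing is written to disk; the name is arbitrary.
h5 = tables.open_file("asterix_sketch.h5", mode="w",
                      driver="H5FD_CORE", driver_core_backing_store=0)

velocity = h5.create_table("/", "velocity", TrackVelocity)
# Row i of the VLArray holds all repeated values that belong to table row i.
extents = h5.create_vlarray("/", "track_number_extents", tables.UInt16Atom())

# Made-up sample data: ((speed, heading) or None if absent, repeated values).
records = [((120, 90), [7, 8, 9]), (None, [42]), ((300, 180), [5, 6])]

row = velocity.row
for item, repeated in records:
    # Set every field on every row so no value leaks from the previous one.
    row["SPEED"], row["HEADING"] = item if item is not None else (0, 0)
    row["valid"] = item is not None
    row.append()
    extents.append(np.array(repeated, dtype=np.uint16))
velocity.flush()

data = velocity.read()                                # structured NumPy array
valid_speeds = data["SPEED"][data["valid"]].tolist()  # keep only valid rows
extent_lengths = [len(e) for e in extents.read()]
h5.close()
```

The table keeps its natural row-per-record shape, and only the variable-length part moves out. The cost is that the two datasets must always be appended to in lockstep.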

Cheers,

Benjamin

import tables

class I030_180_DESC(tables.IsDescription):
    """Calculated Track Velocity (Polar)"""
    SPEED = tables.UInt16Col(pos=0)
    HEADING = tables.UInt16Col(pos=1)

class I030_181_DESC(tables.IsDescription):
    """Calculated Track Velocity (Cartesian)"""
    X = tables.Int16Col(pos=0)
    Y = tables.Int16Col(pos=1)

class I030_340_DESC(tables.IsDescription):
    """Last Measured Mode 3/A"""
    V = tables.EnumCol(tables.Enum({
        "Code validated": 0,
        "Code not validated": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=0)
    G = tables.EnumCol(tables.Enum({
        "Default": 0,
        "Garbled code": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=1)
    L = tables.EnumCol(tables.Enum({
        "MODE 3/A code as derived from the reply of the transponder": 0,
        "Smoothed MODE 3/A code as provided by a local tracker": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=2)
    sb = tables.UInt8Col(pos=3)
    mode_3_a = tables.UInt16Col(pos=4)

class I030_400_DESC(tables.IsDescription):
    """Callsign"""
    callsign = tables.StringCol(7, pos=0)

class I030_050_DESC(tables.IsDescription):
    """Artas Track Number"""
    AUI = tables.UInt8Col(pos=0)
    unused = tables.UInt8Col(pos=1)
    STN = tables.UInt16Col(pos=2)
    FX = tables.EnumCol(tables.Enum({
        "end of data item": 0,
        "extension into next extent": 1,
        "uninitialized": 255
        }), "uninitialized",
        base="uint8",
        pos=3)

class I030Record(tables.IsDescription):
    """Cat 030 record"""
    ff_timestamp = tables.Time32Col()
    I030_010 = I030_010_DESC()
    I030_015 = I030_015_DESC()
    I030_030 = I030_030_DESC()
    I030_035 = I030_035_DESC()
    I030_040 = I030_040_DESC()
    I030_070 = I030_070_DESC()
    I030_170 = I030_170_DESC()
    I030_100 = I030_100_DESC()
    I030_180 = I030_180_DESC()
    I030_181 = I030_181_DESC()
    I030_060 = I030_060_DESC()
    I030_150 = I030_150_DESC()
    I030_140 = I030_140_DESC()
    I030_340 = I030_340_DESC()
    I030_400 = I030_400_DESC()
    # ...
    I030_210 = I030_210_DESC()
    I030_120 = I030_120_DESC()
    I030_050 = I030_050_DESC()
    I030_270 = I030_270_DESC()
    I030_370 = I030_370_DESC()

From: Anthony Scopatz [mailto:[email protected]]
Sent: 12 July 2012 00:02
To: Discussion list for PyTables
Subject: Re: [Pytables-users] advice on using PyTables

Hello Benjamin,

Not knowing too much about the ASTERIX format, other than what you said and what 
is in the links, I would say that this is a good fit for HDF5 and PyTables.  
PyTables will certainly help you read in the data and manipulate it.

However, before you abandon hachoir completely, I will say it is a lot easier 
to write HDF5 files in PyTables than to use the HDF5 C API. If hachoir is too 
slow, have you tried profiling the code to see what is taking up the most time? 
Maybe you could just rewrite these parts in C? Have you looked into 
Cythonizing it? Also, you don't seem to be using numpy to read in the data 
(ASTERIX makes this a bit tricky, but not insurmountably so).
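
As a rough illustration of the numpy suggestion, the sketch below decodes a run of fixed-length items in one call with a structured dtype. The layout (two big-endian unsigned 16-bit fields named SPEED and HEADING) and the sample bytes are assumptions made up for the example. Real ASTERIX records are variable length, so this only applies once the fixed-length pieces have been located and gathered.

```python
import numpy as np

# Assumed layout: two big-endian unsigned 16-bit fields per item.
velocity_dtype = np.dtype([("SPEED", ">u2"), ("HEADING", ">u2")])

# Made-up bytes for two consecutive items.
raw = bytes([0x00, 0x78, 0x00, 0x5A,
             0x01, 0x2C, 0x00, 0xB4])

# Decode the whole buffer at once instead of looping in pure Python.
items = np.frombuffer(raw, dtype=velocity_dtype)
speeds = items["SPEED"].tolist()
headings = items["HEADING"].tolist()
```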

I ask the above, just so you don't have to completely rewrite everything.  You 
are correct though that pure python is probably not sufficient.  Feel free to 
ask more questions here.

Be Well
Anthony

On Wed, Jul 11, 2012 at 6:52 AM, <[email protected]> wrote:
Hi,

I'm working with Air Traffic Management and would like to perform checks / 
compute statistics on ASTERIX data.
ASTERIX is an ATM Surveillance Data Binary Messaging Format 
(http://www.eurocontrol.int/asterix/public/standard_page/overview.html)

The data consist of a concatenation of consecutive data blocks.
Each data block consists of data category + length + records.
Each record is of variable length and consists of several data items (that are 
well defined for each category).
Some data items might be present or not depending on a field specification 
(bitfield).
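
The structure described above can be sketched in a few lines of standard-library Python. This sketch assumes a 3-byte block header (a 1-byte category followed by a 2-byte big-endian length that counts the whole block) and a field specification in which the seven high bits of each byte flag one data item each and the lowest bit signals that another byte follows. The function names and the sample bytes are made up for illustration, so please check the layout against the ASTERIX specification before relying on it.

```python
import struct


def split_data_blocks(buf):
    """Yield (category, payload) for each data block in buf.

    Assumes a 3-byte header: 1-byte category, then a 2-byte big-endian
    length that counts the whole block, header included.
    """
    offset = 0
    while offset + 3 <= len(buf):
        cat, length = struct.unpack_from(">BH", buf, offset)
        if length < 3 or offset + length > len(buf):
            raise ValueError("corrupt or truncated block at offset %d" % offset)
        yield cat, buf[offset + 3:offset + length]
        offset += length


def read_fspec(payload):
    """Return (numbers of the items present, size of the bitfield in bytes).

    Assumes the seven high bits of each byte flag one item each and the
    lowest bit says whether another bitfield byte follows.
    """
    present = []
    size = 0
    while True:
        byte = payload[size]
        for bit in range(7):
            if byte & (0x80 >> bit):
                present.append(size * 7 + bit + 1)
        size += 1
        if not byte & 0x01:
            return present, size


# Made-up stream: one category 30 block and one category 48 block.
stream = (bytes([30, 0, 6, 0b10100001, 0b01000000, 0xFF])
          + bytes([48, 0, 4, 0b10000000]))
blocks = list(split_data_blocks(stream))
first_fspec = read_fspec(blocks[0][1])
```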

I started to write a parser using hachoir 
(https://bitbucket.org/haypo/hachoir/overview), a pure Python library.
But the parsing was really too slow and took a lot of memory.
That's not really usable.

From what I read, PyTables could really help to manipulate and analyze the 
data.
So I've been thinking about writing a tool (probably in C) to convert my 
ASTERIX format to HDF5.

Before I start, I'd like confirmation that this seems like a suitable 
application for PyTables.
Is there another approach than writing a conversion tool to HDF5?

Thanks in advance

Benjamin
