Re: [Pytables-users] Pytable data retrieval performance

Francesc Altet Tue, 03 Oct 2006 08:01:03 -0700

> From: Jun Li <[EMAIL PROTECTED]>
> To: pytables-users@lists.sourceforge.net
> Date: Friday 21:27:03
>
> Hello, All:
>
> I am using Python 2.4, pytables 1.3, numarray-1.5.1, hdf5-1.6.5 on Linux
> 2.4 running on a pretty powerful Dell server.
>
> I have a pytable which has 7 columns and holds roughly 2.6 million rows of
> data.
> Here is my table structure:
>
> class Ttable (IsDescription):
>         n_id =  StringCol(length=16,pos=1)
>         date = IntCol(pos=2)
>         tmax = Float32Col(pos=3)
>         tmax_flag = IntCol(pos=4)
>         tmin = Float32Col(pos=5)
>         tmin_flag = IntCol(pos=6)
>         mc = IntCol(pos=7)
>
> I have a little program that retrieves data according to some conditions
> and does some calculations or processing with the retrieved data:
>
> code sample:
>
> tbl_T = h5file.root.T_table
> num_of_days = int(integertoDate(tbl_T.attrs.endDate).absdays -
>                   integertoDate(tbl_T.attrs.startDate).absdays)
>
> i = tbl_T.nrows
> for x in tbl_T:
>     if (i % num_of_days) == 0:
>         n_id = x['n_id']
>         numofrows = 0
>         ct, mc = 0, 0
>         t, tx, tn = 0.0, 0.0, 0.0
>         tnct, txct = 0, 0
>         hdd, cdd = 0.0, 0.0
>         gd4, gd5 = 0.0, 0.0
>     if x['date'] >= startDate:
>         if n_id == x['n_id'] and x['date'] < endDate:
>             if (x['tmax_flag'] and (x['tmax'] < maxVal) and
>                     (x['tmax'] >= minVal) and
>                     x['tmin_flag'] and (x['tmin'] < maxVal) and
>                     (x['tmin'] >= minVal)):
>                 # do something
>             else:
>                 mc = mc + 1
>
>             numofrows = numofrows + 1
>
>     if numofrows == nDays:
>         # do other thing
>
>
> The performance is not very good, far worse than I expected (a run takes
> roughly 140 seconds). I found the performance tips on indexed searches in
> the "PyTables User's Guide" manual, so I indexed all the columns that
> appear in the selection conditions:
> class Ttable (IsDescription):
>         n_id =  StringCol(length=16,pos=1, indexed=1)
>         date = IntCol(pos=2,indexed=1)
>         tmax = Float32Col(pos=3,indexed=1)
>         tmax_flag = IntCol(pos=4,indexed=1)
>         tmin = Float32Col(pos=5,indexed=1)
>         tmin_flag = IntCol(pos=6,indexed=1)
>         mc = IntCol(pos=7)
>
> I rebuilt the table and reran the retrieval program, but the run time was
> almost the same, with no improvement whatsoever. I even tried indexing only
> the 'n_id' and/or 'date' columns, or other combinations of columns rather
> than all of them, and reran the program; the same thing happened. Why does
> indexed search have no effect in my case?


Because you are making the selections outside the Table.where() iterator. You
absolutely need to use this 'where' selector in order to take advantage of the
indexing capabilities. Please read the docs and examples in [1] carefully to
get an idea of how to use it.

Incidentally, be aware that you can only pass a single condition to 'where'
if you want the index to be used. Still, you can combine this single
condition with other checks inside the same iterator loop to get better
speed for your lookups. Look carefully at [2] for more info on this.
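
The pattern described above (one indexed condition narrows the candidate
rows, the remaining conditions are checked in the loop body) can be sketched
as follows. This is only a standard-library analogy and not PyTables code:
the sorted list stands in for a column index, and the sample rows and the
names where_date_between and select are hypothetical.

```python
from bisect import bisect_left

# Hypothetical sample rows: (n_id, date, tmax, tmax_flag)
rows = [
    ("A", 20060101, 12.5, 1),
    ("A", 20060102, 99.0, 1),
    ("B", 20060101, 8.0, 0),
    ("B", 20060103, 15.0, 1),
    ("A", 20060104, 14.0, 1),
]

# Stand-in for a column index: (date, row position) pairs kept sorted by date.
date_index = sorted((row[1], pos) for pos, row in enumerate(rows))
dates = [d for d, _ in date_index]


def where_date_between(start, end):
    """Yield rows with start <= date < end, located via binary search."""
    lo = bisect_left(dates, start)
    hi = bisect_left(dates, end)
    for _, pos in date_index[lo:hi]:
        yield rows[pos]


def select(start, end, n_id, min_val, max_val):
    """Indexed date range first, remaining conditions inside the loop."""
    hits = []
    for row in where_date_between(start, end):
        # Residual (non-indexed) conditions are tested only on candidates.
        if row[0] == n_id and row[3] and min_val <= row[2] < max_val:
            hits.append(row)
    return hits


result = select(20060101, 20060104, "A", -50.0, 50.0)
```

The point of the sketch is that the loop body only ever sees rows that
already satisfy the indexed condition, instead of every row in the table.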

In PyTables Pro there will be no limit on the number of conditions that
you can put in the 'where' iterator (so you can take greater advantage of
your indexes). See slides 53, 54 and 55 of my presentation at the past
EuroPython [3] for a glimpse of what you can expect from complex queries
with the Pro version.

>
> I read some postings on mail-lists archive. It is said that string index
> search is slower than integer. My 'n_id' column has to be string type. If
> I instead generate Integer ids and feed them to the column(e.g. using
> hash() function) and then index this integer column, does this help
> improve performance?

Yeah. String lookups are generally quite a bit slower than integer or
floating point ones. So, if you can use an index on an integer column, do
so to achieve maximum performance.
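
One way to obtain integer ids, sketched here under assumptions, is an
explicit lookup table rather than hash(). The value hash() returns for a
string is not guaranteed to be the same across interpreter versions or runs,
and different strings can collide, so ids stored in a file may not match
later. The function name build_id_table and the sample ids below are
hypothetical.

```python
def build_id_table(n_ids):
    """Map each distinct string id to a small integer, in first-seen order."""
    table = {}
    for n_id in n_ids:
        if n_id not in table:
            table[n_id] = len(table)
    return table


# Hypothetical string ids as they might appear in the 'n_id' column.
string_ids = ["ST-0001", "ST-0002", "ST-0001", "ST-0003"]

id_table = build_id_table(string_ids)

# Integer values to store in an indexed integer column instead of the strings.
int_ids = [id_table[s] for s in string_ids]

# Reverse mapping, to recover the original string after a lookup.
reverse_table = dict((v, k) for k, v in id_table.items())
```

The mapping itself has to be saved alongside the data (for example in a
small second table) so that the integers can be translated back.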

[1] http://www.pytables.org/docs/manual/x2981.html (section 4.6.2.16)
[2] http://www.pytables.org/docs/manual/x5297.html (section 5.2.2) 
[3] http://www.pytables.org/docs/FindingNeedles.pdf

HTH,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

