Hi, this was sent to the pytables list some days ago. Please note that you should subscribe to the list to avoid having your messages rejected.
Cheers,

------------------------ Original message -----------------------------------
From: Jun Li <[EMAIL PROTECTED]>
To: pytables-users@lists.sourceforge.net
Date: Friday 21:27:03

Hello, All:

I am using Python 2.4, pytables 1.3, numarray-1.5.1 and hdf5-1.6.5 on Linux 2.4, running on a pretty powerful Dell server. I have a pytable which has 7 columns and holds roughly 2.6 million rows of data. Here is my table structure:

    class Ttable (IsDescription):
        n_id = StringCol(length=16,pos=1)
        date = IntCol(pos=2)
        tmax = Float32Col(pos=3)
        tmax_flag = IntCol(pos=4)
        tmin = Float32Col(pos=5)
        tmin_flag = IntCol(pos=6)
        mc = IntCol(pos=7)

I have a little program that retrieves data according to some conditions and does some calculations or processing with the retrieved data.

Code sample:

    tbl_T = h5file.root.T_table
    num_of_days = int(integertoDate(tbl_T.attrs.endDate).absdays -
                      integertoDate(tbl_T.attrs.startDate).absdays)
    i = tbl_T.nrows
    for x in tbl_T :
        if (i%num_of_days) == 0 :
            n_id = x['n_id']
            numofrows = 0
            ct,mc = 0,0
            t,tx,tn = 0.0,0.0,0.0
            tnct,txct = 0,0
            hdd,cdd = 0.0,0.0
            gd4,gd5 = 0.0,0.0
        if x['date'] >= startDate :
            if n_id == x['n_id'] and x['date'] < endDate :
                if (x['tmax_flag'] and (x['tmax'] < maxVal) and (x['tmax'] >= minVal) and
                    x['tmin_flag'] and (x['tmin'] < maxVal) and (x['tmin'] >= minVal)) :
                    pass  # do something
                else:
                    mc = mc + 1
                numofrows = numofrows + 1
                if numofrows == nDays :
                    pass  # do other thing

The performance is not very good, far worse than I expected (it takes roughly 140 seconds per run). I found the performance tips regarding indexed searches in the "PyTables User's Guide" manual, so I indexed all columns that appear in the selection conditions:

    class Ttable (IsDescription):
        n_id = StringCol(length=16,pos=1,indexed=1)
        date = IntCol(pos=2,indexed=1)
        tmax = Float32Col(pos=3,indexed=1)
        tmax_flag = IntCol(pos=4,indexed=1)
        tmin = Float32Col(pos=5,indexed=1)
        tmin_flag = IntCol(pos=6,indexed=1)
        mc = IntCol(pos=7)

I rebuilt the table and reran the retrieval program. The run time was almost the same, with no improvement whatsoever. I even tried indexing only 'n_id' and/or 'date', or other combinations of columns rather than all of them, and reran the program; the same thing happened.

Why does indexed search have no effect in my case?

I read some postings in the mailing-list archive. It is said that searching a string index is slower than searching an integer one. My 'n_id' column has to be of string type. If I instead generate integer ids (e.g. using the hash() function), feed them to the column and then index this integer column, would that help improve performance?

In my case (as the above code sample shows), are there any other ways to improve performance?

Any help, suggestions and comments are appreciated.

Thanks.

Dave

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
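
A possible explanation for the unchanged run time: the loop in the post walks every row with "for x in tbl_T" and tests the conditions in Python, so nothing ever asks an index which rows match. As far as I know, PyTables only consults a column index when the condition is handed to the library through a selection call such as Table.where(); the exact syntax depends on the version, so please check the User's Guide for yours. The sketch below is plain Python, not PyTables code, and only illustrates the difference between scanning everything and looking rows up through a sorted index. The names rows, full_scan, build_date_index and indexed_select, and the sample data, are all made up for the illustration.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for the table: (n_id, date, tmax) tuples.
rows = [
    ("A001", 20060101, 10.5),
    ("A001", 20060102, 11.0),
    ("A001", 20060103, 9.5),
    ("B002", 20060101, 3.0),
    ("B002", 20060102, 4.5),
    ("B002", 20060103, 5.0),
]


def full_scan(rows, start_date, end_date):
    """Visit every row and test the condition in Python,
    which is what a plain 'for x in table' loop does."""
    return [r for r in rows if start_date <= r[1] < end_date]


def build_date_index(rows):
    """Sorted (date, row_number) pairs: a toy version of a column index."""
    return sorted((r[1], i) for i, r in enumerate(rows))


def indexed_select(rows, index, start_date, end_date):
    """Binary-search the index for start_date <= date < end_date,
    then fetch only the matching rows, in their original order."""
    lo = bisect_left(index, (start_date, -1))  # first entry with date >= start_date
    hi = bisect_left(index, (end_date, -1))    # first entry with date >= end_date
    row_numbers = sorted(i for _, i in index[lo:hi])
    return [rows[i] for i in row_numbers]


if __name__ == "__main__":
    index = build_date_index(rows)
    print(full_scan(rows, 20060102, 20060103))
    print(indexed_select(rows, index, 20060102, 20060103))
```

Both functions return the same rows; the difference is that the indexed version touches only the entries inside the date range, while the scan pays for all rows every time. With 2.6 million rows and a narrow date window, that is where any gain from indexing would have to come from.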