I recently used ptrepack to compact my hdf5 file and forgot to active the options to propagate indexes. Just out of curiosity I decided to compare performance between the two tables (one with index and one without) for some queries.
The table structure looks like this: "gene_mid_pos": UInt32Col(shape=(), dflt=0, pos=3), "gene_start": UInt32Col(shape=(), dflt=0, pos=4), "mac": Int32Col(shape=(), dflt=0, pos=5), "maf": Float32Col(shape=(), dflt=0.0, pos=6), "perc_var_expl": Float32Col(shape=(), dflt=0.0, pos=7), "score": Float32Col(shape=(), dflt=0.0, pos=8), "snp_chr": UInt8Col(shape=(), dflt=0, pos=9), "snp_pos": UInt32Col(shape=(), dflt=0, pos=10)} byteorder := 'little' chunkshape := (4854,) autoIndex := True colindexes := { "gene": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "gene_start": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "snp_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "gene_mid_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "perc_var_expl": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "mac": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "score": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "snp_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "maf": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "gene_end": Index(6, medium, shuffle, zlib(1)).is_CSI=False, "gene_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False} and contains around 7 million rows (sorted after score in descending order) To verify the results I also created a second hdf5 file with ptrepack where I propagate the indexes just to make sure that compacting isn't the root of the performance issue. Here are my results: orig_table ... original table with indexes com_table_no_index = compacted table without indexes com_table_with_index = compacted table with indexes table.readWhere('score > 5') orig_table: 19.4s com_table_with_index: 18.7s com_table_no_index: 7.9s table.readWhere('gene_chr >2') (really low selectivity because snp_chr is between 1 and 5) orig_table: 46.3s com_table_with_index: 44.8s com_table_no_index: 37.8s table.readWhere('(maf >0.2 and maf < 0.9 )') orig_table: 19.4 s com_table_with_index: 19.2 s com_table_no_index: 9.31 s Both queries have a hit rate of 90% table.readWhere('(maf >0.21 and maf < 0.22 )' orig_table: 35.8 s com_table_with_index: 35.7 s com_table_no_index: 29.3 s I tried also a query with a smaller hit-rate (10%) table.readWhere('(gene_start >2000) & (gene_end<900000)') orig_table: 5.67s com_table_with_index: 5.52s com_table_no_index: 4.22s table.readWhere('(snp_chr ==2 )') (really low selectivity because snp_chr is between 1 and 5) orig_table: 12 s com_table_with_index: 12.3 s com_table_no_index: 9.16 s I checked with willQueryUseIndexing if the indexes are used and in all cases they were. Here are some observations: - Querying for a column with high selectivity (score,maf) runs more than 2x fast on a table without index compared to one with an index. - Querying for a column with low selectivity (chr) or with small hit-rate still shows a difference but isn't as high as in the first case. Anyone knows why querying an indexed table is slower than an un-indexed one? Thanks in advance I am using pytables 2.3 cheers Ümit ------------------------------------------------------------------------------ Try before you buy = See our experts in action! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-dev2 _______________________________________________ Pytables-users mailing list Pytables-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/pytables-users