Hi Anthony,

I did some profiling with line_profiler. Here are the results for a table with an index:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1469                                           def readWhere( self, condition, condvars=None, field=None,
  1470                                                          start=None, stop=None, step=None ):
  1471                                               """
  1480                                               """
  1481         1            8      8.0      0.0      self._checkFieldIfNumeric(field)
  1482
  1483         1            2      2.0      0.0      coords = [ p.nrow for p in
  1484   6951894     25942594      3.7     92.4        self._where(condition, condvars, start, stop, step) ]
  1485         1            5      5.0      0.0      self._whereCondition = None  # reset the conditions
  1486         1            6      6.0      0.0      if len(coords) > 1:
  1487         1           13     13.0      0.0          cstart, cstop = coords[0], coords[-1]+1
  1488         1            5      5.0      0.0          if cstop - cstart == len(coords):
  1489                                                       # Chances for monotonically increasing row values. Refine.
  1490         1            4      4.0      0.0              inc_seq = numpy.alltrue(
  1491         1      1124713 1124713.0      4.0                numpy.arange(cstart, cstop) == numpy.array(coords))
  1492         1            5      5.0      0.0              if inc_seq:
  1493         1      1000172 1000172.0      3.6                return self.read(cstart, cstop, field=field)
  1494                                                   return self.readCoordinates(coords, field)

File: profile.py  Function: function_to_test at line 11  Total time: 39.1439 s

The profile for the _where() function looks like this:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1423                                           def _where( self, condition, condvars,
  1424                                                       start=None, stop=None, step=None ):
  1425                                               """Low-level counterpart of `self.where()`."""
  1426         1            2      2.0      0.0      if profile: tref = time()
  1427         1            1      1.0      0.0      if profile: show_stats("Entering table._where", tref)
  1428                                               # Adjust the slice to be used.
  1429         1           20     20.0      0.0      (start, stop, step) = self._processRangeRead(start, stop, step)
  1430         1            2      2.0      0.0      if start >= stop:   # empty range, reset conditions
  1431                                                   self._useIndex = False
  1432                                                   self._whereCondition = None
  1433                                                   return iter([])
  1434
  1435                                               # Compile the condition and extract usable index conditions.
  1436         1          153    153.0      0.0      condvars = self._requiredExprVars(condition, condvars, depth=3)
  1437         1         1299   1299.0      0.3      compiled = self._compileCondition(condition, condvars)
  1438
  1439                                               # Can we use indexes?
  1440         1            2      2.0      0.0      if compiled.index_expressions:
  1441         1            1      1.0      0.0          chunkmap = _table__whereIndexed(
  1442         1       439602 439602.0     99.6              self, compiled, condition, condvars, start, stop, step)
  1443         1            5      5.0      0.0          if type(chunkmap) != numpy.ndarray:
  1444                                                       # If it is not a NumPy array it should be an iterator
  1445                                                       # Reset conditions
  1446                                                       self._useIndex = False
  1447                                                       self._whereCondition = None
  1448                                                       # ...and return the iterator
  1449                                                       return chunkmap
  1450                                               else:
  1451                                                   chunkmap = None  # default to an in-kernel query
  1452
  1453         2            7      3.5      0.0      args = [condvars[param] for param in compiled.parameters]
  1454         1            5      5.0      0.0      self._whereCondition = (compiled.function, args)
  1455         1          141    141.0      0.0      row = tableExtension.Row(self)
  1456         1            2      2.0      0.0      if profile: show_stats("Exiting table._where", tref)
  1457         1           27     27.0      0.0      return row._iter(start, stop, step, chunkmap=chunkmap)

File: profile.py  Function: function_to_test at line 11  Total time: 22.4212 s

The profile for a non-indexed table looks like this:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1469                                           def readWhere( self, condition, condvars=None, field=None,
  1470                                                          start=None, stop=None, step=None ):
  1480                                               """
  1481         1            8      8.0      0.0      self._checkFieldIfNumeric(field)
  1482
  1483         1            1      1.0      0.0      coords = [ p.nrow for p in
  1484   6951894     14060474      2.0     87.2        self._where(condition, condvars, start, stop, step) ]
  1485         1            4      4.0      0.0      self._whereCondition = None  # reset the conditions
  1486         1            4      4.0      0.0      if len(coords) > 1:
  1487         1           10     10.0      0.0          cstart, cstop = coords[0], coords[-1]+1
  1488         1            5      5.0      0.0          if cstop - cstart == len(coords):
  1489                                                       # Chances for monotonically increasing row values. Refine.
  1490         1            3      3.0      0.0              inc_seq = numpy.alltrue(
  1491         1      1104094 1104094.0      6.8                numpy.arange(cstart, cstop) == numpy.array(coords))
  1492         1            4      4.0      0.0              if inc_seq:
  1493         1       959634 959634.0      6.0                return self.read(cstart, cstop, field=field)
  1494                                                   return self.readCoordinates(coords, field)

File: profile.py  Function: function_to_test at line 11  Total time: 27.5148 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1423                                           def _where( self, condition, condvars,
  1424                                                       start=None, stop=None, step=None ):
  1425                                               """Low-level counterpart of `self.where()`."""
  1426         1            2      2.0      0.1      if profile: tref = time()
  1427         1            2      2.0      0.1      if profile: show_stats("Entering table._where", tref)
  1428                                               # Adjust the slice to be used.
  1429         1           29     29.0      2.1      (start, stop, step) = self._processRangeRead(start, stop, step)
  1430         1            2      2.0      0.1      if start >= stop:   # empty range, reset conditions
  1431                                                   self._useIndex = False
  1432                                                   self._whereCondition = None
  1433                                                   return iter([])
  1434
  1435                                               # Compile the condition and extract usable index conditions.
  1436         1          154    154.0     10.9      condvars = self._requiredExprVars(condition, condvars, depth=3)
  1437         1         1057   1057.0     75.0      compiled = self._compileCondition(condition, condvars)
  1438
  1439                                               # Can we use indexes?
  1440         1            2      2.0      0.1      if compiled.index_expressions:
  1441                                                   chunkmap = _table__whereIndexed(
  1442                                                       self, compiled, condition, condvars, start, stop, step)
  1443                                                   if type(chunkmap) != numpy.ndarray:
  1444                                                       # If it is not a NumPy array it should be an iterator
  1445                                                       # Reset conditions
  1446                                                       self._useIndex = False
  1447                                                       self._whereCondition = None
  1448                                                       # ...and return the iterator
  1449                                                       return chunkmap
  1450                                               else:
  1451         1            2      2.0      0.1          chunkmap = None  # default to an in-kernel query
  1452
  1453         2            4      2.0      0.3      args = [condvars[param] for param in compiled.parameters]
  1454         1            2      2.0      0.1      self._whereCondition = (compiled.function, args)
  1455         1          128    128.0      9.1      row = tableExtension.Row(self)
  1456         1            2      2.0      0.1      if profile: show_stats("Exiting table._where", tref)
  1457         1           23     23.0      1.6      return row._iter(start, stop, step, chunkmap=chunkmap)

File: profile.py  Function: function_to_test at line 11  Total time: 10.7759 s

Well, I have the hdf5 file which reproduces the problem. I will try to
write a script that creates an hdf5 file showing the same behavior.

On Sun, Jan 22, 2012 at 11:20 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> Hello Ümit,
>
> Yes, this is some seriously messed-up behavior. I would suggest profiling
> (kernprof, line_profiler) to try to figure out what is going on here. This
> is counter-productive at the very least...
>
> My suspicion is that when indexes are used, BOTH the index and the table
> values themselves are being compared, rather than ONLY the index being
> compared. This is just my suspicion, but it would seem to give this
> behavior. Profiling would at least tell us which function or method is the
> troublemaker.
>
> Do you have a script that reproduces this as a whole?
>
> Be Well
> Anthony
>
> On Sat, Jan 21, 2012 at 7:23 AM, Ümit Seren <uemit.se...@gmail.com> wrote:
>>
>> I recently used ptrepack to compact my hdf5 file and forgot to activate
>> the option to propagate indexes.
>> Just out of curiosity I decided to compare performance between the two
>> tables (one with an index and one without) for some queries.
>>
>> The table structure looks like this:
>>
>> "gene_mid_pos": UInt32Col(shape=(), dflt=0, pos=3),
>> "gene_start": UInt32Col(shape=(), dflt=0, pos=4),
>> "mac": Int32Col(shape=(), dflt=0, pos=5),
>> "maf": Float32Col(shape=(), dflt=0.0, pos=6),
>> "perc_var_expl": Float32Col(shape=(), dflt=0.0, pos=7),
>> "score": Float32Col(shape=(), dflt=0.0, pos=8),
>> "snp_chr": UInt8Col(shape=(), dflt=0, pos=9),
>> "snp_pos": UInt32Col(shape=(), dflt=0, pos=10)}
>> byteorder := 'little'
>> chunkshape := (4854,)
>> autoIndex := True
>> colindexes := {
>> "gene": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "gene_start": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "snp_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "gene_mid_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "perc_var_expl": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "mac": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "score": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "snp_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "maf": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "gene_end": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>> "gene_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
>>
>> and contains around 7 million rows (sorted by score in descending order).
>>
>> To verify the results, I also created a second hdf5 file with ptrepack
>> where I propagated the indexes, just to make sure that compacting isn't
>> the root of the performance issue.
>> Here are my results:
>>
>> orig_table = original table with indexes
>> com_table_no_index = compacted table without indexes
>> com_table_with_index = compacted table with indexes
>>
>> table.readWhere('score > 5')
>>
>> orig_table: 19.4 s
>> com_table_with_index: 18.7 s
>> com_table_no_index: 7.9 s
>>
>> table.readWhere('gene_chr > 2')  (really low selectivity because snp_chr
>> is between 1 and 5)
>>
>> orig_table: 46.3 s
>> com_table_with_index: 44.8 s
>> com_table_no_index: 37.8 s
>>
>> table.readWhere('(maf > 0.2 and maf < 0.9)')
>>
>> orig_table: 19.4 s
>> com_table_with_index: 19.2 s
>> com_table_no_index: 9.31 s
>>
>> Both queries have a hit rate of 90%.
>>
>> table.readWhere('(maf > 0.21 and maf < 0.22)')
>>
>> orig_table: 35.8 s
>> com_table_with_index: 35.7 s
>> com_table_no_index: 29.3 s
>>
>> I also tried a query with a smaller hit rate (10%):
>>
>> table.readWhere('(gene_start > 2000) & (gene_end < 900000)')
>>
>> orig_table: 5.67 s
>> com_table_with_index: 5.52 s
>> com_table_no_index: 4.22 s
>>
>> table.readWhere('(snp_chr == 2)')  (really low selectivity because
>> snp_chr is between 1 and 5)
>>
>> orig_table: 12 s
>> com_table_with_index: 12.3 s
>> com_table_no_index: 9.16 s
>>
>> I checked with willQueryUseIndexing whether the indexes are used, and in
>> all cases they were.
>>
>> Here are some observations:
>>
>> - Querying a column with high selectivity (score, maf) runs more than
>> 2x faster on a table without an index than on one with an index.
>> - Querying a column with low selectivity (chr) or with a small hit rate
>> still shows a difference, but it isn't as large as in the first case.
>>
>> Does anyone know why querying an indexed table is slower than querying an
>> un-indexed one?
>>
>> Thanks in advance.
>>
>> I am using pytables 2.3.
>>
>> cheers
>> Ümit
>>
>> ------------------------------------------------------------------------------
>> Try before you buy = See our experts in action!
>> The most comprehensive online learning library for Microsoft developers
>> is just $99.99!
>> Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>> Metro Style Apps, more. Free future releases when you subscribe now!
>> http://p.sf.net/sfu/learndevnow-dev2
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
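P.S. One side observation from the profiles above: the refinement step in
readWhere (lines 1490-1491) builds a full arange and a full coords array just
to test whether the matched rows form one contiguous run, which costs ~1.1 s
on ~7 million hits in both the indexed and non-indexed case. Here is a small
sketch of that check in plain NumPy, next to a cheaper diff-based equivalent.
This is only an illustration, not the PyTables implementation; the function
names are made up, and numpy.alltrue from the original is replaced by .all()
since alltrue is gone in modern NumPy. (The dominant cost, though, is still
the 92.4% spent iterating rows inside _where, not this refinement.)

    import numpy

    def is_contiguous_range_arange(coords):
        # The shape of the check readWhere performs: compare the coords
        # against an explicit arange(cstart, cstop), allocating two
        # len(coords)-sized arrays just to test contiguity.
        coords = numpy.asarray(coords)
        cstart, cstop = coords[0], coords[-1] + 1
        if cstop - cstart != len(coords):
            return False
        return bool((numpy.arange(cstart, cstop) == coords).all())

    def is_contiguous_range(coords):
        # Equivalent result via numpy.diff: a sorted integer sequence is
        # contiguous exactly when every consecutive difference is 1.
        # Avoids materializing the arange.
        coords = numpy.asarray(coords)
        if len(coords) < 2:
            return True
        return bool((numpy.diff(coords) == 1).all())

Both return True for e.g. [3, 4, 5, 6] and False for [3, 4, 6], so either
could feed the `if inc_seq:` fast path that calls self.read(cstart, cstop).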