Hi Ümit,
On inspecting the issue a bit more, it seems like the culprit is
Index.get_chunkmap() in index.py. I can't tell what exactly in this
function is causing the slowdown, but I do see a couple of possible
problem spots. More profiling is probably needed. Running
kernprof on the index.py file would be helpful. (Or on a
demonstration script, too.)
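In case it helps, here is a rough sketch of a kernprof driver. Everything
here is a placeholder -- `function_to_test` stands in for whatever query
you are actually timing:

```python
# Hypothetical kernprof driver -- run it as, e.g.:
#   kernprof.py -l -v repro.py
# (with line_profiler installed).  The @profile decorator is injected
# by kernprof at run time; the fallback below keeps the script
# runnable on its own as well.
try:
    profile
except NameError:
    def profile(func):
        # No-op stand-in when the script is not run under kernprof.
        return func

@profile
def function_to_test(n):
    # Stand-in workload; in the real script this body would be the
    # slow call, e.g. table.readWhere('score > 5').
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    print(function_to_test(1000))
```

Running it under kernprof with -l -v should dump the per-line table
right to the terminal, which avoids the .lprof attachment problem.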
Also, it turns out that we have some profiling & reporting built in to
PyTables. If you include the following in your script, you should see
the results (not as detailed as line_profiler's):

    import tables.index
    tables.index.profile = True
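For coarser numbers without any profiler at all, a plain wall-clock timer
in the spirit of PyTables' show_stats() may also do. A minimal sketch --
the `timed` helper and its stand-in workload are mine, not part of
PyTables:

```python
import time

def timed(label, func, *args, **kwargs):
    # Report wall-clock time for a single call, roughly what
    # tables.index's show_stats() does at its checkpoints.
    tref = time.time()
    result = func(*args, **kwargs)
    print("%-25s %8.4f s" % (label, time.time() - tref))
    return result

# Stand-in workload; against a real file this would be, e.g.:
#   timed("indexed readWhere", table.readWhere, 'score > 5')
total = timed("stand-in query", sum, range(1000000))
```

Comparing the same call on the indexed and non-indexed files this way
would at least confirm the gap outside of any profiler overhead.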
Unfortunately, the non_indexed.lprof file that you sent earlier doesn't
seem to work...
I look forward to hearing from you soon...
Be Well
Anthony
On Mon, Jan 23, 2012 at 6:46 AM, Ümit Seren <uemit.se...@gmail.com> wrote:
> Because the profile output is probably not formatted properly in the
> mail, I attached the two line_profiler output files. In addition to
> this, I also added the profile for the _table__whereIndexed() function.
>
>
> On Mon, Jan 23, 2012 at 12:13 PM, Ümit Seren <uemit.se...@gmail.com> wrote:
> > Hi Anthony
> > I did some profiling with line_profiler.
> >
> > Here are the results for a table with an index:
> >
> > Line #   Hits       Time   Per Hit  % Time  Line Contents
> > ==============================================================
> > 1469                                        def readWhere( self, condition, condvars=None, field=None,
> > 1470                                                       start=None, stop=None, step=None ):
> > 1471                                            """
> > 1480                                            """
> > 1481        1          8       8.0     0.0      self._checkFieldIfNumeric(field)
> > 1482
> > 1483        1          2       2.0     0.0      coords = [ p.nrow for p in
> > 1484  6951894   25942594       3.7    92.4                 self._where(condition, condvars, start, stop, step) ]
> > 1485        1          5       5.0     0.0      self._whereCondition = None  # reset the conditions
> > 1486        1          6       6.0     0.0      if len(coords) > 1:
> > 1487        1         13      13.0     0.0          cstart, cstop = coords[0], coords[-1]+1
> > 1488        1          5       5.0     0.0          if cstop - cstart == len(coords):
> > 1489                                                    # Chances for monotonically increasing row values. Refine.
> > 1490        1          4       4.0     0.0              inc_seq = numpy.alltrue(
> > 1491        1    1124713 1124713.0     4.0                  numpy.arange(cstart, cstop) == numpy.array(coords))
> > 1492        1          5       5.0     0.0              if inc_seq:
> > 1493        1    1000172 1000172.0     3.6                  return self.read(cstart, cstop, field=field)
> > 1494                                                return self.readCoordinates(coords, field)
> >
> > File: profile.py
> > Function: function_to_test at line 11
> > Total time: 39.1439 s
> >
> > the profile for the _where() function looks like this:
> >
> > Line #   Hits       Time   Per Hit  % Time  Line Contents
> > ==============================================================
> > 1423                                        def _where( self, condition, condvars,
> > 1424                                                    start=None, stop=None, step=None ):
> > 1425                                            """Low-level counterpart of `self.where()`."""
> > 1426        1          2       2.0     0.0      if profile: tref = time()
> > 1427        1          1       1.0     0.0      if profile: show_stats("Entering table._where", tref)
> > 1428                                            # Adjust the slice to be used.
> > 1429        1         20      20.0     0.0      (start, stop, step) = self._processRangeRead(start, stop, step)
> > 1430        1          2       2.0     0.0      if start >= stop:  # empty range, reset conditions
> > 1431                                                self._useIndex = False
> > 1432                                                self._whereCondition = None
> > 1433                                                return iter([])
> > 1434
> > 1435                                            # Compile the condition and extract usable index conditions.
> > 1436        1        153     153.0     0.0      condvars = self._requiredExprVars(condition, condvars, depth=3)
> > 1437        1       1299    1299.0     0.3      compiled = self._compileCondition(condition, condvars)
> > 1438
> > 1439                                            # Can we use indexes?
> > 1440        1          2       2.0     0.0      if compiled.index_expressions:
> > 1441        1          1       1.0     0.0          chunkmap = _table__whereIndexed(
> > 1442        1     439602  439602.0    99.6              self, compiled, condition, condvars, start, stop, step)
> > 1443        1          5       5.0     0.0          if type(chunkmap) != numpy.ndarray:
> > 1444                                                    # If it is not a NumPy array it should be an iterator
> > 1445                                                    # Reset conditions
> > 1446                                                    self._useIndex = False
> > 1447                                                    self._whereCondition = None
> > 1448                                                    # ...and return the iterator
> > 1449                                                    return chunkmap
> > 1450                                            else:
> > 1451                                                chunkmap = None  # default to an in-kernel query
> > 1452
> > 1453        2          7       3.5     0.0      args = [condvars[param] for param in compiled.parameters]
> > 1454        1          5       5.0     0.0      self._whereCondition = (compiled.function, args)
> > 1455        1        141     141.0     0.0      row = tableExtension.Row(self)
> > 1456        1          2       2.0     0.0      if profile: show_stats("Exiting table._where", tref)
> > 1457        1         27      27.0     0.0      return row._iter(start, stop, step, chunkmap=chunkmap)
> >
> > File: profile.py
> > Function: function_to_test at line 11
> > Total time: 22.4212 s
> >
> >
> > The profile for a non-indexed table looks like this:
> >
> > Line #   Hits       Time   Per Hit  % Time  Line Contents
> > ==============================================================
> > 1469                                        def readWhere( self, condition, condvars=None, field=None,
> > 1470                                                       start=None, stop=None, step=None ):
> > 1480                                            """
> > 1481        1          8       8.0     0.0      self._checkFieldIfNumeric(field)
> > 1482
> > 1483        1          1       1.0     0.0      coords = [ p.nrow for p in
> > 1484  6951894   14060474       2.0    87.2                 self._where(condition, condvars, start, stop, step) ]
> > 1485        1          4       4.0     0.0      self._whereCondition = None  # reset the conditions
> > 1486        1          4       4.0     0.0      if len(coords) > 1:
> > 1487        1         10      10.0     0.0          cstart, cstop = coords[0], coords[-1]+1
> > 1488        1          5       5.0     0.0          if cstop - cstart == len(coords):
> > 1489                                                    # Chances for monotonically increasing row values. Refine.
> > 1490        1          3       3.0     0.0              inc_seq = numpy.alltrue(
> > 1491        1    1104094 1104094.0     6.8                  numpy.arange(cstart, cstop) == numpy.array(coords))
> > 1492        1          4       4.0     0.0              if inc_seq:
> > 1493        1     959634  959634.0     6.0                  return self.read(cstart, cstop, field=field)
> > 1494                                                return self.readCoordinates(coords, field)
> >
> > File: profile.py
> > Function: function_to_test at line 11
> > Total time: 27.5148 s
> >
> > Line #   Hits       Time   Per Hit  % Time  Line Contents
> > ==============================================================
> > 1423                                        def _where( self, condition, condvars,
> > 1424                                                    start=None, stop=None, step=None ):
> > 1425                                            """Low-level counterpart of `self.where()`."""
> > 1426        1          2       2.0     0.1      if profile: tref = time()
> > 1427        1          2       2.0     0.1      if profile: show_stats("Entering table._where", tref)
> > 1428                                            # Adjust the slice to be used.
> > 1429        1         29      29.0     2.1      (start, stop, step) = self._processRangeRead(start, stop, step)
> > 1430        1          2       2.0     0.1      if start >= stop:  # empty range, reset conditions
> > 1431                                                self._useIndex = False
> > 1432                                                self._whereCondition = None
> > 1433                                                return iter([])
> > 1434
> > 1435                                            # Compile the condition and extract usable index conditions.
> > 1436        1        154     154.0    10.9      condvars = self._requiredExprVars(condition, condvars, depth=3)
> > 1437        1       1057    1057.0    75.0      compiled = self._compileCondition(condition, condvars)
> > 1438
> > 1439                                            # Can we use indexes?
> > 1440        1          2       2.0     0.1      if compiled.index_expressions:
> > 1441                                                chunkmap = _table__whereIndexed(
> > 1442                                                    self, compiled, condition, condvars, start, stop, step)
> > 1443                                                if type(chunkmap) != numpy.ndarray:
> > 1444                                                    # If it is not a NumPy array it should be an iterator
> > 1445                                                    # Reset conditions
> > 1446                                                    self._useIndex = False
> > 1447                                                    self._whereCondition = None
> > 1448                                                    # ...and return the iterator
> > 1449                                                    return chunkmap
> > 1450                                            else:
> > 1451        1          2       2.0     0.1          chunkmap = None  # default to an in-kernel query
> > 1452
> > 1453        2          4       2.0     0.3      args = [condvars[param] for param in compiled.parameters]
> > 1454        1          2       2.0     0.1      self._whereCondition = (compiled.function, args)
> > 1455        1        128     128.0     9.1      row = tableExtension.Row(self)
> > 1456        1          2       2.0     0.1      if profile: show_stats("Exiting table._where", tref)
> > 1457        1         23      23.0     1.6      return row._iter(start, stop, step, chunkmap=chunkmap)
> >
> > File: profile.py
> > Function: function_to_test at line 11
> > Total time: 10.7759 s
> >
> >
> > Well, I have the hdf5 file which reproduces the problem. I will try to
> > write a script that creates an hdf5 file showing the same behavior.
> >
> > On Sun, Jan 22, 2012 at 11:20 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> >> Hello Ümit,
> >>
> >> Yes, this is some seriously messed up behavior. I would suggest profiling
> >> (kernprof, line_profiler) to try to figure out what is going on here. This
> >> is counterproductive at the very least....
> >>
> >> My suspicion is that when indexes are used, BOTH the index and the table
> >> values themselves are being compared, rather than ONLY the index. This is
> >> just my suspicion, but it would seem to give this behavior. Profiling would
> >> at least tell us which function or method is the troublemaker.
> >>
> >> Do you have a script that reproduces this as a whole?
> >>
> >> Be Well
> >> Anthony
> >>
> >> On Sat, Jan 21, 2012 at 7:23 AM, Ümit Seren <uemit.se...@gmail.com> wrote:
> >>>
> >>> I recently used ptrepack to compact my hdf5 file and forgot to activate
> >>> the option to propagate indexes.
> >>> Just out of curiosity I decided to compare performance between the two
> >>> tables (one with index and one without) for some queries.
> >>>
> >>> The table structure looks like this:
> >>>
> >>> "gene_mid_pos": UInt32Col(shape=(), dflt=0, pos=3),
> >>> "gene_start": UInt32Col(shape=(), dflt=0, pos=4),
> >>> "mac": Int32Col(shape=(), dflt=0, pos=5),
> >>> "maf": Float32Col(shape=(), dflt=0.0, pos=6),
> >>> "perc_var_expl": Float32Col(shape=(), dflt=0.0, pos=7),
> >>> "score": Float32Col(shape=(), dflt=0.0, pos=8),
> >>> "snp_chr": UInt8Col(shape=(), dflt=0, pos=9),
> >>> "snp_pos": UInt32Col(shape=(), dflt=0, pos=10)}
> >>> byteorder := 'little'
> >>> chunkshape := (4854,)
> >>> autoIndex := True
> >>> colindexes := {
> >>> "gene": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "gene_start": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "snp_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "gene_mid_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "perc_var_expl": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "mac": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "score": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "snp_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "maf": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "gene_end": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
> >>> "gene_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
> >>>
> >>>
> >>> and contains around 7 million rows (sorted by score in descending
> >>> order)
> >>>
> >>> To verify the results, I also created a second hdf5 file with ptrepack
> >>> where I propagated the indexes, just to make sure that compacting isn't
> >>> the root of the performance issue.
> >>> Here are my results:
> >>>
> >>> orig_table ... original table with indexes
> >>> com_table_no_index = compacted table without indexes
> >>> com_table_with_index = compacted table with indexes
> >>>
> >>> table.readWhere('score > 5')
> >>>
> >>> orig_table: 19.4s
> >>> com_table_with_index: 18.7s
> >>> com_table_no_index: 7.9s
> >>>
> >>>
> >>> table.readWhere('gene_chr >2') (really low selectivity because gene_chr
> >>> is between 1 and 5)
> >>>
> >>> orig_table: 46.3s
> >>> com_table_with_index: 44.8s
> >>> com_table_no_index: 37.8s
> >>>
> >>> table.readWhere('(maf >0.2 and maf < 0.9 )')
> >>>
> >>> orig_table: 19.4 s
> >>> com_table_with_index: 19.2 s
> >>> com_table_no_index: 9.31 s
> >>>
> >>> Both queries have a hit rate of 90%
> >>>
> >>> table.readWhere('(maf >0.21 and maf < 0.22 )')
> >>> orig_table: 35.8 s
> >>> com_table_with_index: 35.7 s
> >>> com_table_no_index: 29.3 s
> >>>
> >>> I also tried a query with a smaller hit rate (10%):
> >>>
> >>>
> >>> table.readWhere('(gene_start >2000) & (gene_end<900000)')
> >>>
> >>> orig_table: 5.67s
> >>> com_table_with_index: 5.52s
> >>> com_table_no_index: 4.22s
> >>>
> >>>
> >>> table.readWhere('(snp_chr ==2 )') (really low selectivity because
> >>> snp_chr is between 1 and 5)
> >>>
> >>> orig_table: 12 s
> >>> com_table_with_index: 12.3 s
> >>> com_table_no_index: 9.16 s
> >>>
> >>>
> >>>
> >>> I checked with willQueryUseIndexing whether the indexes are used, and
> >>> in all cases they were.
> >>>
> >>> Here are some observations:
> >>>
> >>> - Querying a column with high selectivity (score, maf) runs more than
> >>> 2x faster on a table without an index than on one with an index.
> >>> - Querying a column with low selectivity (chr), or with a small
> >>> hit rate, still shows a difference, but it isn't as large as in the
> >>> first case.
> >>>
> >>>
> >>> Does anyone know why querying an indexed table is slower than querying
> >>> an un-indexed one?
> >>>
> >>> Thanks in advance
> >>>
> >>> I am using pytables 2.3
> >>>
> >>> cheers
> >>> Ümit
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> Try before you buy = See our experts in action!
> >>> The most comprehensive online learning library for Microsoft developers
> >>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3,
> MVC3,
> >>> Metro Style Apps, more. Free future releases when you subscribe now!
> >>> http://p.sf.net/sfu/learndevnow-dev2
> >>> _______________________________________________
> >>> Pytables-users mailing list
> >>> Pytables-users@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >>
> >>
> >>
> >>
> >>
>
>
>
>