Hello Ümit,

Yes this is some seriously messed up behavior.  I would suggest profiling
(kern_prof, line_profiler) to try to figure out what is going on here.  The
is counter productive at the very least....

My suspicion is that when indexes are used that BOTH the index and the
table values themselves are being compared rather than ONLY the index being
compared.  This is just my suspicion, but it would seem to give this
behavior.  Profiling would at least tell us which function or method is the
trouble maker.

Do you have a script that reproduces this as a whole?

Be Well
Anthony

On Sat, Jan 21, 2012 at 7:23 AM, Ümit Seren <uemit.se...@gmail.com> wrote:

> I recently used ptrepack to compact my hdf5 file and forgot to active
> the options to propagate indexes.
> Just out of curiosity I decided to compare performance between the two
> tables (one with index and one without) for some queries.
>
> The table structure looks like this:
>
>  "gene_mid_pos": UInt32Col(shape=(), dflt=0, pos=3),
>  "gene_start": UInt32Col(shape=(), dflt=0, pos=4),
>  "mac": Int32Col(shape=(), dflt=0, pos=5),
>  "maf": Float32Col(shape=(), dflt=0.0, pos=6),
>  "perc_var_expl": Float32Col(shape=(), dflt=0.0, pos=7),
>  "score": Float32Col(shape=(), dflt=0.0, pos=8),
>  "snp_chr": UInt8Col(shape=(), dflt=0, pos=9),
>  "snp_pos": UInt32Col(shape=(), dflt=0, pos=10)}
>  byteorder := 'little'
>  chunkshape := (4854,)
>  autoIndex := True
>  colindexes := {
>    "gene": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "gene_start": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "snp_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "gene_mid_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "perc_var_expl": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "mac": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "score": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "snp_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "maf": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "gene_end": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
>    "gene_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
>
>
> and contains around 7 million rows (sorted after score in descending order)
>
> To verify the results I also created a second hdf5 file with ptrepack
> where I propagate the indexes just to make sure that compacting isn't
> the root of the performance issue.
> Here are my results:
>
> orig_table  ... original table with indexes
> com_table_no_index = compacted table without indexes
> com_table_with_index = compacted table with indexes
>
> table.readWhere('score > 5')
>
> orig_table:                  19.4s
> com_table_with_index: 18.7s
> com_table_no_index:     7.9s
>
>
> table.readWhere('gene_chr >2') (really low selectivity because snp_chr
> is between 1 and 5)
>
> orig_table:                   46.3s
> com_table_with_index: 44.8s
> com_table_no_index:    37.8s
>
> table.readWhere('(maf >0.2 and maf < 0.9 )')
>
> orig_table:                  19.4 s
> com_table_with_index: 19.2 s
> com_table_no_index:    9.31 s
>
> Both queries have a hit rate of 90%
>
> table.readWhere('(maf >0.21 and maf < 0.22 )'
> orig_table:                  35.8 s
> com_table_with_index: 35.7  s
> com_table_no_index:    29.3 s
>
> I tried also a query with a smaller hit-rate (10%)
>
>
> table.readWhere('(gene_start >2000) & (gene_end<900000)')
>
> orig_table:                   5.67s
> com_table_with_index: 5.52s
> com_table_no_index:   4.22s
>
>
> table.readWhere('(snp_chr ==2 )') (really low selectivity because
> snp_chr is between 1 and 5)
>
> orig_table:                  12 s
> com_table_with_index: 12.3 s
> com_table_no_index:    9.16 s
>
>
>
> I checked with willQueryUseIndexing if the indexes are used and in all
> cases they were.
>
> Here are some observations:
>
> - Querying for a column with high selectivity (score,maf) runs more
> than 2x fast on a table without index compared to one with an index.
> - Querying for a column with low selectivity (chr) or with small
> hit-rate still shows a difference but isn't as high as in the first
> case.
>
>
> Anyone knows why querying an indexed table is slower than an un-indexed
> one?
>
> Thanks in advance
>
> I am using pytables 2.3
>
> cheers
> Ümit
>
>
> ------------------------------------------------------------------------------
> Try before you buy = See our experts in action!
> The most comprehensive online learning library for Microsoft developers
> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
> Metro Style Apps, more. Free future releases when you subscribe now!
> http://p.sf.net/sfu/learndevnow-dev2
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Reply via email to