I recently used ptrepack to compact my hdf5 file and forgot to active
the options to propagate indexes.
Just out of curiosity I decided to compare performance between the two
tables (one with index and one without) for some queries.

The table structure looks like this:

  "gene_mid_pos": UInt32Col(shape=(), dflt=0, pos=3),
  "gene_start": UInt32Col(shape=(), dflt=0, pos=4),
  "mac": Int32Col(shape=(), dflt=0, pos=5),
  "maf": Float32Col(shape=(), dflt=0.0, pos=6),
  "perc_var_expl": Float32Col(shape=(), dflt=0.0, pos=7),
  "score": Float32Col(shape=(), dflt=0.0, pos=8),
  "snp_chr": UInt8Col(shape=(), dflt=0, pos=9),
  "snp_pos": UInt32Col(shape=(), dflt=0, pos=10)}
  byteorder := 'little'
  chunkshape := (4854,)
  autoIndex := True
  colindexes := {
    "gene": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "gene_start": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "snp_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "gene_mid_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "perc_var_expl": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "mac": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "score": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "snp_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "maf": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "gene_end": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "gene_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False}


and contains around 7 million rows (sorted after score in descending order)

To verify the results I also created a second hdf5 file with ptrepack
where I propagate the indexes just to make sure that compacting isn't
the root of the performance issue.
Here are my results:

orig_table  ... original table with indexes
com_table_no_index = compacted table without indexes
com_table_with_index = compacted table with indexes

table.readWhere('score > 5')

orig_table:                  19.4s
com_table_with_index: 18.7s
com_table_no_index:     7.9s


table.readWhere('gene_chr >2') (really low selectivity because snp_chr
is between 1 and 5)

orig_table:                   46.3s
com_table_with_index: 44.8s
com_table_no_index:    37.8s

table.readWhere('(maf >0.2 and maf < 0.9 )')

orig_table:                  19.4 s
com_table_with_index: 19.2 s
com_table_no_index:    9.31 s

Both queries have a hit rate of 90%

table.readWhere('(maf >0.21 and maf < 0.22 )'
orig_table:                  35.8 s
com_table_with_index: 35.7  s
com_table_no_index:    29.3 s

I tried also a query with a smaller hit-rate (10%)


table.readWhere('(gene_start >2000) & (gene_end<900000)')

orig_table:                   5.67s
com_table_with_index: 5.52s
com_table_no_index:   4.22s


table.readWhere('(snp_chr ==2 )') (really low selectivity because
snp_chr is between 1 and 5)

orig_table:                  12 s
com_table_with_index: 12.3 s
com_table_no_index:    9.16 s



I checked with willQueryUseIndexing if the indexes are used and in all
cases they were.

Here are some observations:

- Querying for a column with high selectivity (score,maf) runs more
than 2x fast on a table without index compared to one with an index.
- Querying for a column with low selectivity (chr) or with small
hit-rate still shows a difference but isn't as high as in the first
case.


Anyone knows why querying an indexed table is slower than an un-indexed one?

Thanks in advance

I am using pytables 2.3

cheers
Ümit

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Reply via email to