I recently used ptrepack to compact my hdf5 file and forgot to active
the options to propagate indexes.
Just out of curiosity I decided to compare performance between the two
tables (one with index and one without) for some queries.
The table structure looks like this:
"gene_mid_pos": UInt32Col(shape=(), dflt=0, pos=3),
"gene_start": UInt32Col(shape=(), dflt=0, pos=4),
"mac": Int32Col(shape=(), dflt=0, pos=5),
"maf": Float32Col(shape=(), dflt=0.0, pos=6),
"perc_var_expl": Float32Col(shape=(), dflt=0.0, pos=7),
"score": Float32Col(shape=(), dflt=0.0, pos=8),
"snp_chr": UInt8Col(shape=(), dflt=0, pos=9),
"snp_pos": UInt32Col(shape=(), dflt=0, pos=10)}
byteorder := 'little'
chunkshape := (4854,)
autoIndex := True
colindexes := {
"gene": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"gene_start": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"snp_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"gene_mid_pos": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"perc_var_expl": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"mac": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"score": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"snp_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"maf": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"gene_end": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"gene_chr": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
and contains around 7 million rows (sorted after score in descending order)
To verify the results I also created a second hdf5 file with ptrepack
where I propagate the indexes just to make sure that compacting isn't
the root of the performance issue.
Here are my results:
orig_table ... original table with indexes
com_table_no_index = compacted table without indexes
com_table_with_index = compacted table with indexes
table.readWhere('score > 5')
orig_table: 19.4s
com_table_with_index: 18.7s
com_table_no_index: 7.9s
table.readWhere('gene_chr >2') (really low selectivity because snp_chr
is between 1 and 5)
orig_table: 46.3s
com_table_with_index: 44.8s
com_table_no_index: 37.8s
table.readWhere('(maf >0.2 and maf < 0.9 )')
orig_table: 19.4 s
com_table_with_index: 19.2 s
com_table_no_index: 9.31 s
Both queries have a hit rate of 90%
table.readWhere('(maf >0.21 and maf < 0.22 )'
orig_table: 35.8 s
com_table_with_index: 35.7 s
com_table_no_index: 29.3 s
I tried also a query with a smaller hit-rate (10%)
table.readWhere('(gene_start >2000) & (gene_end<900000)')
orig_table: 5.67s
com_table_with_index: 5.52s
com_table_no_index: 4.22s
table.readWhere('(snp_chr ==2 )') (really low selectivity because
snp_chr is between 1 and 5)
orig_table: 12 s
com_table_with_index: 12.3 s
com_table_no_index: 9.16 s
I checked with willQueryUseIndexing if the indexes are used and in all
cases they were.
Here are some observations:
- Querying for a column with high selectivity (score,maf) runs more
than 2x fast on a table without index compared to one with an index.
- Querying for a column with low selectivity (chr) or with small
hit-rate still shows a difference but isn't as high as in the first
case.
Anyone knows why querying an indexed table is slower than an un-indexed one?
Thanks in advance
I am using pytables 2.3
cheers
Ümit
------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users