RE: [mart-dev] query performance on sequences

Junjun Zhang Fri, 10 Jul 2009 14:24:27 -0700

Hi Rochak,

Sorry for not getting you back sooner. Based on the information you provided 
below, all the necessary indexes are there, there seems nothing quick we can do 
to speed up sequence queries significantly. Of course, we are happy to look 
into this issue more closely. BTW, how many rows do you have in these tables: 
phytozome__transcript__main, phytozome_structure__structure__main and 
sequence_phytozome__dna_chunks__main? It should be also helpful if you could 
find out what proportion of the time each of the 3 steps (you described below) 
takes while retrieving 30k peptides.

In the meantime, if your mart users still have concerns about spending 5 
minutes on getting 30k sequences, you can probably advice them to use the 
'Email notification' option on the MartView GUI.

We will keep you updated once we have something concrete.

Please feel free to contact us should you have any further questions.

Kind regards,
Junjun

________________________________
From: [email protected] [mailto:[email protected]] On Behalf Of 
Rochak Neupane
Sent: Wednesday, July 08, 2009 7:04 PM
To: Syed Haider
Cc: David M. Goodstein; BioMart mart-dev
Subject: Re: [mart-dev] query performance on sequences

>From looking at the logs, it looks like when grabbing sequences, BioMart makes 
>3 different types of queries.

1) query main table. transcript_id_key from transcript__main table (exportable) 
- this is indexed.
Explain on query generated by biomart.

explain SELECT main.transcript_id_key, main.organism_name, main.gene_name FROM 
phytozome_mart.phytozome__transcript__main main WHERE (main.transcript_id_key = 
'16429347') AND (main.organism_id = '113') LIMIT 200;
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys                              
 | key     | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
|  1 | SIMPLE      | main  | const | 
PRIMARY,transcript_id_key,org_id,org_pac_id | PRIMARY | 4       | const |    1 
|       |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+

2) query structure table for coordinates, using transcript_id_key from step 1. 
transcript_id_key is indexed in this table as well.
mysql> explain SELECT main.transcript_id_key, main.transcript_id_key, 
main.chr_name, main.exon_cds_start, main.exon_cds_end, main.exon_chrom_strand, 
main.exon_phase, main.exon_rank, main.organism_id FROM 
phytozome_mart.phytozome_structure__structure__main main WHERE 
(main.transcript_id_key IN('16429347')) LIMIT 200;
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys     | key               | 
key_len | ref   | rows | Extra       |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | main  | ref  | transcript_id_key | transcript_id_key | 5   
    | const |    6 | Using where |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+

3) finally, using the locations from step 2, query sequence mart using 
chr_name, chr_start, and org_id. all three of these are indexed (index below)
BioMart.Dataset.GenomicSequence.DNAAdaptor:176:INFO> QUERY FULL SQL:  select 
sequence from sequence_phytozome__dna_chunks__main where chr_start = ? and 
chr_name = ? and org_id = ?
BioMart.Dataset.GenomicSequence.DNAAdaptor:177:INFO> QUERY SUB SQL:  select 
substring(sequence, ?, ?) from sequence_phytozome__dna_chunks__main where 
chr_start = ? and chr_name = ? and org_id = ?

Trying an explain with values filled in:
explain select substring(sequence, 29665, 47) from 
sequence_mart.sequence_phytozome__dna_chunks__main where chr_start = 200001 and 
chr_name = 'scaffold_406' and org_id = 77;
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
| id | select_type | table                                | type | 
possible_keys                 | key  | key_len | ref               | rows | 
Extra       |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
|  1 | SIMPLE      | sequence_phytozome__dna_chunks__main | ref  | 
chr_name,chr_start,org_id,nsi | nsi  | 52      | const,const,const |    1 | 
Using where |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+

mysql> show index from sequence_mart.sequence_phytozome__dna_chunks__main where 
key_name = 'nsi';
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                                | Non_unique | Key_name | Seq_in_index | 
Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | 
Comment |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            1 | 
chr_name    | A         |       41014 |     NULL | NULL   |      | BTREE      | 
        |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            2 | 
chr_start   | A         |      123043 |     NULL | NULL   | YES  | BTREE      | 
        |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            3 | 
org_id      | A         |      123043 |     NULL | NULL   | YES  | BTREE      | 
        |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

seems like proper indexes are used when running these queries. Any way to 
improve these indexes? Or any other ways to improve performance?

Thanks,
rochak

On Tue, Jul 7, 2009 at 9:47 PM, Syed Haider 
<[email protected]<mailto:[email protected]>> wrote:
Hi David,

it isnt the processing of coordinates which takes major proportion of the time, 
usually its the time taken by database to return the results. Could you please 
check if indices are in place for the table that serves the sequence 
coordinates ?

thanks
Syed

David M. Goodstein wrote:

On 7 Jul 2009, at 12:23, Syed Haider wrote:

Hi Rochak,

Rochak Neupane wrote:
When querying for sequences on a dataset set, without any filters (so as to 
grab complete set of sequences), marts seem to be quite slow. Pulling peptides 
for human, for example, took an excess of 7 minutes from ensembl mart. Grabbing 
peptide sequences from Caenorhabditis elegans (wormbase db, gene dataset from 
biomart.org<http://biomart.org> <http://biomart.org>) also took about 7 or so 
minutes for a file that turns out to be 9MB.
Our own mart is quite slow when querying a complete set of sequences from an 
organism. Is it typical for biomart to take 7-8minutes + when querying for 
whole genome sequences?

a- are you using GenomicSequence to retrieve sequences ?

b- do you have ORDER BY property set on sequence exportables (if 'a' is true). 
Setting ORDER BY slows down the response considerably and its only required to 
cope with inconsistencies in the row order that should really be fixed on the 
mart (database) construction level.

Best,
Syed

Removing the ORDER BY does improve performance (from 15 minutes down to 5 
minutes for an unfiltered FASTA grab of approx 30k peptides), but still not 
really something that's user-tolerable.  Is that really the expected behavior?

--David

David M. Goodstein
Joint Genome Institute / Lawrence Berkeley National Lab
Center for Integrative Genomics / UCBerkeley
http://www.phytozome.net

Thanks,
rochak

RE: [mart-dev] query performance on sequences

Reply via email to