phytozome__transcript__main has 607650 rows,
phytozome_structure__structure__main has 2970857 rows, and
sequence_phytozome__dna_chunks__main has 123043 rows.
As the batch size grows, data retrieval slows down. For example, selecting 12800 rows from the transcript table is very quick; the next step, getting a maximum of 50000 coordinates from the structure table, took about 3 seconds; and the following step, actually getting the sequences, takes the majority of the time, more than 3 minutes.
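As a back-of-the-envelope check (the per-query latency below is an assumption, not a measurement), per-row round trips alone can account for a runtime on that order:

```python
# Assumed-latency sketch: if each of the 50000 coordinate rows triggers one
# single-row substring query, even a modest 4 ms round trip per query adds
# up to minutes of wall-clock time.
rows = 50000
latency_s = 0.004  # assumption: 4 ms per query round trip
total_s = rows * latency_s
print(total_s, total_s / 60)  # roughly 200 seconds, i.e. over 3 minutes
```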
A few lines of log immediately following the query for coordinates:
BioMart.DatasetI:1174:DEBUG> Date: Tue Jul 14 15:56:50 PDT 2009
BioMart.DatasetI:1175:DEBUG> Got results
BioMart.DatasetI:1206:WARN> Attribute merge using linkName: transcript_id
BioMart.DatasetI:1207:WARN> Before merge: 50000
BioMart.DatasetI:1208:INFO> Date: Tue Jul 14 15:56:50 PDT 2009
BioMart.DatasetI:1296:DEBUG> Importable size: 1
BioMart.DatasetI:1297:DEBUG> Link name: transcript_id
BioMart.DatasetI:1312:DEBUG> Appending 1881276
BioMart.DatasetI:1316:DEBUG> Final key string is: 1881276
......
BioMart.DatasetI:1509:DEBUG> Finished with rows: 50000
BioMart.DatasetI:1213:WARN> After merge: 50000
BioMart.DatasetI:1214:INFO> Date: Tue Jul 14 15:57:02 PDT 2009
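For context, the "Attribute merge using linkName" step logged above is essentially a hash join on the link key. A minimal sketch of that idea in Python, with made-up data and a hypothetical merge_attributes helper (not BioMart's actual code):

```python
# Hypothetical sketch of an attribute merge keyed on a link column, similar
# in spirit to BioMart's "Attribute merge using linkName: transcript_id".
def merge_attributes(left_rows, right_rows, key):
    """Append right-hand attributes onto left rows sharing the same key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left_rows:
        # Rows with no match keep only their original attributes.
        for extra in index.get(row[key], [{}]):
            combined = dict(row)
            combined.update(extra)
            merged.append(combined)
    return merged

transcripts = [{"transcript_id": 1881276, "gene_name": "geneA"}]
structures = [{"transcript_id": 1881276, "exon_rank": 1}]
print(merge_attributes(transcripts, structures, "transcript_id"))
```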
So merging took about 12 seconds. The next step queries for the sequence for each of the records above:
BioMart.DatasetI:1213:WARN> After merge: 50000
BioMart.DatasetI:1214:INFO> Date: Tue Jul 14 15:57:02 PDT 2009
BioMart.DatasetI:1265:DEBUG> Attribute hash
BioMart.DatasetI:1266:DEBUG> Before hash: 50000
BioMart.DatasetI:1273:DEBUG> After hash: 50000
BioMart.DatasetI:1280:DEBUG> Returning defined has_data
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 131 (start 143411 coord 43411) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 214 (start 142912 coord 42912) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 117 (start 142490 coord 42490) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 93 (start 142044 coord 42044) orgId 77
....
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 27 (start 1509796 coord 9796) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 633 (start 1509899 coord 9899) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 129 (start 1510789 coord 10789) orgId 77
BioMart.DatasetI:1174:DEBUG> Date: Tue Jul 14 16:00:25 PDT 2009
That took more than 3 minutes. I'm not sure how much of that is log4perl writing to file, but since BioMart takes just as long on our version with the log level set to ERROR, this seems to be where most of the time is spent.
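The chunk arithmetic in those QUERY SUBSTRING lines can be reproduced directly. Assuming a fixed chunk size of 100000 (which the logged chunk starts 100001 and 1500001 suggest), the within-chunk coordinate is start - chunk_start + 1:

```python
# Chunk size inferred from the chunk starts in the log; an assumption, not
# a value taken from the BioMart schema itself.
CHUNK_SIZE = 100000

def chunk_coords(start):
    """Map a 1-based genomic start to (chunk_start, within-chunk coord)."""
    chunk_start = ((start - 1) // CHUNK_SIZE) * CHUNK_SIZE + 1
    return chunk_start, start - chunk_start + 1

# Values from the log: start 143411 lives in the chunk beginning at 100001,
# at within-chunk coordinate 43411; start 1509796 maps to (1500001, 9796).
print(chunk_coords(143411))   # -> (100001, 43411)
print(chunk_coords(1509796))  # -> (1500001, 9796)
```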
rochak
On Jul 10, 2009, at 2:12 PM, Junjun Zhang wrote:
Hi Rochak,
Sorry for not getting back to you sooner. Based on the information you provided below, all the necessary indexes are there, so there seems to be nothing quick we can do to speed up sequence queries significantly.
Of course, we are happy to look into this issue more closely. By the way, how many rows do you have in these tables: phytozome__transcript__main, phytozome_structure__structure__main, and sequence_phytozome__dna_chunks__main? It would also be helpful if you could find out what proportion of the time each of the 3 steps (you described below) takes while retrieving 30k peptides.
In the meantime, if your mart users still have concerns about spending 5 minutes on getting 30k sequences, you could probably advise them to use the 'Email notification' option in the MartView GUI.
We will keep you updated once we have something concrete.
Please feel free to contact us should you have any further questions.
Kind regards,
Junjun
From: [email protected] [mailto:[email protected]] On
Behalf Of Rochak Neupane
Sent: Wednesday, July 08, 2009 7:04 PM
To: Syed Haider
Cc: David M. Goodstein; BioMart mart-dev
Subject: Re: [mart-dev] query performance on sequences
From looking at the logs, it appears that when grabbing sequences, BioMart makes 3 different types of queries.
1) Query the main table for transcript_id_key from the transcript__main table (exportable) - this is indexed.
EXPLAIN on the query generated by BioMart:
explain SELECT main.transcript_id_key, main.organism_name,
main.gene_name FROM phytozome_mart.phytozome__transcript__main main
WHERE (main.transcript_id_key = '16429347') AND (main.organism_id =
'113') LIMIT 200;
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys                               | key     | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
|  1 | SIMPLE      | main  | const | PRIMARY,transcript_id_key,org_id,org_pac_id | PRIMARY | 4       | const |    1 |       |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
2) Query the structure table for coordinates, using transcript_id_key from step 1. transcript_id_key is indexed in this table as well.
mysql> explain SELECT main.transcript_id_key,
main.transcript_id_key, main.chr_name, main.exon_cds_start,
main.exon_cds_end, main.exon_chrom_strand, main.exon_phase,
main.exon_rank, main.organism_id FROM
phytozome_mart.phytozome_structure__structure__main main WHERE
(main.transcript_id_key IN('16429347')) LIMIT 200;
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys     | key               | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | main  | ref  | transcript_id_key | transcript_id_key | 5       | const |    6 | Using where |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
3) Finally, using the locations from step 2, query the sequence mart using chr_name, chr_start, and org_id. All three of these are indexed (index shown below).
BioMart.Dataset.GenomicSequence.DNAAdaptor:176:INFO> QUERY FULL SQL: select sequence from sequence_phytozome__dna_chunks__main where chr_start = ? and chr_name = ? and org_id = ?
BioMart.Dataset.GenomicSequence.DNAAdaptor:177:INFO> QUERY SUB SQL: select substring(sequence, ?, ?) from sequence_phytozome__dna_chunks__main where chr_start = ? and chr_name = ? and org_id = ?
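The two queries above imply a chunked-sequence scheme: the genome is stored in fixed-size chunks keyed by (chr_name, chr_start, org_id), and each exon is carved out with a substring call. A toy reconstruction using SQLite (table and column names follow the log; the data and chunk length are made up):

```python
import sqlite3

# Toy model of the chunked-sequence storage the logged queries suggest.
# Table/column names follow the log; the sequence data here is fabricated.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sequence_phytozome__dna_chunks__main
               (chr_name TEXT, chr_start INTEGER, org_id INTEGER, sequence TEXT)""")
con.execute("INSERT INTO sequence_phytozome__dna_chunks__main VALUES (?,?,?,?)",
            ("scaffold_177", 100001, 77, "ACGT" * 25))  # 100-bp stand-in chunk

def fetch_exon(chr_name, chunk_start, coord, length, org_id):
    # Mirrors: select substring(sequence, ?, ?) from ...
    #          where chr_start = ? and chr_name = ? and org_id = ?
    row = con.execute(
        "SELECT substr(sequence, ?, ?) FROM sequence_phytozome__dna_chunks__main "
        "WHERE chr_start = ? AND chr_name = ? AND org_id = ?",
        (coord, length, chunk_start, chr_name, org_id)).fetchone()
    return row[0]

print(fetch_exon("scaffold_177", 100001, 5, 8, 77))  # -> ACGTACGT
```

Each exon in the log maps to one such single-row query, which is why the sequence step scales with the exon count rather than the transcript count.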
Trying an explain with values filled in:
explain select substring(sequence, 29665, 47) from
sequence_mart.sequence_phytozome__dna_chunks__main where chr_start
= 200001 and chr_name = 'scaffold_406' and org_id = 77;
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
| id | select_type | table                                | type | possible_keys                 | key  | key_len | ref               | rows | Extra       |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
|  1 | SIMPLE      | sequence_phytozome__dna_chunks__main | ref  | chr_name,chr_start,org_id,nsi | nsi  | 52      | const,const,const |    1 | Using where |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
mysql> show index from
sequence_mart.sequence_phytozome__dna_chunks__main where key_name =
'nsi';
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                                | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            1 | chr_name    | A         |       41014 | NULL     | NULL   |      | BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            2 | chr_start   | A         |      123043 | NULL     | NULL   | YES  | BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            3 | org_id      | A         |      123043 | NULL     | NULL   | YES  | BTREE      |         |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
It seems like the proper indexes are used when running these queries. Is there any way to improve these indexes, or any other way to improve performance?
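One pattern in the QUERY SUBSTRING log is that consecutive exon queries often hit the same chunk (scaffold_177 / 100001 appears four times in a row). A speculative client-side idea, sketched below with hypothetical names, would be to fetch each chunk once and take the substrings locally instead of issuing one round trip per exon:

```python
# Speculative sketch (hypothetical class and callback, not BioMart code):
# cache the most recently fetched chunk so that consecutive exons on the
# same chunk need no extra database query.
class ChunkCache:
    def __init__(self, fetch_chunk):
        self.fetch_chunk = fetch_chunk  # callable returning a full chunk sequence
        self.key = None
        self.seq = None

    def substring(self, chr_name, chunk_start, org_id, coord, length):
        key = (chr_name, chunk_start, org_id)
        if key != self.key:  # only hit the database when the chunk changes
            self.key = key
            self.seq = self.fetch_chunk(*key)
        return self.seq[coord - 1:coord - 1 + length]  # coord is 1-based

calls = []
def fake_fetch(chr_name, chunk_start, org_id):
    calls.append((chr_name, chunk_start, org_id))
    return "ACGT" * 25  # 100-bp stand-in chunk

cache = ChunkCache(fake_fetch)
cache.substring("scaffold_177", 100001, 77, 5, 8)
cache.substring("scaffold_177", 100001, 77, 20, 4)
print(len(calls))  # -> 1: two exons, one database round trip
```

This only illustrates the access pattern; whether it would help in practice depends on where DNAAdaptor issues its queries.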
Thanks,
rochak
On Tue, Jul 7, 2009 at 9:47 PM, Syed Haider <[email protected]> wrote:
Hi David,
it isn't the processing of coordinates that takes the major proportion of the time; usually it's the time taken by the database to return the results. Could you please check whether indices are in place for the table that serves the sequence coordinates?
thanks
Syed
David M. Goodstein wrote:
On 7 Jul 2009, at 12:23, Syed Haider wrote:
Hi Rochak,
Rochak Neupane wrote:
When querying for sequences on a dataset without any filters (so as to grab the complete set of sequences), marts seem to be quite slow. Pulling peptides for human, for example, took in excess of 7 minutes from the Ensembl mart. Grabbing peptide sequences for Caenorhabditis elegans (wormbase db, gene dataset from biomart.org) also took about 7 or so minutes for a file that turns out to be 9MB.
Our own mart is quite slow when querying a complete set of sequences from an organism. Is it typical for BioMart to take 7-8 minutes or more when querying for whole-genome sequences?
a- Are you using GenomicSequence to retrieve sequences?
b- Do you have the ORDER BY property set on sequence exportables (if 'a' is true)? Setting ORDER BY slows down the response considerably, and it's only required to cope with inconsistencies in row order that should really be fixed at the mart (database) construction level.
Best,
Syed
Removing the ORDER BY does improve performance (from 15 minutes down to 5 minutes for an unfiltered FASTA grab of approx 30k peptides), but that's still not really user-tolerable. Is that really the expected behavior?
--David
David M. Goodstein
Joint Genome Institute / Lawrence Berkeley National Lab
Center for Integrative Genomics / UCBerkeley
http://www.phytozome.net
Thanks,
rochak