phytozome__transcript__main has 607650 rows,
phytozome_structure__structure__main has 2970857 rows, and
sequence_phytozome__dna_chunks__main has 123043 rows.
As the batch size grows, data retrieval slows down. For example, selecting 12800 rows from the transcript table is very quick; the next step, getting a maximum of 50000 coordinates from the structure table, took about 3 seconds; and the following step, actually getting the sequences, takes the majority of the time, more than 3 minutes.
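As a back-of-the-envelope check (the per-query latency below is an assumption, not a measurement), per-row round trips alone can account for a runtime on that order:

```python
# Assumed-latency sketch: if each of the 50000 coordinate rows triggers one
# single-row substring query, even a modest 4 ms round trip per query adds
# up to minutes of wall-clock time.
rows = 50000
latency_s = 0.004  # assumption: 4 ms per query round trip
total_s = rows * latency_s
print(total_s, total_s / 60)  # roughly 200 seconds, i.e. over 3 minutes
```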
A few lines of log immediately following the query for coordinates:
BioMart.DatasetI:1174:DEBUG> Date: Tue Jul 14 15:56:50 PDT 2009
BioMart.DatasetI:1175:DEBUG> Got results
BioMart.DatasetI:1206:WARN> Attribute merge using linkName: transcript_id
BioMart.DatasetI:1207:WARN> Before merge: 50000
BioMart.DatasetI:1208:INFO> Date: Tue Jul 14 15:56:50 PDT 2009
BioMart.DatasetI:1296:DEBUG> Importable size: 1
BioMart.DatasetI:1297:DEBUG> Link name: transcript_id
BioMart.DatasetI:1312:DEBUG> Appending 1881276
BioMart.DatasetI:1316:DEBUG> Final key string is: 1881276
......
BioMart.DatasetI:1509:DEBUG> Finished with rows: 50000
BioMart.DatasetI:1213:WARN> After merge: 50000
BioMart.DatasetI:1214:INFO> Date: Tue Jul 14 15:57:02 PDT 2009
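For context, the "Attribute merge using linkName" step logged above is essentially a hash join on the link key. A minimal sketch of that idea in Python, with made-up data and a hypothetical merge_attributes helper (not BioMart's actual code):

```python
# Hypothetical sketch of an attribute merge keyed on a link column, similar
# in spirit to BioMart's "Attribute merge using linkName: transcript_id".
def merge_attributes(left_rows, right_rows, key):
    """Append right-hand attributes onto left rows sharing the same key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left_rows:
        # Rows with no match keep only their original attributes.
        for extra in index.get(row[key], [{}]):
            combined = dict(row)
            combined.update(extra)
            merged.append(combined)
    return merged

transcripts = [{"transcript_id": 1881276, "gene_name": "geneA"}]
structures = [{"transcript_id": 1881276, "exon_rank": 1}]
print(merge_attributes(transcripts, structures, "transcript_id"))
```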
So merging took about 12 seconds. The next step queries for the sequence for each of the records above:
BioMart.DatasetI:1213:WARN> After merge: 50000
BioMart.DatasetI:1214:INFO> Date: Tue Jul 14 15:57:02 PDT 2009
BioMart.DatasetI:1265:DEBUG> Attribute hash
BioMart.DatasetI:1266:DEBUG> Before hash: 50000
BioMart.DatasetI:1273:DEBUG> After hash: 50000
BioMart.DatasetI:1280:DEBUG> Returning defined has_data
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 131 (start 143411 coord 43411) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 214 (start 142912 coord 42912) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 117 (start 142490 coord 42490) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 93 (start 142044 coord 42044) orgId 77
....
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 27 (start 1509796 coord 9796) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 633 (start 1509899 coord 9899) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 129 (start 1510789 coord 10789) orgId 77
BioMart.DatasetI:1174:DEBUG> Date: Tue Jul 14 16:00:25 PDT 2009
That took more than 3 minutes. I'm not sure how much of that is log4perl writing to file, but since BioMart takes just as long on our version with the log level set to ERROR, this seems to be where most of the time is spent.
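The chunk arithmetic in those QUERY SUBSTRING lines can be reproduced directly. Assuming a fixed chunk size of 100000 (which the logged chunk starts 100001 and 1500001 suggest), the within-chunk coordinate is start - chunk_start + 1:

```python
# Chunk size inferred from the chunk starts in the log; an assumption, not
# a value taken from the BioMart schema itself.
CHUNK_SIZE = 100000

def chunk_coords(start):
    """Map a 1-based genomic start to (chunk_start, within-chunk coord)."""
    chunk_start = ((start - 1) // CHUNK_SIZE) * CHUNK_SIZE + 1
    return chunk_start, start - chunk_start + 1

# Values from the log: start 143411 lives in the chunk beginning at 100001,
# at within-chunk coordinate 43411; start 1509796 maps to (1500001, 9796).
print(chunk_coords(143411))   # -> (100001, 43411)
print(chunk_coords(1509796))  # -> (1500001, 9796)
```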
rochak
On Jul 10, 2009, at 2:12 PM, Junjun Zhang wrote:
Hi Rochak,
Sorry for not getting back to you sooner. Based on the information you provided below, all the necessary indexes are there, so there seems to be nothing quick we can do to speed up sequence queries significantly.
Of course, we are happy to look into this issue more closely. By the way, how many rows do you have in these tables: phytozome__transcript__main, phytozome_structure__structure__main, and sequence_phytozome__dna_chunks__main? It would also be helpful if you could find out what proportion of the time each of the 3 steps (you described below) takes while retrieving 30k peptides.
In the meantime, if your mart users still have concerns about spending 5 minutes on getting 30k sequences, you could probably advise them to use the 'Email notification' option in the MartView GUI.
We will keep you updated once we have something concrete.
Please feel free to contact us should you have any further questions.
Kind regards,
Junjun
From: [email protected] [mailto:[email protected]] On
Behalf Of Rochak Neupane
Sent: Wednesday, July 08, 2009 7:04 PM
To: Syed Haider
Cc: David M. Goodstein; BioMart mart-dev
Subject: Re: [mart-dev] query performance on sequences
From looking at the logs, it appears that when grabbing sequences, BioMart makes 3 different types of queries.
1) Query the main table for transcript_id_key from the transcript__main table (exportable) - this is indexed.
EXPLAIN on the query generated by BioMart:
explain SELECT main.transcript_id_key, main.organism_name,
main.gene_name FROM phytozome_mart.phytozome__transcript__main main
WHERE (main.transcript_id_key = '16429347') AND (main.organism_id =
'113') LIMIT 200;
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys                               | key     | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
|  1 | SIMPLE      | main  | const | PRIMARY,transcript_id_key,org_id,org_pac_id | PRIMARY | 4       | const |    1 |       |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
2) Query the structure table for coordinates, using transcript_id_key from step 1. transcript_id_key is indexed in this table as well.
mysql> explain SELECT main.transcript_id_key,
main.transcript_id_key, main.chr_name, main.exon_cds_start,
main.exon_cds_end, main.exon_chrom_strand, main.exon_phase,
main.exon_rank, main.organism_id FROM
phytozome_mart.phytozome_structure__structure__main main WHERE
(main.transcript_id_key IN('16429347')) LIMIT 200;
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys     | key               | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | main  | ref  | transcript_id_key | transcript_id_key | 5       | const |    6 | Using where |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
3) Finally, using the locations from step 2, query the sequence mart using chr_name, chr_start, and org_id. All three of these are indexed (index shown below).
BioMart.Dataset.GenomicSequence.DNAAdaptor:176:INFO> QUERY FULL SQL: select sequence from sequence_phytozome__dna_chunks__main where chr_start = ? and chr_name = ? and org_id = ?
BioMart.Dataset.GenomicSequence.DNAAdaptor:177:INFO> QUERY SUB SQL: select substring(sequence, ?, ?) from sequence_phytozome__dna_chunks__main where chr_start = ? and chr_name = ? and org_id = ?
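The two queries above imply a chunked-sequence scheme: the genome is stored in fixed-size chunks keyed by (chr_name, chr_start, org_id), and each exon is carved out with a substring call. A toy reconstruction using SQLite (table and column names follow the log; the data and chunk length are made up):

```python
import sqlite3

# Toy model of the chunked-sequence storage the logged queries suggest.
# Table/column names follow the log; the sequence data here is fabricated.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sequence_phytozome__dna_chunks__main
               (chr_name TEXT, chr_start INTEGER, org_id INTEGER, sequence TEXT)""")
con.execute("INSERT INTO sequence_phytozome__dna_chunks__main VALUES (?,?,?,?)",
            ("scaffold_177", 100001, 77, "ACGT" * 25))  # 100-bp stand-in chunk

def fetch_exon(chr_name, chunk_start, coord, length, org_id):
    # Mirrors: select substring(sequence, ?, ?) from ...
    #          where chr_start = ? and chr_name = ? and org_id = ?
    row = con.execute(
        "SELECT substr(sequence, ?, ?) FROM sequence_phytozome__dna_chunks__main "
        "WHERE chr_start = ? AND chr_name = ? AND org_id = ?",
        (coord, length, chunk_start, chr_name, org_id)).fetchone()
    return row[0]

print(fetch_exon("scaffold_177", 100001, 5, 8, 77))  # -> ACGTACGT
```

Each exon in the log maps to one such single-row query, which is why the sequence step scales with the exon count rather than the transcript count.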
Trying an explain with values filled in:
explain select substring(sequence, 29665, 47) from
sequence_mart.sequence_phytozome__dna_chunks__main where chr_start
= 200001 and chr_name = 'scaffold_406' and org_id = 77;
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
| id | select_type | table                                | type | possible_keys                 | key  | key_len | ref               | rows | Extra       |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
|  1 | SIMPLE      | sequence_phytozome__dna_chunks__main | ref  | chr_name,chr_start,org_id,nsi | nsi  | 52      | const,const,const |    1 | Using where |
+----+-------------+--------------------------------------+------+-------------------------------+------+---------+-------------------+------+-------------+
mysql> show index from
sequence_mart.sequence_phytozome__dna_chunks__main where key_name =
'nsi';
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                                | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            1 | chr_name    | A         |       41014 | NULL     | NULL   |      | BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            2 | chr_start   | A         |      123043 | NULL     | NULL   | YES  | BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            3 | org_id      | A         |      123043 | NULL     | NULL   | YES  | BTREE      |         |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
It seems like the proper indexes are used when running these queries. Is there any way to improve these indexes, or any other way to improve performance?
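One pattern in the QUERY SUBSTRING log is that consecutive exon queries often hit the same chunk (scaffold_177 / 100001 appears four times in a row). A speculative client-side idea, sketched below with hypothetical names, would be to fetch each chunk once and take the substrings locally instead of issuing one round trip per exon:

```python
# Speculative sketch (hypothetical class and callback, not BioMart code):
# cache the most recently fetched chunk so that consecutive exons on the
# same chunk need no extra database query.
class ChunkCache:
    def __init__(self, fetch_chunk):
        self.fetch_chunk = fetch_chunk  # callable returning a full chunk sequence
        self.key = None
        self.seq = None

    def substring(self, chr_name, chunk_start, org_id, coord, length):
        key = (chr_name, chunk_start, org_id)
        if key != self.key:  # only hit the database when the chunk changes
            self.key = key
            self.seq = self.fetch_chunk(*key)
        return self.seq[coord - 1:coord - 1 + length]  # coord is 1-based

calls = []
def fake_fetch(chr_name, chunk_start, org_id):
    calls.append((chr_name, chunk_start, org_id))
    return "ACGT" * 25  # 100-bp stand-in chunk

cache = ChunkCache(fake_fetch)
cache.substring("scaffold_177", 100001, 77, 5, 8)
cache.substring("scaffold_177", 100001, 77, 20, 4)
print(len(calls))  # -> 1: two exons, one database round trip
```

This only illustrates the access pattern; whether it would help in practice depends on where DNAAdaptor issues its queries.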
Thanks,
rochak
On Tue, Jul 7, 2009 at 9:47 PM, Syed Haider <[email protected]> wrote:
Hi David,
it isn't the processing of coordinates that takes the major proportion of the time; usually it's the time taken by the database to return the results. Could you please check whether indices are in place for the table that serves the sequence coordinates?
thanks
Syed
David M. Goodstein wrote:
On 7 Jul 2009, at 12:23, Syed Haider wrote:
Hi Rochak,
Rochak Neupane wrote:
When querying for sequences on a dataset without any filters (so as to grab the complete set of sequences), marts seem to be quite slow. Pulling peptides for human, for example, took in excess of 7 minutes from the Ensembl mart. Grabbing peptide sequences for Caenorhabditis elegans (wormbase db, gene dataset from biomart.org) also took about 7 or so minutes for a file that turns out to be 9MB.
Our own mart is quite slow when querying a complete set of sequences from an organism. Is it typical for BioMart to take 7-8 minutes or more when querying for whole-genome sequences?
a- Are you using GenomicSequence to retrieve sequences?
b- Do you have the ORDER BY property set on sequence exportables (if 'a' is true)? Setting ORDER BY slows down the response considerably, and it's only required to cope with inconsistencies in row order that should really be fixed at the mart (database) construction level.
Best,
Syed
Removing the ORDER BY does improve performance (from 15 minutes down to 5 minutes for an unfiltered FASTA grab of approx 30k peptides), but that's still not really user-tolerable. Is that really the expected behavior?
--David
David M. Goodstein
Joint Genome Institute / Lawrence Berkeley National Lab
Center for Integrative Genomics / UCBerkeley
http://www.phytozome.net
Thanks,
rochak