phytozome__transcript__main has 607650 rows,
phytozome_structure__structure__main has 2970857 rows, and
sequence_phytozome__dna_chunks__main has 123043 rows.

As the batch size increases, data retrieval slows down. For example, selecting 12800 rows from the transcript table is very quick; the next step, getting a maximum of 50,000 coordinates from the structure table, took about 3 seconds; and the following step, which actually gets the sequence, takes up the majority of the time, more than 3 minutes.


A few lines of log immediately following the query for coordinates:
BioMart.DatasetI:1174:DEBUG> Date: Tue Jul 14 15:56:50 PDT 2009
BioMart.DatasetI:1175:DEBUG> Got results
BioMart.DatasetI:1206:WARN> Attribute merge using linkName: transcript_id
BioMart.DatasetI:1207:WARN> Before merge: 50000
BioMart.DatasetI:1208:INFO> Date: Tue Jul 14 15:56:50 PDT 2009
BioMart.DatasetI:1296:DEBUG> Importable size: 1
BioMart.DatasetI:1297:DEBUG> Link name: transcript_id
BioMart.DatasetI:1312:DEBUG> Appending 1881276
BioMart.DatasetI:1316:DEBUG> Final key string is: 1881276
......
BioMart.DatasetI:1509:DEBUG> Finished with rows: 50000
BioMart.DatasetI:1213:WARN> After merge: 50000
BioMart.DatasetI:1214:INFO> Date: Tue Jul 14 15:57:02 PDT 2009
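
The merge the log is describing appears to be a join of the structure rows onto the transcript rows on the transcript_id link key. For reference, here is a minimal Perl sketch of a hash join of that shape (an assumed illustration only, not BioMart's actual code; the array-of-arrays row layout and the $key_index argument are hypothetical):

use strict;
use warnings;

sub merge_on_link_key {
    my ($transcript_rows, $structure_rows, $key_index) = @_;

    # Index the structure rows by the transcript_id link key.
    my %by_key;
    push @{ $by_key{ $_->[$key_index] } }, $_ for @$structure_rows;

    # Append the matching structure attributes to each transcript row.
    my @merged;
    for my $row (@$transcript_rows) {
        for my $match (@{ $by_key{ $row->[$key_index] } || [] }) {
            push @merged, [ @$row, @$match ];
        }
    }
    return \@merged;
}

This only shows the shape of the operation; the real code also builds the key strings visible in the 'Appending ...' / 'Final key string ...' lines.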


So merging took about 12 seconds. The next step queries for the sequence for each of the records from above:

BioMart.DatasetI:1213:WARN> After merge: 50000
BioMart.DatasetI:1214:INFO> Date: Tue Jul 14 15:57:02 PDT 2009
BioMart.DatasetI:1265:DEBUG> Attribute hash
BioMart.DatasetI:1266:DEBUG> Before hash: 50000
BioMart.DatasetI:1273:DEBUG> After hash: 50000
BioMart.DatasetI:1280:DEBUG> Returning defined has_data
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 131 (start 143411 coord 43411) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 214 (start 142912 coord 42912) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 117 (start 142490 coord 42490) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_177 100001 93 (start 142044 coord 42044) orgId 77
....

BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 27 (start 1509796 coord 9796) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 633 (start 1509899 coord 9899) orgId 77
BioMart.Dataset.GenomicSequence.DNAAdaptor:239:INFO> QUERY SUBSTRING SQL CHR CHUNKSTART: scaffold_43 1500001 129 (start 1510789 coord 10789) orgId 77
BioMart.DatasetI:1174:DEBUG> Date: Tue Jul 14 16:00:25 PDT 2009

That took more than 3 minutes. I'm not sure how much of that is log4perl writing to the file, but since BioMart takes just as long on our version with the log level set to ERROR, this seems to be where most of the time is spent.
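
For what it's worth, the log shows one QUERY SUBSTRING statement per exon record, so a 50,000-row batch means tens of thousands of individual statements. A minimal DBI sketch of that pattern (an assumed illustration of what the log suggests, not BioMart's actual code; the connection parameters, the @exons layout and the example values are hypothetical):

use strict;
use warnings;
use DBI;

# Placeholder connection parameters.
my $dbh = DBI->connect('dbi:mysql:database=sequence_mart;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# The SUB SQL template from the log, prepared once.
my $sth = $dbh->prepare(
    'select substring(sequence, ?, ?) from sequence_phytozome__dna_chunks__main
     where chr_start = ? and chr_name = ? and org_id = ?'
);

my $org_id = 77;

# Hypothetical per-exon coordinates: [chr_name, chunk_start, offset, length].
my @exons = ( [ 'scaffold_177', 100001, 43411, 131 ] );

my $sequence = '';
for my $exon (@exons) {
    my ($chr, $chunk_start, $offset, $len) = @$exon;

    # One execute per exon: each call is a separate database round trip.
    $sth->execute($offset, $len, $chunk_start, $chr, $org_id);
    my ($piece) = $sth->fetchrow_array;
    $sequence .= $piece;
}

Even with the nsi index making each individual lookup cheap, issuing that many statements one at a time adds up, which would be consistent with the timings above.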

rochak


On Jul 10, 2009, at 2:12 PM, Junjun Zhang wrote:

Hi Rochak,

Sorry for not getting back to you sooner. Based on the information you provided below, all the necessary indexes are there, so there seems to be nothing quick we can do to speed up sequence queries significantly. Of course, we are happy to look into this issue more closely. BTW, how many rows do you have in these tables: phytozome__transcript__main, phytozome_structure__structure__main and sequence_phytozome__dna_chunks__main? It would also be helpful if you could find out what proportion of the time each of the 3 steps (you described below) takes while retrieving 30k peptides.

In the meantime, if your mart users still have concerns about spending 5 minutes getting 30k sequences, you can probably advise them to use the 'Email notification' option on the MartView GUI.

We will keep you updated once we have something concrete.

Please feel free to contact us should you have any further questions.

Kind regards,
Junjun




From: [email protected] [mailto:[email protected]] On Behalf Of Rochak Neupane
Sent: Wednesday, July 08, 2009 7:04 PM
To: Syed Haider
Cc: David M. Goodstein; BioMart mart-dev
Subject: Re: [mart-dev] query performance on sequences

From the logs, it looks like BioMart makes 3 different types of queries when grabbing sequences.

1) Query the main table: transcript_id_key from the transcript__main table (exportable) - this column is indexed.
EXPLAIN on the query generated by BioMart:

explain SELECT main.transcript_id_key, main.organism_name, main.gene_name FROM phytozome_mart.phytozome__transcript__main main WHERE (main.transcript_id_key = '16429347') AND (main.organism_id = '113') LIMIT 200;

+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
| id | select_type | table | type  | possible_keys                               | key     | key_len | ref   | rows | Extra |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+
|  1 | SIMPLE      | main  | const | PRIMARY,transcript_id_key,org_id,org_pac_id | PRIMARY | 4       | const |    1 |       |
+----+-------------+-------+-------+---------------------------------------------+---------+---------+-------+------+-------+

2) Query the structure table for coordinates, using transcript_id_key from step 1. transcript_id_key is indexed in this table as well.

mysql> explain SELECT main.transcript_id_key, main.transcript_id_key, main.chr_name, main.exon_cds_start, main.exon_cds_end, main.exon_chrom_strand, main.exon_phase, main.exon_rank, main.organism_id FROM phytozome_mart.phytozome_structure__structure__main main WHERE (main.transcript_id_key IN('16429347')) LIMIT 200;

+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys     | key               | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | main  | ref  | transcript_id_key | transcript_id_key | 5       | const |    6 | Using where |
+----+-------------+-------+------+-------------------+-------------------+---------+-------+------+-------------+

3) Finally, using the locations from step 2, query the sequence mart using chr_name, chr_start, and org_id. All three of these are indexed (index shown below).

BioMart.Dataset.GenomicSequence.DNAAdaptor:176:INFO> QUERY FULL SQL: select sequence from sequence_phytozome__dna_chunks__main where chr_start = ? and chr_name = ? and org_id = ?
BioMart.Dataset.GenomicSequence.DNAAdaptor:177:INFO> QUERY SUB SQL: select substring(sequence, ?, ?) from sequence_phytozome__dna_chunks__main where chr_start = ? and chr_name = ? and org_id = ?
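
The chunk starts that show up in the DNAAdaptor log lines (100001, 200001, 1500001) suggest the sequence is stored in fixed-size chunks of 100,000 bp, so each genomic coordinate has to be translated into the chunk's chr_start plus an offset for substring(). A small Perl sketch of that arithmetic (the 100,000 bp chunk size is an assumption inferred from the logs, not confirmed against the mart build):

use strict;
use warnings;

# Assumed chunk size, inferred from the chunk starts seen in the logs.
my $CHUNK_SIZE = 100_000;

# Map a 1-based genomic coordinate to the chunk's chr_start and the
# 1-based offset handed to substring(); e.g. coordinate 143411 ->
# chr_start 100001, offset 43411, matching the log lines earlier.
sub coord_to_chunk {
    my ($coord) = @_;
    my $chr_start = int( ($coord - 1) / $CHUNK_SIZE ) * $CHUNK_SIZE + 1;
    my $offset    = $coord - $chr_start + 1;
    return ($chr_start, $offset);
}

my ($chr_start, $offset) = coord_to_chunk(143411);   # (100001, 43411)

Under the same assumption, the EXPLAIN below (offset 29665 in the chunk starting at 200001) would correspond to genomic coordinate 229665.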

Trying an explain with values filled in:
explain select substring(sequence, 29665, 47) from sequence_mart.sequence_phytozome__dna_chunks__main where chr_start = 200001 and chr_name = 'scaffold_406' and org_id = 77;

+----+-------------+--------------------------------------+------+-------------------------------+-----+---------+-------------------+------+-------------+
| id | select_type | table                                | type | possible_keys                 | key | key_len | ref               | rows | Extra       |
+----+-------------+--------------------------------------+------+-------------------------------+-----+---------+-------------------+------+-------------+
|  1 | SIMPLE      | sequence_phytozome__dna_chunks__main | ref  | chr_name,chr_start,org_id,nsi | nsi | 52      | const,const,const |    1 | Using where |
+----+-------------+--------------------------------------+------+-------------------------------+-----+---------+-------------------+------+-------------+

mysql> show index from sequence_mart.sequence_phytozome__dna_chunks__main where key_name = 'nsi';

+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                                | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            1 | chr_name    | A         |       41014 |     NULL | NULL   |      | BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            2 | chr_start   | A         |      123043 |     NULL | NULL   | YES  | BTREE      |         |
| sequence_phytozome__dna_chunks__main |          1 | nsi      |            3 | org_id      | A         |      123043 |     NULL | NULL   | YES  | BTREE      |         |
+--------------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

It seems like the proper indexes are being used when running these queries. Is there any way to improve these indexes, or any other way to improve performance?

Thanks,
rochak

On Tue, Jul 7, 2009 at 9:47 PM, Syed Haider <[email protected]> wrote:
Hi David,

it isn't the processing of coordinates that takes the major proportion of the time; usually it's the time taken by the database to return the results. Could you please check if indices are in place for the table that serves the sequence coordinates?

thanks
Syed


David M. Goodstein wrote:




On 7 Jul 2009, at 12:23, Syed Haider wrote:

Hi Rochak,

Rochak Neupane wrote:
When querying for sequences on a dataset, without any filters (so as to grab the complete set of sequences), marts seem to be quite slow. Pulling peptides for human, for example, took in excess of 7 minutes from the Ensembl mart. Grabbing peptide sequences for Caenorhabditis elegans (wormbase db, gene dataset from biomart.org) also took about 7 or so minutes for a file that turns out to be 9MB. Our own mart is quite slow when querying a complete set of sequences for an organism. Is it typical for BioMart to take 7-8 minutes or more when querying whole-genome sequences?

a- are you using GenomicSequence to retrieve sequences?

b- do you have the ORDER BY property set on sequence exportables (if 'a' is true)? Setting ORDER BY slows down the response considerably, and it's only required to cope with inconsistencies in row order that should really be fixed at the mart (database) construction level.

Best,
Syed


Removing the ORDER BY does improve performance (from 15 minutes down to 5 minutes for an unfiltered FASTA grab of approx 30k peptides), but that's still not really user-tolerable. Is that really the expected behavior?

--David

David M. Goodstein
Joint Genome Institute / Lawrence Berkeley National Lab
Center for Integrative Genomics / UCBerkeley
http://www.phytozome.net



Thanks,
rochak


