Hi,
I am trying to automate regular downloads of human CDS (and UTR)
sequences using BioMart. I have tried the Perl script that BioMart
generates:
use strict;
use warnings;
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;

# Registry file describing the mart server to query
my $confFile =
    "/home/projects/ensembl/biomart-perl/conf/apiExampleRegistry.xml";

# 'cached' reuses the registry from a previous run instead of rebuilding it
my $action = 'cached';
my $initializer = BioMart::Initializer->new('registryFile' => $confFile,
                                            'action'       => $action);
my $registry = $initializer->getRegistry;

my $query = BioMart::Query->new('registry'          => $registry,
                                'virtualSchemaName' => 'default');
$query->setDataset("hsapiens_gene_ensembl");
$query->addAttribute("ensembl_gene_id");
$query->addAttribute("ensembl_transcript_id");
$query->addAttribute("coding");            # the CDS sequence itself
$query->addAttribute("external_gene_id");
$query->formatter("FASTA");

my $query_runner = BioMart::QueryRunner->new();
$query_runner->uniqueRowsOnly(1);          # obtain unique rows only
$query_runner->execute($query);
$query_runner->printHeader();
$query_runner->printResults();
$query_runner->printFooter();
This retrieves only a few sequences and then starts returning
"Problems with the web server: 500 read timeout".
I have also tried posting the XML query with LWP in Perl. This
downloads more sequences, but it too stops after a while, before all
of the sequences have been retrieved:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Headers;

# Slurp the BioMart XML query from the file named on the command line
open(my $fh, '<', $ARGV[0]) || die("\nUsage: perl postXML.pl Query.xml\n\n");
my $xml = do { local $/; <$fh> };
close($fh);

my $path = "http://www.biomart.org/biomart/martservice?";
my $request = HTTP::Request->new("POST", $path, HTTP::Headers->new(),
                                 'query=' . $xml . "\n");

my $ua = LWP::UserAgent->new;
$ua->timeout(30000000);

# Stream the response in 500-byte chunks so results are printed as
# they arrive instead of being buffered in memory
$ua->request($request,
             sub {
                 my ($data, $response) = @_;
                 if ($response->is_success) {
                     print $data;
                 }
                 else {
                     warn("Problems with the web server: "
                          . $response->status_line);
                 }
             },
             500);
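
For reference, the Query.xml I post asks for the same dataset and
attributes as the API script above; it looks roughly like this
(minimal version, some optional Query attributes omitted):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="FASTA" header="0"
       uniqueRows="1" count="">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id" />
    <Attribute name="ensembl_transcript_id" />
    <Attribute name="coding" />
    <Attribute name="external_gene_id" />
  </Dataset>
</Query>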
I have managed to download all the sequences through the browser
before, but it took several tries, and I had to request them gzipped
(which also let me verify, when gunzipping, that I had received the
complete file).
So, my question is: is there anything I can do to reliably download
all the sequences? That is, avoid the timeouts, find some easy and
systematic way to split my query into much smaller ones, or something
else?
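The kind of splitting I have in mind would be one query per
chromosome, looping with the API from the first script. A rough,
untested sketch of the idea ("chromosome_name" is my guess at the
right filter name):

# Rough sketch of per-chromosome chunking (untested);
# assumes $registry is set up as in the first script above
foreach my $chr (1 .. 22, 'X', 'Y', 'MT') {
    my $query = BioMart::Query->new('registry'          => $registry,
                                    'virtualSchemaName' => 'default');
    $query->setDataset("hsapiens_gene_ensembl");
    $query->addFilter("chromosome_name", ["$chr"]);   # restrict to one chromosome
    $query->addAttribute("ensembl_gene_id");
    $query->addAttribute("ensembl_transcript_id");
    $query->addAttribute("coding");
    $query->addAttribute("external_gene_id");
    $query->formatter("FASTA");

    my $runner = BioMart::QueryRunner->new();
    $runner->uniqueRowsOnly(1);
    $runner->execute($query);
    $runner->printResults();
}

Would something along these lines be reasonable, or is there a
recommended way to do this?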
Thanks,
Elfar