Hi, Tony!

Reminder: there are many excellent short-read aligners available today.

-Galt

On Mon, Jul 16, 2012 at 5:56 PM, Tony Travis <[email protected]> wrote:

> On 16/07/12 22:28, Mbandi S.K wrote:
>
>> Dear Tony,
>>
>> I just wish to mention here that we have effectively mapped millions of
>> paired-end Illumina sequences onto fungal-size genomes by
>> splitting the query into files of 1000-5000 sequences on a desktop running
>> 32-bit Ubuntu 10.04. We simply use the Unix command: split -l 2 -a 5 -d
>> fasta_file x/suffix  to generate the files in directory 'x'. We have been
>> using this approach to visually assess assembly quality through
>> integration with GBrowse.
>>
>
> Hi, Mbandi.
>
> I use Jim Kent's faSplit to do this sort of thing:
>
>   http://genomewiki.ucsc.edu/index.php/Kent_source_utilities
>
> There must be a typo in your split command, because "-l 2" means create
> output with two lines per file (by default, split creates output with 1000
> lines per file). Even that would create hundreds of thousands of files in
> my case, and each time I run BLAT on one of these files, I have to read the
> database and create indexes, but I'm reading about using the "gfServer"
> instead to keep these indexes in memory...
>
> We are looking for chimeric hybrid reads in Illumina deep sequencing data
> and we typically have 20 million raw reads, deduped and quality filtered. I
> want to BLAT against the full human genome because we are looking for
> matches to intronic sequences. We have already identified miRNA/mRNA
> chimeras using smaller databases. I'm very impressed by the performance of
> BLAT, but as I posted here we get segfaults running it with a ~20GiB
> database created from a full Ensembl version of hg19.
>
> Our BLAT does not segfault on smaller databases and I'll follow the advice
> of people on this list to split the database into smaller parts. It's quite
> likely that I did something wrong creating my 2bit format database from the
> top-level Ensembl hg19 FASTA file:
>
>   Homo_sapiens.GRCh37.67.dna.toplevel.fa
>
> The 2bit version of hg19 that I downloaded from UCSC is a lot smaller and
> "blat" doesn't cause any segfaults:
>
>   http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
>
> This runs in < 4GiB RAM, but mine uses >20GiB and causes segfaults :-(
>
> I put everything into my BLAT 2bit DB, including all the supercontigs and
> patches etc., which was probably not the right thing to do...
>
> Thanks for your helpful comments,
>
>   Tony.
>
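
A note on the split command quoted above: "split -l 2" puts one two-line
FASTA record in each file, and only if every sequence sits on a single
line. Splitting by record count avoids both problems. The following is a
minimal Python sketch of that idea, not anything shipped with BLAT or the
Kent utilities; the function name split_fasta, the "chunk" prefix and the
five-digit suffix are assumptions made for illustration.

```python
import os


def split_fasta(fasta_path, out_dir, records_per_file=1000, prefix="chunk"):
    """Split a FASTA file into numbered files of at most records_per_file records.

    Records may span several lines; a new record starts at each '>' header.
    Returns the list of output paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    count = 0  # number of headers seen so far
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                # Start a new output file every records_per_file headers.
                if count % records_per_file == 0:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"{prefix}{len(paths):05d}.fa")
                    out = open(path, "w")
                    paths.append(path)
                count += 1
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

With 20 million reads, records_per_file=5000 would give about 4000 files
rather than hundreds of thousands, for example
split_fasta("reads.fa", "x", records_per_file=5000), where "reads.fa" is a
placeholder file name.
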
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome
