Galt,

As I mentioned in my first email, I am talking about GenBank mRNA 
"BM451627". The track description for mRNAs reads:

--

Methods

GenBank human mRNAs were aligned against the genome using the blat 
program. When a single mRNA aligned in multiple places, the alignment 
having the highest base identity was found. Only alignments having a 
base identity level within 0.5% of the best and at least 96% base 
identity with the genomic sequence were kept.

--
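
Read literally, the Methods text above describes a filter with two conditions on identity. A minimal Python sketch of that reading follows. It is not UCSC's actual code: the `Aln` record, the `near_best` function, and the identity formula (matching bases over aligned query bases) are assumptions made for illustration, and the three alignments are made-up numbers.

```python
from collections import namedtuple

# Hypothetical record: query name, query size, aligned query bases, matching bases.
Aln = namedtuple("Aln", "qname qsize aligned matches")

def identity(a):
    """Percent identity over the aligned stretch only (assumed definition)."""
    return 100.0 * a.matches / a.aligned

def near_best(alns, min_id=96.0, near_top=0.5):
    """Keep alignments with identity >= min_id and within near_top of the query's best."""
    best = {}
    for a in alns:
        best[a.qname] = max(best.get(a.qname, 0.0), identity(a))
    return [a for a in alns
            if identity(a) >= min_id and identity(a) >= best[a.qname] - near_top]

alns = [
    Aln("mrna1", 1200, 1200, 1176),  # full length, 98% identity
    Aln("mrna1", 1200, 150, 147),    # 150nt fragment, also 98% identity
    Aln("mrna1", 1200, 300, 285),    # 95% identity, below the 96% floor
]
kept = near_best(alns)
for a in kept:
    print(a.qname, a.aligned, identity(a))
```

Under this reading both the 1200nt and the 150nt alignment survive, because nothing in the filter looks at how much of the query is covered.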

I don't need a manual for blat or its post-filters. I want to know what you 
are doing for the genome browser alignments, and whether I interpret 
it correctly that you keep both alignments of a 1200nt mRNA if one is a 
150nt subsequence with X% identity (over those 150nt) and the other 
is a 1200nt alignment also with X% identity (over the 1200nt), 
provided of course that X is at least 96.
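
To make the distinction concrete, identity and query coverage can be computed side by side from PSL rows. The sketch below assumes the standard PSL column order (matches, misMatches, repMatches in columns 1 to 3, qName and qSize in columns 10 and 11). The two rows are invented for illustration, the function `psl_stats` and the `min_cover` threshold of 0.5 are hypothetical, and the formulas are one plausible definition, not necessarily what pslReps or pslCDnaFilter compute.

```python
def psl_stats(line):
    """Return (qName, percent identity over aligned bases, query coverage) for one PSL row."""
    f = line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_name, q_size = f[9], int(f[10])
    aligned = matches + mismatches + rep_matches
    ident = 100.0 * (matches + rep_matches) / aligned
    cover = aligned / q_size
    return q_name, ident, cover

# Two made-up PSL rows for the same 1200nt query (values are illustrative only).
full = "\t".join(["1176", "24", "0", "0", "0", "0", "3", "5000", "+",
                  "mrna1", "1200", "0", "1200", "chr1", "100000", "1000", "7200",
                  "4", "300,300,300,300,", "0,300,600,900,", "1000,3000,5000,6900,"])
frag = "\t".join(["147", "3", "0", "0", "0", "0", "0", "0", "+",
                  "mrna1", "1200", "400", "550", "chr7", "90000", "2000", "2150",
                  "1", "150,", "400,", "2000,"])

stats = [psl_stats(row) for row in (full, frag)]
min_cover = 0.5  # hypothetical threshold
kept = [s for s in stats if s[2] >= min_cover]
for name, ident, cover in stats:
    print(name, ident, cover)
```

Both rows have the same identity over their aligned stretch, but only the full-length one passes once a coverage threshold is applied.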

Sorry, I seem to be expressing myself in a very complicated way.

micha.


Galt Barber wrote:
>
> How does "further filtered" translate into "no control" ?
> How does BLAT-output-behavior become "kept at UCSC" ?
>
> pslCDnaFilter for example can filter for coverage as well as many
> other things.
>
> If you look in the kent/src/hg/makeDb/doc/ directory
> you will see plenty of examples of what UCSC does with BLAT.
>
> BLAT can filter on %ID, it can filter on score.
> You can use this to throw out really terrible
> low-scoring alignments.
>
> But even just for normal use, people filter their BLAT output
> for desired properties.
>
> Real mRNAs for example have very long introns between short exons.
>
> Take it to heart when I tell you that you need to post-filter psls.
>
> -- 
>
> Here's an example of the help screen from pslReps (included in blatSrcXX.zip)
>
> [hgwdev:blatSrc34> pslReps
> pslReps - analyse repeats and generate genome wide best
> alignments from a sorted set of local alignments
> usage:
>     pslReps in.psl out.psl out.psr
> where in.psl is an alignment file generated by psLayout and
> sorted by pslSort, out.psl is the best alignment output
> and out.psr contains repeat info
> options:
>     -nohead don't add PSL header
>     -ignoreSize Will not weigh in favor of larger alignments so much
>     -noIntrons Will not penalize for not having introns when calculating
>               size factor
>     -singleHit  Takes single best hit, not splitting into parts
>     -minCover=0.N minimum coverage to output.  Default is 0.
>     -ignoreNs Ignore 'N's when calculating minCover.
>     -minAli=0.N minimum alignment ratio
>                default is 0.93
>     -nearTop=0.N how much can deviate from top and be taken
>                default is 0.01
>     -minNearTopSize=N  Minimum size of alignment that is near top
>                for alignment to be kept.  Default 30.
>     -coverQSizes=file Tab-separate file with effective query sizes.
>                      When used with -minCover, this allows polyAs
>                      to be excluded from the coverage calculation
>
> pslCDnaFilter has many more options even than pslReps.
>
> -- 
>
> Here's an example sketch of BLAT use at UCSC.
>
> 1. mkdir someplace
>
> 2. faSplit query.fa and database.fa to make smaller chunks of e.g. genome
>
> 3. run gensub to create parasol batch-file
> pairing every query chunk against every db chunk.
>
> 4. run BLAT cluster job using parasol, each job
> just BLATs one chunk-pair.
>
> 5. Might need to use liftUp to correct coordinates for
> earlier splitting step.
>
> 6. cat all the results together into a big psl file.
>
> 7. use pslCDnaFilter or pslReps to filter the psls
> to have the desired characteristics.
>
> -Galt
>
>
> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>
>> Hi Galt,
>>
>> somehow we seem to be misunderstanding each other. My question is
>> aimed exactly at what looks to me like missing filtering of the blat
>> output for multiple alignments of the same query with equal
>> identities but strongly differing coverage of the query. Do I
>> understand correctly that there is no control at all, and all
>> alignments with equal identities (>96% and >maxIdentity-0.5% over the
>> aligned stretch) are kept at UCSC?
>>
>> Thank you, micha.
>>
>> Galt Barber wrote:
>>>
>>> yes, it is common that the blat output
>>> is further filtered by some method.
>>>
>>> -Galt
>>>
>>>
>>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>>
>>>> Hi Galt,
>>>>
>>>> thank you for the quick answer. I prefer programs to scripts. Did I
>>>> understand correctly that the scenario I sketched, where 10% of a
>>>> transcript aligns with the same identity as 100% of the transcript and
>>>> both alignments are kept, occurs at UCSC? Is it frequent?
>>>>
>>>> Thank you, micha.
>>>>
>>>>
>>>> Galt Barber wrote:
>>>>>
>>>>> Percent identity only applies to the regions aligned.
>>>>> Gaps are not considered aligned regions of course.
>>>>>
>>>>> You are interested in the level of coverage.
>>>>> You can use utilities like pslReps and pslCDnaFilter
>>>>> to post-filter your BLAT psl results.  You can also
>>>>> just write your own script to filter it however you like.
>>>>>
>>>>> You might also try the BLAT options -maxGap=0 and -fastMap.
>>>>>
>>>>> -Galt
>>>>>
>>>>>
>>>>> On Mon, 10 Nov 2008, Micha Sammeth wrote:
>>>>>
>>>>>> Hello helpdesk,
>>>>>>
>>>>>> I have another question concerning blat alignments. Consider the case
>>>>>> BM451627. For some reason it was not in the hg17 dataset, so I
>>>>>> downloaded the sequence and ran blat with -stepSize=5 -minScore=0
>>>>>> -minIdentity=0, which should correspond to the settings used at UCSC.
>>>>>> Checking identity, I find a highest score of about 20% sequence
>>>>>> identity (match/query length), and it does not change much when
>>>>>> repmatch is additionally taken into account.
>>>>>>
>>>>>> However, I found BM451627 in the UCSC hg16 database, where it is
>>>>>> reported with a ~98% identity match. Looking closer at the alignment,
>>>>>> in hg16 there is a ~150nt stretch of the 1244nt that aligns with 98%
>>>>>> identity, and probably a couple of bases that changed from hg16 to hg17
>>>>>> are responsible for sequence identity in this 150nt region dropping
>>>>>> below the 96% threshold.
>>>>>>
>>>>>> My question now is:
>>>>>>
>>>>>> Does this hold for all identities? Say a transcript aligns with 1000nt
>>>>>> at 98% identity in one place and with 100nt at 98% in another place:
>>>>>> will it be put in both places, regardless of how much of the transcript
>>>>>> the alignment covers? In other words, is the identity criterion of 96%,
>>>>>> or within 0.5% of the best alignment, applied to match/(Qend-Qstart)?
>>>>>> And if so, what was the motivation for not taking the "global identity"
>>>>>> of the query? Did you have bad experiences with transcripts that did
>>>>>> not want to align that way?
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> micha.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Genome maillist  -  [email protected]
>>>>>> http://www.soe.ucsc.edu/mailman/listinfo/genome
>>>>>>
>>>>>
>>>>
>>>> -- 
>>>> O       o O       o O       o    Dr. Michael Sammeth
>>>> | O   o | | O   o | | O   o |         http://www.sammeth.net
>>>> | | O | | | | O | GRIB| O   |         Phone: +34-933-160-166
>>>> | o   O | | o   O | | o   O |    Fax:   +34 933-969-983
>>>> o       O o       O o       O    Dr. Aiguader 88, 08003 Barcelona
>>>>
>>>>
>>>
>>
>>
>
