Re: [Genome] About -mask=lower option in blat?

Hiram Clawson Mon, 21 Jun 2010 10:55:10 -0700

Good Morning:

The examples you list here are an excellent case of demonstrating
how the masking blat functions.  Your single result in the case
with the -mask=lower option shows that the result is in a region
of the genome with no repeat masking.  This is what the option means:


> -mask=type  Mask out repeats.  Alignments won't be started in masked region
>             but may extend through it in nucleotide searches.  Masked areas
>             are ignored entirely in protein or translated searches.

Take a look at the genomic sequence in that one result position:

$ twoBitToFa -seq=chr5 -start=135066875 -end=135066898 mm9.2bit stdout
 >chr5:135066875-135066898
GGACCAGGCTGGCCTCACTCACA

All upper case, no masking.  Verify with the rmsk table:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \
        -e select count(*) from chr5_rmsk WHERE (genoStart < 135066898 AND 
genoEnd > 135066875); mm9
+----------+
| count(*) |
+----------+
|        0 |
+----------+

Zero elements in this area, no masking.

Your results obtained without the -mask=lower option all show results
in areas with repeat masked sequence, lower case letters:

$ twoBitToFa -seq=chr6 -start=83518525 -end=83518548 mm9.2bit stdout
 >chr6:83518525-83518548
ctgtgtgtgaggccagcttggtc
$ twoBitToFa -seq=chr5 -start=144157072 -end=144157113 mm9.2bit stdout
 >chr5:144157072-144157113
ggggagctctgtgtgtgaggccagcctggtccaggacagcc
$ twoBitToFa -seq=chr12 -start=52996615 -end=52996649 mm9.2bit stdout
 >chr12:52996615-52996649
tgtgagttcaaagccagcctggtccgggacaacc
$ twoBitToFa -seq=chr9 -start=63135307 -end=63135798 mm9.2bit stdout
 >chr9:63135307-63135798
gacctggctggcctcacacacagagatccaccagcccccgccttctggtg
ctgtgtttaaaggtgtgcaccaccatgcctagctGGTTCTTCTTAGGGTA
AAATGAGCAGAGGGTTTGGGTTTAGTTCTTACCCAGTTTCTCTATCATTT
CTATTCTATAAATCTAGCCAAAAAAGTAACTCTCAGGTAGATCAGAGCAA
ACATTAGTGCTAGACAGACCAGGAGTCTTGCTGCTGCCTGTGACGCTCTG
AAGCAGGGTCCAGGAAAGGACTTTAAGCCGACCACAGCAACACTATCACT
GCCATCACTGTGTGCTCTCAAGACAGCTGCTCTGAGAGCCCAGCTCACTG
AAAAGTAGAAACTTAGGACTTCCTGTTAATGGTCGGCATTGGTTAATCGT
TGTAAGAATCCCAAAAATATACTTATATACCTTGAAGGTAACAAGCACCT
TACTTACTAACAATATGAGAACTAAGGTAACACATCTCCCC
$ twoBitToFa -seq=chr11 -start=101420081 -end=101420101 mm9.2bit stdout
 >chr11:101420081-101420101
ccaggctggcctcacacaca

Verify with the rmsk tables:


$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*) 
from chr6_rmsk 
WHERE (genoStart < 83518548 AND genoEnd > 83518525);" mm9
+----------+
| count(*) |
+----------+
|        1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*) 
from chr5_rmsk 
WHERE (genoStart < 144157113 AND genoEnd > 144157072);" mm9
+----------+
| count(*) |
+----------+
|        1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*) 
from chr12_rmsk 
WHERE (genoStart < 52996649 AND genoEnd > 52996615);" mm9
+----------+
| count(*) |
+----------+
|        1 |
+----------+


$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*) 
from chr9_rmsk 
WHERE (genoStart < 63135798 AND genoEnd > 63135307);" mm9
+----------+
| count(*) |
+----------+
|        1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*) 
from chr11_rmsk 
WHERE (genoStart < 101420101 AND genoEnd > 101420081);" mm9
+----------+
| count(*) |
+----------+
|        1 |
+----------+


Your best match is at chr5:144,157,073-144,157,113
which is entirely within a repeat masked region.  You should evaluate
the requirements of your experiment and decide if you should be eliminating
matches that fall within repeat masked regions.

--Hiram

Peng Yu wrote:
> I have the following sequence in the query.fasta file. I run the following
> commands (one with mask, one without mask) and I got the following results.
> The one without mask gives me five results (just as in the online blat). The
> one with mask gives me only one result, which is not in the five results of
> the other run. I looked at the genomic region around the five results, I
> found no repeatMasker marked region. Therefore, I don't understand why the
> masked run doesn't give the same results as the unmasked run. I know that
> -mask=lower means excluding lower cased part of the genome, but what those
> lower case regions mean? Could you help me understand the -mask option a
> little better?
> 
>> HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3
> GGGGAGCTCTGTGTGTGAGGCCAGCCTGGTCCGGGACAGCC
> 
> 
> 
> ###########################
> blat -noHead -t=dna -q=dna -tileSize=11 -stepSize=5 -minScore=20 -out=psl
> mm9.2bit -mask=lower query.fasta mask.psl
> 
> 22 1 0 0 0 0 0 0 - HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 9 32
> chr5 152537259 135066875 135066898 1 23, 9, 135066875,
> ##########################
> 
> 
> #########################
> blat -noHead -t=dna -q=dna -tileSize=11 -stepSize=5 -minScore=20
> -ooc=mm9_11.ooc -out=psl mm9.2bit query.fasta mask.psl
> 
> 22 1 0 0 0 0 0 0 + HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 8 31
> chr6 149517037 83518525 83518548 1 23, 8, 83518525,
> 40 1 0 0 0 0 0 0 + HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 0 41
> chr5 152537259 144157072 144157113 1 41, 0, 144157072,
> 27 1 0 0 0 0 1 6 + HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 13 41
> chr12 121257530 52996615 52996649 2 6,22, 13,19, 52996615,52996627,
> 30 1 0 0 0 0 1 460 - HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 0 31
> chr9 124076172 63135307 63135798 2 25,6, 10,35, 63135307,63135792,
> 20 0 0 0 0 0 0 0 - HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 9 29
> chr11 121843856 101420081 101420101 1 20, 12, 101420081,
> #######################
> 
> 
_______________________________________________
Genome maillist  -  [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Re: [Genome] About -mask=lower option in blat?

Reply via email to