Good Morning:
The examples you list here are an excellent case of demonstrating
how the masking blat functions. Your single result in the case
with the -mask=lower option shows that the result is in a region
of the genome with no repeat masking. This is what the option means:
> -mask=type Mask out repeats. Alignments won't be started in masked region
> but may extend through it in nucleotide searches. Masked areas
> are ignored entirely in protein or translated searches.
Take a look at the genomic sequence in that one result position:
$ twoBitToFa -seq=chr5 -start=135066875 -end=135066898 mm9.2bit stdout
>chr5:135066875-135066898
GGACCAGGCTGGCCTCACTCACA
All upper case, no masking. Verify with the rmsk table:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \
-e select count(*) from chr5_rmsk WHERE (genoStart < 135066898 AND
genoEnd > 135066875); mm9
+----------+
| count(*) |
+----------+
| 0 |
+----------+
Zero elements in this area, no masking.
Your results obtained without the -mask=lower option all show results
in areas with repeat masked sequence, lower case letters:
$ twoBitToFa -seq=chr6 -start=83518525 -end=83518548 mm9.2bit stdout
>chr6:83518525-83518548
ctgtgtgtgaggccagcttggtc
$ twoBitToFa -seq=chr5 -start=144157072 -end=144157113 mm9.2bit stdout
>chr5:144157072-144157113
ggggagctctgtgtgtgaggccagcctggtccaggacagcc
$ twoBitToFa -seq=chr12 -start=52996615 -end=52996649 mm9.2bit stdout
>chr12:52996615-52996649
tgtgagttcaaagccagcctggtccgggacaacc
$ twoBitToFa -seq=chr9 -start=63135307 -end=63135798 mm9.2bit stdout
>chr9:63135307-63135798
gacctggctggcctcacacacagagatccaccagcccccgccttctggtg
ctgtgtttaaaggtgtgcaccaccatgcctagctGGTTCTTCTTAGGGTA
AAATGAGCAGAGGGTTTGGGTTTAGTTCTTACCCAGTTTCTCTATCATTT
CTATTCTATAAATCTAGCCAAAAAAGTAACTCTCAGGTAGATCAGAGCAA
ACATTAGTGCTAGACAGACCAGGAGTCTTGCTGCTGCCTGTGACGCTCTG
AAGCAGGGTCCAGGAAAGGACTTTAAGCCGACCACAGCAACACTATCACT
GCCATCACTGTGTGCTCTCAAGACAGCTGCTCTGAGAGCCCAGCTCACTG
AAAAGTAGAAACTTAGGACTTCCTGTTAATGGTCGGCATTGGTTAATCGT
TGTAAGAATCCCAAAAATATACTTATATACCTTGAAGGTAACAAGCACCT
TACTTACTAACAATATGAGAACTAAGGTAACACATCTCCCC
$ twoBitToFa -seq=chr11 -start=101420081 -end=101420101 mm9.2bit stdout
>chr11:101420081-101420101
ccaggctggcctcacacaca
Verify with the rmsk tables:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*)
from chr6_rmsk
WHERE (genoStart < 83518548 AND genoEnd > 83518525);" mm9
+----------+
| count(*) |
+----------+
| 1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*)
from chr5_rmsk
WHERE (genoStart < 144157113 AND genoEnd > 144157072);" mm9
+----------+
| count(*) |
+----------+
| 1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*)
from chr12_rmsk
WHERE (genoStart < 52996649 AND genoEnd > 52996615);" mm9
+----------+
| count(*) |
+----------+
| 1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*)
from chr9_rmsk
WHERE (genoStart < 63135798 AND genoEnd > 63135307);" mm9
+----------+
| count(*) |
+----------+
| 1 |
+----------+
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select count(*)
from chr11_rmsk
WHERE (genoStart < 101420101 AND genoEnd > 101420081);" mm9
+----------+
| count(*) |
+----------+
| 1 |
+----------+
Your best match is at chr5:144,157,073-144,157,113
which is entirely within a repeat masked region. You should evaluate
the requirements of your experiment and decide if you should be eliminating
matches that fall within repeat masked regions.
--Hiram
Peng Yu wrote:
> I have the following sequence in the query.fasta file. I run the following
> commands (one with mask, one without mask) and I got the following results.
> The one without mask gives me five results (just as in the online blat). The
> one with mask gives me only one result, which is not in the five results of
> the other run. I looked at the genomic region around the five results, I
> found no repeatMasker marked region. Therefore, I don't understand why the
> masked run doesn't give the same results as the unmasked run. I know that
> -mask=lower means excluding lower cased part of the genome, but what those
> lower case regions mean? Could you help me understand the -mask option a
> little better?
>
>> HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3
> GGGGAGCTCTGTGTGTGAGGCCAGCCTGGTCCGGGACAGCC
>
>
>
> ###########################
> blat -noHead -t=dna -q=dna -tileSize=11 -stepSize=5 -minScore=20 -out=psl
> mm9.2bit -mask=lower query.fasta mask.psl
>
> 22 1 0 0 0 0 0 0 - HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 9 32
> chr5 152537259 135066875 135066898 1 23, 9, 135066875,
> ##########################
>
>
> #########################
> blat -noHead -t=dna -q=dna -tileSize=11 -stepSize=5 -minScore=20
> -ooc=mm9_11.ooc -out=psl mm9.2bit query.fasta mask.psl
>
> 22 1 0 0 0 0 0 0 + HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 8 31
> chr6 149517037 83518525 83518548 1 23, 8, 83518525,
> 40 1 0 0 0 0 0 0 + HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 0 41
> chr5 152537259 144157072 144157113 1 41, 0, 144157072,
> 27 1 0 0 0 0 1 6 + HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 13 41
> chr12 121257530 52996615 52996649 2 6,22, 13,19, 52996615,52996627,
> 30 1 0 0 0 0 1 460 - HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 0 31
> chr9 124076172 63135307 63135798 2 25,6, 10,35, 63135307,63135792,
> 20 0 0 0 0 0 0 0 - HWI-EAS11X_10097:4:1:1049:11109#0__1__3__41__3 41 9 29
> chr11 121843856 101420081 101420101 1 20, 12, 101420081,
> #######################
>
>
_______________________________________________
Genome maillist - [email protected]
https://lists.soe.ucsc.edu/mailman/listinfo/genome