Re: [Bioc-sig-seq] unexpected genes names list using getBM{biomaRt}

Ramzi TEMANNI Tue, 08 Dec 2009 08:52:36 -0800

Hi James,
you are right regarding the hg19/GrCh37 issue the problem was related to
string factor conversion to string but now is corrected.
as you said seems that BiomaRt did not formulate the query as i whish so I
finish by using a loop to get the requested result.
The following code works fine :
get.genenames.lookup <-function(paired.reads)
{
library(biomaRt)
mart = useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
gnm1<-c()
gnm2<-c()
for (i in 1:dim(paired.reads)[1])
{
chr.name1<-as.numeric(substr(paired.reads[i,2], 4,
nchar(paired.reads[i,2])))
chr.start1<- as.numeric(paired.reads[i,3])
chr.end1<-chr.start1+50
chr.name2<-as.numeric(substr(paired.reads[i,6], 4,
nchar(paired.reads[i,6])))
chr.start2<- as.numeric(paired.reads[i,7])
chr.end2<-chr.start2+50



# transform coordinates in genes names

gn.m1<-getBM(attributes= c("hgnc_symbol"),
       filters=c("chromosome_name","start","end"),
       values=list(chr.name1,chr.start1,chr.end1), mart=ensembl)
if (length(gn.m1[[1]])<1) {gn.m1="00000"}
gnm1<-c(gnm1,gn.m1[[1]])
gn.m2<-getBM(attributes= c("hgnc_symbol"),
       filters=c("chromosome_name","start","end"),
       values=list(chr.name2,chr.start2,chr.end2), mart=ensembl)
if (length(gn.m2[[1]])<1) {gn.m2="00000"}
gnm2<-c(gnm2,gn.m2[[1]])
}


print(gnm1)

print(gnm2)

return(cbind(paired.reads[,1:5],Gene1=gnm1,paired.reads[,6:9],Gene2=gnm2,paired.reads[,10]
))

}
gn.paired<-get.genenames.lookup(tttt[1:10,])

gn.paired
                   CHR.M1  POS.M1      DIR.M1
 [1,] "8_1_3_659"  "chr2"  "140059033" "+"
 [2,] "8_1_3_558"  "chr6"  "164634640" "+"
 [3,] "8_1_3_228"  "chr18" "32347784"  "-"
 [4,] "8_1_3_1261" "chr19" "30576841"  "-"
 [5,] "8_1_3_908"  "chr10" "86479831"  "-"
 [6,] "8_1_3_53"   "chr2"  "237019866" "-"
 [7,] "8_1_3_1768" "chr12" "76487174"  "-"
 [8,] "8_1_3_1198" "chr7"  "136121868" "-"
 [9,] "8_1_4_851"  "chr10" "6255547"   "-"
[10,] "8_1_3_1942" "chr1"  "67658137"  "-"
      READ.M1                                              Gene1    CHR.M2
 [1,] "NTCTCCCTCCTTGATCCAGCATTATCAGCATCTGGCCCCACACATACTCT" "00000"  "chr1"
 [2,] "NGAAATGGATTTTTAAGCAGAATTTAAAGAAAGTATTCCAACTTAGATTT" "00000"  "chr10"
 [3,] "NTGATGTGTGACCAATGCTATGGTTTGAATGTTTGTCCCTTCTAAAACTC" "DTNA"   "chr18"
 [4,] "NAAAATCATGCCATCCTTGATTTTTCTGATTTCTTTTATTGTTTCTATTT" "00000"  "chr2"
 [5,] "NCAATTATGAACATATTCAACATCTCAACATCCTCCAAAGTTTCTTCTTG" "00000"  "chr19"
 [6,] "GGCATGCACTACCACGCTCAGCTGGGATCGGCCCTTTCTCCTCCCAGTTA" "AGAP1"  "chr9"
 [7,] "NAGTATACAGGGAAATTAGTTTCCAAATTGCTAATGCTTATTATCTCTTT" "00000"  "chr6"
 [8,] "GCAACAATGAAAAACCGTGTGGAGAGAGATCCAGCTGTCCCACGTCTTCT" "00000"  "chr7"
 [9,] "TGGCCCACAGGCTGGAAAAACAAGCAAGAAGGGTGTGTCATCAACCAAGT" "PFKFB3" "chr9"
[10,] "NCACAAATTTGAAGGCTATCCTTTAGGTAATGACTGTAAATAGAGATAAT" "IL23R"  "chr3"
      POS.M2      DIR.M2 READ.M2

 [1,] "67661301"  "+"
"CATAGTGATCTCACGCTGGTATGAGGAAAGAGGTAACACCGAAAAAATGC"
 [2,] "86483308"  "+"
"CTTCCCTGAGTTCTGTGTGCCGCTCTTGCAAATTACTCAAATATTAAGAA"
 [3,] "32351704"  "+"
"TTAAATAATCCTTTTTTATCAGTAGTCTTCTCTTCAGCTTTCCGATTTTT"
 [4,] "140055541" "-"
"GCTGTGTTAGGAAGAAATGTCTCAAACAACAACAAAAAGCCCTCAAGAGA"
 [5,] "30579696"  "+"
"CAGTGCCCAGGACAGAGGAAGCACCCTGTGAATGGTCGCCCCTGCTGGTA"
 [6,] "22299997"  "-"
"AAACAAGGATATCGATTTAACTACTATCTACACAGAAAGAGCACCTTCAT"
 [7,] "164630958" "-"
"TAAATGAGTGTGCTGAAATCATGGACTTCTTTATCATCATAACAGACAAG"
 [8,] "85529394"  "-"
"CAACATGGTGTTTGAAAGTATGTATATGTCATGAAAAGACTAAATTTAGC"
 [9,] "93926020"  "+"
"GAAGGCTCACGGGTTCTGACCTACACTCTGAAACTGACCAGTAGTGATAA"
[10,] "23593510"  "-"
"GATGTGGTTTCTGACGAGGACTCTTGGGGGCTTATAGATGGCTTCCCTCC"
      Gene2
 [1,] "IL23R"  "3669"
 [2,] "00000"  "3668"
 [3,] "DTNA"   "3583"
 [4,] "00000"  "3348"
 [5,] "00000"  "700"
 [6,] "00000"  "3603"
 [7,] "00000"  "3540"
 [8,] "00000"  "3529"
 [9,] "00000"  "2990"
[10,] "UBE2E2" "3866"

Best Regards,
Ramzi
----------------------------------------------------------------


On Tue, Dec 8, 2009 at 4:51 PM, James W. MacDonald <[email protected]>wrote:

> Hi Ramzi,
>
> Ramzi TEMANNI wrote:
>
>> Hi James,
>> I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref and
>> that's why the output generates many genes.
>> i'm trying to find out a way to convert coordinates but seems I have to
>> align again using the GRCh37 ref.
>>
>
> I don't think that is the problem, as UCSC hg19 is based on GRCh37. I think
> there might be a problem on the Biomart server side. As far as I can tell,
> the query string produced by biomaRt is correct, but the Biomart server
> returns more than I think it should.
>
> In fact, if you do the query for each chromosome individually, you either
> get one or zero genes returned. It's only if you do the query for more than
> one chromosomal region at a time that you get these extra genes:
>
> > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[3,1],
> t.cpd[3,2], t.cpd[3,2]+50), enseembl)
> [1] hgnc_symbol
> <0 rows> (or 0-length row.names)
> > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[4,1],
> t.cpd[4,2], t.cpd[4,2]+50), enseembl)
>  hgnc_symbol
> 1      MPPED2
> > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[5,1],
> t.cpd[5,2], t.cpd[5,2]+50), enseembl)
>  hgnc_symbol
> 1       REEP1
> > tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"),
> list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl)
> > dim(tmp)
> [1] 946   1
> > head(tmp)
>  hgnc_symbol
> 1       ZFP91
> 2        CNTF
> 3    C11orf76
> 4     RPLP0P2
> 5    C11orf64
> 6      OR4D8P
> > grep("REEP1", tmp[,1], value=T)
> [1] "REEP1"
>
> The XMLQuery that is sent to the Biomart server looks like this (if Rhoda
> Kinsella or Steffen Durinck are following this at all).
>
> Browse[2]> xmlQuery
> [1] "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query
> virtualSchemaName = 'default' uniqueRows = '1' count = '0'
> datasetConfigVersion = '0.6' requestid= \"biomaRt\"> <Dataset name =
> 'hsapiens_gene_ensembl'><Attribute name = 'hgnc_symbol'/><Filter name =
> 'chromosome_name' value = '11,2' /><Filter name = 'start' value =
> '30576841,86479831' /><Filter name = 'end' value = '30576891,86479881'
> /></Dataset></Query>"
>
> Best,
>
> Jim
>
>
>
>
>> Regards,
>> Ramzi
>> ----------------------------------------------------------------
>>
>>
>>
>> On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI 
>> <[email protected]<mailto:
>> [email protected]>> wrote:
>>
>>    Hi James,
>>
>>    Thanks for your reply,
>>    As i have short reads of 50b, I've modified the code accordigly to
>>    your suggestion:
>>    gn.m1<-getBM(attributes= c("hgnc_symbol"),
>>           filters=c("chromosome_name","
>>    start",*"end"*),
>>
>>  
>> values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.numeric(t.cpd[1:10,2])+50*),
>>    mart=ensembl)
>>
>>    But still having more genes than expected:
>>
>>    hgnc_symbol
>>    1      UBE2G1
>>    2      DUSP5P
>>    3   HIST3H2BA
>>    4     ZNF847P
>>    5     ACTBP11
>>    6     ZC3H11B
>>    ........
>>    8488     PPP2R2C
>>    8489        WFS1
>>    8490    SNORD73A
>>    8491     SNORA24
>>    8492     SNORA26
>>
>>    Did I miss something  ?
>>
>>    Thanks for the remark regarding the position, you are right ! I did
>>    not notice that, have to get back to the bowtie alignment file and
>>    see what is the reason behind that. I'm using bowtie with the last
>>    hg19 ref.
>>
>>    Thanks again for your comment.
>>    Best Regards,
>>    Ramzi
>>
>>    ----------------------------------------------------------------
>>
>>
>>
>>    On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald
>>    <[email protected] <mailto:[email protected]>> wrote:
>>
>>        Hi Ramzi,
>>
>>
>>        Ramzi TEMANNI wrote:
>>
>>            Hi,
>>            I want to extract the gene names knowing the chromosome and
>>            the position for
>>            each genes:
>>
>>                t.cpd[1:10,1:2]
>>
>>                 CHR.M1 POS.M1
>>             [1,] "12"   "140059033"
>>             [2,] "19"   "164634640"
>>             [3,] "10"   "32347784"
>>             [4,] "11"   "30576841"
>>             [5,] "2"    "86479831"
>>             [6,] "12"   "237019866"
>>             [7,] "4"    "76487174"
>>             [8,] "20"   "136121868"
>>             [9,] "2"    "6255547"
>>            [10,] "1"    "67658137"
>>
>>            i use the following commands:
>>            library(biomaRt)
>>            mart = useMart("ensembl")
>>            ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
>>            gn.m1<-getBM(attributes= c("hgnc_symbol"),
>>                  filters=c("chromosome_name","start"),
>>                  values=list(t.cpd[1:10,1],t.cpd[1:10,2]), mart=ensembl)
>>
>>            I'm expecting having a list of 10 genes names, but instead i
>>            get 8652 genes:
>>            hgnc_symbol
>>            1      OR2M1P
>>            2      OR2L1P
>>            3   HSD17B7P1
>>            4     OR14L1P
>>            5       OR2W5
>>            6       VN1R5
>>            ......
>>            8649        WFS1
>>            8650    SNORD73A
>>            8651     SNORA24
>>            8652     SNORA26
>>
>>            Did I miss something ?
>>
>>
>>        Yes. You are giving the start position, but not the end. Without
>>        explicitly telling the Biomart server where to stop looking for
>>        genes, where do you think it will stop by default?
>>
>>        Also, several of your coordinates are nonsensical. For instance,
>>        chr12 is only 133851859 bases long, chr20 is 63025520 bases
>>        long, etc.
>>
>>        Best,
>>
>>        Jim
>>
>>
>>
>>            Thanks in advance for your help
>>
>>            Best Regards,
>>            Ramzi
>>
>>
>>  ----------------------------------------------------------------
>>
>>                   [[alternative HTML version deleted]]
>>
>>            _______________________________________________
>>            Bioc-sig-sequencing mailing list
>>            [email protected]
>>            <mailto:[email protected]>
>>
>>            https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>        --        James W. MacDonald, M.S.
>>        Biostatistician
>>        Douglas Lab
>>        University of Michigan
>>        Department of Human Genetics
>>        5912 Buhl
>>        1241 E. Catherine St.
>>        Ann Arbor MI 48109-5618
>>        734-615-7826
>>        **********************************************************
>>        Electronic Mail is not secure, may not be read every day, and
>>        should not be used for urgent or sensitive issues
>>
>>
>>
>>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be
> used for urgent or sensitive issues
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] unexpected genes names list using getBM{biomaRt}

Reply via email to