Hi James,
you are right regarding the hg19/GrCh37 issue the problem was related to
string factor conversion to string but now is corrected.
as you said seems that BiomaRt did not formulate the query as i whish so I
finish by using a loop to get the requested result.
The following code works fine :
get.genenames.lookup <-function(paired.reads)
{
library(biomaRt)
mart = useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
gnm1<-c()
gnm2<-c()
for (i in 1:dim(paired.reads)[1])
{
chr.name1<-as.numeric(substr(paired.reads[i,2], 4,
nchar(paired.reads[i,2])))
chr.start1<- as.numeric(paired.reads[i,3])
chr.end1<-chr.start1+50
chr.name2<-as.numeric(substr(paired.reads[i,6], 4,
nchar(paired.reads[i,6])))
chr.start2<- as.numeric(paired.reads[i,7])
chr.end2<-chr.start2+50
# transform coordinates in genes names
gn.m1<-getBM(attributes= c("hgnc_symbol"),
filters=c("chromosome_name","start","end"),
values=list(chr.name1,chr.start1,chr.end1), mart=ensembl)
if (length(gn.m1[[1]])<1) {gn.m1="00000"}
gnm1<-c(gnm1,gn.m1[[1]])
gn.m2<-getBM(attributes= c("hgnc_symbol"),
filters=c("chromosome_name","start","end"),
values=list(chr.name2,chr.start2,chr.end2), mart=ensembl)
if (length(gn.m2[[1]])<1) {gn.m2="00000"}
gnm2<-c(gnm2,gn.m2[[1]])
}
print(gnm1)
print(gnm2)
return(cbind(paired.reads[,1:5],Gene1=gnm1,paired.reads[,6:9],Gene2=gnm2,paired.reads[,10]
))
}
gn.paired<-get.genenames.lookup(tttt[1:10,])
gn.paired
CHR.M1 POS.M1 DIR.M1
[1,] "8_1_3_659" "chr2" "140059033" "+"
[2,] "8_1_3_558" "chr6" "164634640" "+"
[3,] "8_1_3_228" "chr18" "32347784" "-"
[4,] "8_1_3_1261" "chr19" "30576841" "-"
[5,] "8_1_3_908" "chr10" "86479831" "-"
[6,] "8_1_3_53" "chr2" "237019866" "-"
[7,] "8_1_3_1768" "chr12" "76487174" "-"
[8,] "8_1_3_1198" "chr7" "136121868" "-"
[9,] "8_1_4_851" "chr10" "6255547" "-"
[10,] "8_1_3_1942" "chr1" "67658137" "-"
READ.M1 Gene1 CHR.M2
[1,] "NTCTCCCTCCTTGATCCAGCATTATCAGCATCTGGCCCCACACATACTCT" "00000" "chr1"
[2,] "NGAAATGGATTTTTAAGCAGAATTTAAAGAAAGTATTCCAACTTAGATTT" "00000" "chr10"
[3,] "NTGATGTGTGACCAATGCTATGGTTTGAATGTTTGTCCCTTCTAAAACTC" "DTNA" "chr18"
[4,] "NAAAATCATGCCATCCTTGATTTTTCTGATTTCTTTTATTGTTTCTATTT" "00000" "chr2"
[5,] "NCAATTATGAACATATTCAACATCTCAACATCCTCCAAAGTTTCTTCTTG" "00000" "chr19"
[6,] "GGCATGCACTACCACGCTCAGCTGGGATCGGCCCTTTCTCCTCCCAGTTA" "AGAP1" "chr9"
[7,] "NAGTATACAGGGAAATTAGTTTCCAAATTGCTAATGCTTATTATCTCTTT" "00000" "chr6"
[8,] "GCAACAATGAAAAACCGTGTGGAGAGAGATCCAGCTGTCCCACGTCTTCT" "00000" "chr7"
[9,] "TGGCCCACAGGCTGGAAAAACAAGCAAGAAGGGTGTGTCATCAACCAAGT" "PFKFB3" "chr9"
[10,] "NCACAAATTTGAAGGCTATCCTTTAGGTAATGACTGTAAATAGAGATAAT" "IL23R" "chr3"
POS.M2 DIR.M2 READ.M2
[1,] "67661301" "+"
"CATAGTGATCTCACGCTGGTATGAGGAAAGAGGTAACACCGAAAAAATGC"
[2,] "86483308" "+"
"CTTCCCTGAGTTCTGTGTGCCGCTCTTGCAAATTACTCAAATATTAAGAA"
[3,] "32351704" "+"
"TTAAATAATCCTTTTTTATCAGTAGTCTTCTCTTCAGCTTTCCGATTTTT"
[4,] "140055541" "-"
"GCTGTGTTAGGAAGAAATGTCTCAAACAACAACAAAAAGCCCTCAAGAGA"
[5,] "30579696" "+"
"CAGTGCCCAGGACAGAGGAAGCACCCTGTGAATGGTCGCCCCTGCTGGTA"
[6,] "22299997" "-"
"AAACAAGGATATCGATTTAACTACTATCTACACAGAAAGAGCACCTTCAT"
[7,] "164630958" "-"
"TAAATGAGTGTGCTGAAATCATGGACTTCTTTATCATCATAACAGACAAG"
[8,] "85529394" "-"
"CAACATGGTGTTTGAAAGTATGTATATGTCATGAAAAGACTAAATTTAGC"
[9,] "93926020" "+"
"GAAGGCTCACGGGTTCTGACCTACACTCTGAAACTGACCAGTAGTGATAA"
[10,] "23593510" "-"
"GATGTGGTTTCTGACGAGGACTCTTGGGGGCTTATAGATGGCTTCCCTCC"
Gene2
[1,] "IL23R" "3669"
[2,] "00000" "3668"
[3,] "DTNA" "3583"
[4,] "00000" "3348"
[5,] "00000" "700"
[6,] "00000" "3603"
[7,] "00000" "3540"
[8,] "00000" "3529"
[9,] "00000" "2990"
[10,] "UBE2E2" "3866"
Best Regards,
Ramzi
----------------------------------------------------------------
On Tue, Dec 8, 2009 at 4:51 PM, James W. MacDonald <[email protected]>wrote:
> Hi Ramzi,
>
> Ramzi TEMANNI wrote:
>
>> Hi James,
>> I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref and
>> that's why the output generates many genes.
>> i'm trying to find out a way to convert coordinates but seems I have to
>> align again using the GRCh37 ref.
>>
>
> I don't think that is the problem, as UCSC hg19 is based on GRCh37. I think
> there might be a problem on the Biomart server side. As far as I can tell,
> the query string produced by biomaRt is correct, but the Biomart server
> returns more than I think it should.
>
> In fact, if you do the query for each chromosome individually, you either
> get one or zero genes returned. It's only if you do the query for more than
> one chromosomal region at a time that you get these extra genes:
>
> > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[3,1],
> t.cpd[3,2], t.cpd[3,2]+50), enseembl)
> [1] hgnc_symbol
> <0 rows> (or 0-length row.names)
> > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[4,1],
> t.cpd[4,2], t.cpd[4,2]+50), enseembl)
> hgnc_symbol
> 1 MPPED2
> > getBM("hgnc_symbol", c("chromosome_name","start","end"), list(t.cpd[5,1],
> t.cpd[5,2], t.cpd[5,2]+50), enseembl)
> hgnc_symbol
> 1 REEP1
> > tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"),
> list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl)
> > dim(tmp)
> [1] 946 1
> > head(tmp)
> hgnc_symbol
> 1 ZFP91
> 2 CNTF
> 3 C11orf76
> 4 RPLP0P2
> 5 C11orf64
> 6 OR4D8P
> > grep("REEP1", tmp[,1], value=T)
> [1] "REEP1"
>
> The XMLQuery that is sent to the Biomart server looks like this (if Rhoda
> Kinsella or Steffen Durinck are following this at all).
>
> Browse[2]> xmlQuery
> [1] "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query
> virtualSchemaName = 'default' uniqueRows = '1' count = '0'
> datasetConfigVersion = '0.6' requestid= \"biomaRt\"> <Dataset name =
> 'hsapiens_gene_ensembl'><Attribute name = 'hgnc_symbol'/><Filter name =
> 'chromosome_name' value = '11,2' /><Filter name = 'start' value =
> '30576841,86479831' /><Filter name = 'end' value = '30576891,86479881'
> /></Dataset></Query>"
>
> Best,
>
> Jim
>
>
>
>
>> Regards,
>> Ramzi
>> ----------------------------------------------------------------
>>
>>
>>
>> On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI
>> <[email protected]<mailto:
>> [email protected]>> wrote:
>>
>> Hi James,
>>
>> Thanks for your reply,
>> As i have short reads of 50b, I've modified the code accordigly to
>> your suggestion:
>> gn.m1<-getBM(attributes= c("hgnc_symbol"),
>> filters=c("chromosome_name","
>> start",*"end"*),
>>
>>
>> values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.numeric(t.cpd[1:10,2])+50*),
>> mart=ensembl)
>>
>> But still having more genes than expected:
>>
>> hgnc_symbol
>> 1 UBE2G1
>> 2 DUSP5P
>> 3 HIST3H2BA
>> 4 ZNF847P
>> 5 ACTBP11
>> 6 ZC3H11B
>> ........
>> 8488 PPP2R2C
>> 8489 WFS1
>> 8490 SNORD73A
>> 8491 SNORA24
>> 8492 SNORA26
>>
>> Did I miss something ?
>>
>> Thanks for the remark regarding the position, you are right ! I did
>> not notice that, have to get back to the bowtie alignment file and
>> see what is the reason behind that. I'm using bowtie with the last
>> hg19 ref.
>>
>> Thanks again for your comment.
>> Best Regards,
>> Ramzi
>>
>> ----------------------------------------------------------------
>>
>>
>>
>> On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> Hi Ramzi,
>>
>>
>> Ramzi TEMANNI wrote:
>>
>> Hi,
>> I want to extract the gene names knowing the chromosome and
>> the position for
>> each genes:
>>
>> t.cpd[1:10,1:2]
>>
>> CHR.M1 POS.M1
>> [1,] "12" "140059033"
>> [2,] "19" "164634640"
>> [3,] "10" "32347784"
>> [4,] "11" "30576841"
>> [5,] "2" "86479831"
>> [6,] "12" "237019866"
>> [7,] "4" "76487174"
>> [8,] "20" "136121868"
>> [9,] "2" "6255547"
>> [10,] "1" "67658137"
>>
>> i use the following commands:
>> library(biomaRt)
>> mart = useMart("ensembl")
>> ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
>> gn.m1<-getBM(attributes= c("hgnc_symbol"),
>> filters=c("chromosome_name","start"),
>> values=list(t.cpd[1:10,1],t.cpd[1:10,2]), mart=ensembl)
>>
>> I'm expecting having a list of 10 genes names, but instead i
>> get 8652 genes:
>> hgnc_symbol
>> 1 OR2M1P
>> 2 OR2L1P
>> 3 HSD17B7P1
>> 4 OR14L1P
>> 5 OR2W5
>> 6 VN1R5
>> ......
>> 8649 WFS1
>> 8650 SNORD73A
>> 8651 SNORA24
>> 8652 SNORA26
>>
>> Did I miss something ?
>>
>>
>> Yes. You are giving the start position, but not the end. Without
>> explicitly telling the Biomart server where to stop looking for
>> genes, where do you think it will stop by default?
>>
>> Also, several of your coordinates are nonsensical. For instance,
>> chr12 is only 133851859 bases long, chr20 is 63025520 bases
>> long, etc.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> Thanks in advance for your help
>>
>> Best Regards,
>> Ramzi
>>
>>
>> ----------------------------------------------------------------
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> <mailto:[email protected]>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> -- James W. MacDonald, M.S.
>> Biostatistician
>> Douglas Lab
>> University of Michigan
>> Department of Human Genetics
>> 5912 Buhl
>> 1241 E. Catherine St.
>> Ann Arbor MI 48109-5618
>> 734-615-7826
>> **********************************************************
>> Electronic Mail is not secure, may not be read every day, and
>> should not be used for urgent or sensitive issues
>>
>>
>>
>>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be
> used for urgent or sensitive issues
>
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing