Re: [Bioc-sig-seq] unexpected genes names list using getBM{biomaRt}

James W. MacDonald Tue, 08 Dec 2009 07:53:04 -0800

Hi Ramzi,

Ramzi TEMANNI wrote:

Hi James,
I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref andthat's why the output generates many genes.i'm trying to find out a way to convert coordinates but seems I have toalign again using the GRCh37 ref.

I don't think that is the problem, as UCSC hg19 is based on GRCh37. Ithink there might be a problem on the Biomart server side. As far as Ican tell, the query string produced by biomaRt is correct, but theBiomart server returns more than I think it should.

In fact, if you do the query for each chromosome individually, youeither get one or zero genes returned. It's only if you do the query formore than one chromosomal region at a time that you get these extra genes:

> getBM("hgnc_symbol", c("chromosome_name","start","end"),list(t.cpd[3,1], t.cpd[3,2], t.cpd[3,2]+50), enseembl)

[1] hgnc_symbol
<0 rows> (or 0-length row.names)

> getBM("hgnc_symbol", c("chromosome_name","start","end"),list(t.cpd[4,1], t.cpd[4,2], t.cpd[4,2]+50), enseembl)

  hgnc_symbol
1      MPPED2

> getBM("hgnc_symbol", c("chromosome_name","start","end"),list(t.cpd[5,1], t.cpd[5,2], t.cpd[5,2]+50), enseembl)

  hgnc_symbol
1       REEP1

> tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"),list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl)

> dim(tmp)
[1] 946   1
> head(tmp)
  hgnc_symbol
1       ZFP91
2        CNTF
3    C11orf76
4     RPLP0P2
5    C11orf64
6      OR4D8P
> grep("REEP1", tmp[,1], value=T)
[1] "REEP1"

The XMLQuery that is sent to the Biomart server looks like this (ifRhoda Kinsella or Steffen Durinck are following this at all).


Browse[2]> xmlQuery

[1] "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><QueryvirtualSchemaName = 'default' uniqueRows = '1' count = '0'datasetConfigVersion = '0.6' requestid= \"biomaRt\"> <Dataset name ='hsapiens_gene_ensembl'><Attribute name = 'hgnc_symbol'/><Filter name ='chromosome_name' value = '11,2' /><Filter name = 'start' value ='30576841,86479831' /><Filter name = 'end' value = '30576891,86479881'/></Dataset></Query>"


Best,

Jim


Regards,
Ramzi
----------------------------------------------------------------

On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI <[email protected]<mailto:[email protected]>> wrote:


    Hi James,

    Thanks for your reply,
    As i have short reads of 50b, I've modified the code accordigly to
    your suggestion:
    gn.m1<-getBM(attributes= c("hgnc_symbol"),
           filters=c("chromosome_name","
    start",*"end"*),

values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.numeric(t.cpd[1:10,2])+50*),

    mart=ensembl)

    But still having more genes than expected:

    hgnc_symbol
    1      UBE2G1
    2      DUSP5P
    3   HIST3H2BA
    4     ZNF847P
    5     ACTBP11
    6     ZC3H11B
    ........
    8488     PPP2R2C
    8489        WFS1
    8490    SNORD73A
    8491     SNORA24
    8492     SNORA26

    Did I miss something  ?

    Thanks for the remark regarding the position, you are right ! I did
    not notice that, have to get back to the bowtie alignment file and
    see what is the reason behind that. I'm using bowtie with the last
    hg19 ref.

Thanks again for your comment.

    Best Regards,
    Ramzi

    ----------------------------------------------------------------



    On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald
    <[email protected] <mailto:[email protected]>> wrote:

        Hi Ramzi,


        Ramzi TEMANNI wrote:

            Hi,
            I want to extract the gene names knowing the chromosome and
            the position for
            each genes:

                t.cpd[1:10,1:2]

                 CHR.M1 POS.M1
             [1,] "12"   "140059033"
             [2,] "19"   "164634640"
             [3,] "10"   "32347784"
             [4,] "11"   "30576841"
             [5,] "2"    "86479831"
             [6,] "12"   "237019866"
             [7,] "4"    "76487174"
             [8,] "20"   "136121868"
             [9,] "2"    "6255547"
            [10,] "1"    "67658137"

            i use the following commands:
            library(biomaRt)
            mart = useMart("ensembl")
            ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
            gn.m1<-getBM(attributes= c("hgnc_symbol"),
                  filters=c("chromosome_name","start"),
                  values=list(t.cpd[1:10,1],t.cpd[1:10,2]), mart=ensembl)

            I'm expecting having a list of 10 genes names, but instead i
            get 8652 genes:
            hgnc_symbol
            1      OR2M1P
            2      OR2L1P
            3   HSD17B7P1
            4     OR14L1P
            5       OR2W5
            6       VN1R5
            ......
            8649        WFS1
            8650    SNORD73A
            8651     SNORA24
            8652     SNORA26

            Did I miss something ?


        Yes. You are giving the start position, but not the end. Without
        explicitly telling the Biomart server where to stop looking for
        genes, where do you think it will stop by default?

        Also, several of your coordinates are nonsensical. For instance,
        chr12 is only 133851859 bases long, chr20 is 63025520 bases
        long, etc.

        Best,

        Jim



            Thanks in advance for your help

            Best Regards,
            Ramzi

            ----------------------------------------------------------------

                   [[alternative HTML version deleted]]

            _______________________________________________
            Bioc-sig-sequencing mailing list
            [email protected]
            <mailto:[email protected]>
            https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

--James W. MacDonald, M.S.

        Biostatistician
        Douglas Lab
        University of Michigan
        Department of Human Genetics
        5912 Buhl
        1241 E. Catherine St.
        Ann Arbor MI 48109-5618
        734-615-7826
        **********************************************************
        Electronic Mail is not secure, may not be read every day, and
        should not be used for urgent or sensitive issues


--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************

Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] unexpected genes names list using getBM{biomaRt}

Reply via email to