Hi Arek,
Those future developments sound interesting. Thanks for the info.
Instead of the DISTINCT, what about a simple hash-based filter implemented
just before outputting the data, e.g.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen;
    # print each row only the first time its digest is seen
    foreach my $row (@rows_to_output) {
        print "$row\n" unless $seen{md5_hex($row)}++;
    }
In this example $row has to be a text string (not an array ref), of course.
Then again, "sort -u" on a text mart export does almost the same (though it
also sorts the rows).
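If the rows are array refs rather than strings, a minimal sketch of the same
idea might look like the following. It is only an illustration: unique_rows
is a made-up helper name, the sample rows are taken from the snippet quoted
below, and it assumes no field contains a tab, because the fields are joined
with tabs to build the hash key. It also uses the joined string itself as the
key instead of an MD5 digest, which is simpler but costs more memory for long
rows.

```perl
use strict;
use warnings;

# Return the first occurrence of each row, preserving input order.
# Assumption: every row is an array ref of scalar field values,
# and no field contains a tab character.
sub unique_rows {
    my @rows = @_;
    my %seen;
    return grep {
        # Build a string key from the fields; undef fields become ''.
        my $key = join "\t", map { defined $_ ? $_ : '' } @$_;
        !$seen{$key}++;
    } @rows;
}

# Sample data (gene id, transcript id) with duplicated records.
my @rows_to_output = (
    [ 'ENSG00000102010', 'ENST00000342014' ],
    [ 'ENSG00000102010', 'ENST00000342014' ],
    [ 'ENSG00000102010', 'ENST00000348343' ],
    [ 'ENSG00000102010', 'ENST00000348343' ],
);

print join("\t", @$_), "\n" for unique_rows(@rows_to_output);
```

Unlike "sort -u", this keeps the rows in their original order, so a header
line would stay at the top.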
cheers,
Bob.
Arek Kasprzyk writes:
>
> On 26 Jan 2007, at 15:12, Bob MacCallum wrote:
>
> >
> > Hi,
> >
> > While we're talking about the results section, I've wondered if a
> > "unique records only" option could be provided. To the average
> > biologist user, the following query brings back duplicated genes
> > (because the PFAM domains are features of transcripts, which I have
> > deselected from the output attributes).
> >
> >
> > <Query virtualSchemaName = "default" Header = "1" count = ""
> > softwareVersion = "0.5" >
> >
> > <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
> > <Attribute name = "ensembl_gene_id" />
> > <Filter name = "pfam" value = "PF00169"/>
> > </Dataset>
> > </Query>
> >
> >
> > However, we can leave the default gene + transcript attributes and
> > instead provide two PFAM ids (that I know are sometimes in the same
> > protein). Then the results again contain some duplicate records
> > (although adding the PFAM id output attribute would fix this of course).
> >
> >
> > <Query virtualSchemaName = "default" Header = "1" count = ""
> > softwareVersion = "0.5" >
> >
> > <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
> > <Attribute name = "ensembl_gene_id" />
> > <Attribute name = "ensembl_transcript_id" />
> > <Filter name = "pfam" value = "PF00169,PF00017"/>
> > </Dataset>
> > </Query>
> >
> > snippet:
> > ENSG00000102010 ENST00000342014
> > ENSG00000102010 ENST00000342014
> > ENSG00000102010 ENST00000348343
> > ENSG00000102010 ENST00000348343
> > ENSG00000102010 ENST00000357607
> > ENSG00000102010 ENST00000357607
> > ENSG00000102010 ENST00000380391
> > ENSG00000102010 ENST00000380391
> >
> >
> > I note that the gene count ("count" button) is always correct however.
> >
> >
> > What do people think?
> >
> > cheers,
> > Bob.
> >
>
> Hi Bob,
> yes, this request has come up several times and finally we need to give
> in :)
>
> As you correctly pointed out, the transcript-level rather than gene-level
> annotation is a feature of Ensembl data, and as Ensembl is likely to stick
> with this for the foreseeable future, we will have to add a 'fix' on our
> side. Unfortunately this cannot be as simple as 'distinct', as this can
> have grave consequences for performance on large datasets, in particular
> with certain combinations of filters. We will, however, be able to provide
> an alteration to the mart structure such that it will artificially provide
> such annotation at a higher level, which will be the equivalent of
> 'distinct' but without a performance hit. We are now in the process of
> implementing this in MBuilder, so future Ensembl mart releases should have
> this 'fix'. This will of course work with other 'non-Ensembl' data as well.
>
>
> a.
>
>
>
> >
> >
> > Arek Kasprzyk writes:
> >>
> >> On 26 Jan 2007, at 14:43, David Croft wrote:
> >>
> >>> Hi Arek,
> >>>
> >>>> Sounds like a good suggestion, we can consider that. At the moment
> >>>> you can only ask for the first 10, 20, 50 .... 200 or the whole lot,
> >>>> but not the pagination that (I think) you have in mind
> >>>
> >>> Yes, that's right - similar to what you get when you go to Google.
> >>> It would be kind of cool if the page displayed the total count of
> >>> results and told you that you are on page 3 of 28 (or whatever)
> >>> and gave you buttons to go back a page or forward a page.
> >>>
> >>
> >> not sure about this :) we could certainly add 'next' 'back' or
> >> equivalents, but the total count would be a bit more problematic. We do
> >> not have all the results during preview yet, and the total count tends
> >> to be expensive, so we do not do it by default. Let us try to think
> >> about at least some of it
> >>
> >> a.
> >>
> >>
> >>> Cheers,
> >>>
> >>> David.
> >>>
> >>>
> >>
> >>
> >> ----------------------------------------------------------------------
> >> Arek Kasprzyk
> >> EMBL-European Bioinformatics Institute.
> >> Wellcome Trust Genome Campus, Hinxton,
> >> Cambridge CB10 1SD, UK.
> >> Tel: +44-(0)1223-494606
> >> Fax: +44-(0)1223-494468
> >> ----------------------------------------------------------------------
> >>
> >>
> >>
> >
> > --
> > Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
> > Division of Cell and Molecular Biology | Imperial College London |
> > Phone +442075941945 | Email [EMAIL PROTECTED]
> >
>
--
Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
Division of Cell and Molecular Biology | Imperial College London |
Phone +442075941945 | Email [EMAIL PROTECTED]