-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Interesting idea, but unfortunately it wouldn't work, as hash codes are
not unique. Two different rows of a dataset could produce the same hash,
in which case distinct results would be treated as duplicates and all
but the first row sharing that hash would be dropped. Admittedly the
chances of this happening are slim, as the range of available hash codes
is very large, but the possibility exists and I'm sure it wouldn't be
long before users complained about mysteriously missing results!
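
To put a rough number on "slim", here is a small sketch of the usual
birthday-bound estimate. It assumes a 128-bit digest (MD5, as in Bob's
example) and rows that collide only by accident; the function name and
the row counts are illustrative and not part of BioMart.

```perl
use strict;
use warnings;

# Birthday-bound approximation: for n rows and a b-bit digest, the chance
# that at least two distinct rows share a digest is roughly n^2 / 2^(b+1).
# The approximation only holds while the result is much smaller than 1.
sub collision_probability {
    my ($n_rows, $bits) = @_;
    return ($n_rows ** 2) / (2 ** ($bits + 1));
}

# MD5 digests are 128 bits wide; the row counts are illustrative only.
for my $n (1e6, 1e9) {
    printf "%.0e rows: p ~ %.1e\n", $n, collision_probability($n, 128);
}
```

For a million rows this gives a probability of about 1.5e-27, so the
risk is tiny but, as noted above, not zero.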

The only way to get truly distinct rows is either to record every
attribute of every row in an in-memory hash and do a unique sort on it
within the API, which is memory-inefficient, or to run a SELECT DISTINCT
query within the database, which is time-inefficient.
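
As a minimal sketch of the first option (not the actual API code), the
filter below keys the hash on the full contents of each row rather than
on a digest, so rows are dropped only when every attribute matches. The
function name and the separator character are assumptions made for the
example.

```perl
use strict;
use warnings;

# Exact de-duplication: the hash key is the whole row, so two rows are
# treated as duplicates only when every attribute matches. Memory use
# grows with the number of distinct rows, which is the cost noted above.
sub unique_rows {
    my (@rows) = @_;    # each row is an array ref of attribute values
    my (%seen, @unique);
    for my $row (@rows) {
        # Join on a separator assumed never to occur in the data
        # (hypothetical choice); undef is treated as an empty string.
        my $key = join "\x1F", map { defined $_ ? $_ : '' } @$row;
        push @unique, $row unless $seen{$key}++;
    }
    return @unique;     # first occurrences, in their original order
}

my @rows = (
    [ 'ENSG00000102010', 'ENST00000342014' ],
    [ 'ENSG00000102010', 'ENST00000342014' ],
    [ 'ENSG00000102010', 'ENST00000348343' ],
);
print join("\t", @$_), "\n" for unique_rows(@rows);
```

Every distinct row stays in memory until the output is finished, which
is where the inefficiency comes from on large result sets.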

cheers,
Richard

Bob MacCallum wrote:
> 
> Hi Arek,
> 
> Those future developments sound interesting.  Thanks for the info.
> 
> 
> Instead of the DISTINCT, what about a simple hash-based filter implemented
> just before outputting the data, e.g.
> 
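> use Digest::MD5 qw(md5_hex);
> my %seen;   # counts how often each row digest has been seen
> foreach my $row (@rows_to_output) {
>   # print a row only the first time its MD5 digest appears
>   print "$row\n" unless ($seen{md5_hex($row)}++);
> }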
> 
> in this example $row has to be a text string (not array ref), of course
> 
> of course, "sort -u" on a text mart export does almost the same
> 
> cheers,
> Bob.
> 
> Arek Kasprzyk writes:
>  > 
>  > On 26 Jan 2007, at 15:12, Bob MacCallum wrote:
>  > 
>  > >
>  > > Hi,
>  > >
>  > > While we're talking about the results section.  I've wondered if a
>  > > "unique records only" option could be provided - to the average
>  > > biologist user, the following query brings back duplicated genes
>  > > (because the PFAM domains are features of transcripts, which I have
>  > > deselected from the output attributes).
>  > >
>  > >
>  > > <Query  virtualSchemaName = "default" Header = "1" count = "" softwareVersion = "0.5" >
>  > >                  
>  > >          <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
>  > >                  <Attribute name = "ensembl_gene_id" />
>  > >                  <Filter name = "pfam" value = "PF00169"/>
>  > >          </Dataset>
>  > > </Query>
>  > >
>  > >
>  > > However, we can leave the default gene + transcript attributes and
>  > > instead provide two PFAM ids (that I know are sometimes in the same
>  > > protein).  Then the results again contain some duplicate records
>  > > (although adding the PFAM id output attribute would fix this of
>  > > course).
>  > >
>  > >
>  > > <Query  virtualSchemaName = "default" Header = "1" count = "" softwareVersion = "0.5" >
>  > >                  
>  > >          <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
>  > >                  <Attribute name = "ensembl_gene_id" />
>  > >                  <Attribute name = "ensembl_transcript_id" />
>  > >                  <Filter name = "pfam" value = "PF00169,PF00017"/>
>  > >          </Dataset>
>  > > </Query>
>  > >
>  > > snippet:
>  > > ENSG00000102010 ENST00000342014
>  > > ENSG00000102010 ENST00000342014
>  > > ENSG00000102010 ENST00000348343
>  > > ENSG00000102010 ENST00000348343
>  > > ENSG00000102010 ENST00000357607
>  > > ENSG00000102010 ENST00000357607
>  > > ENSG00000102010 ENST00000380391
>  > > ENSG00000102010 ENST00000380391
>  > >
>  > >
>  > > I note that the gene count ("count" button) is always correct however.
>  > >
>  > >
>  > > What do people think?
>  > >
>  > > cheers,
>  > > Bob.
>  > >
>  > 
>  > Hi Bob,
>  > yes, this request has come up several times and finally we need to give
>  > in :)
>  > 
>  > As you correctly pointed out, transcript-level rather than gene-level
>  > annotation is a feature of Ensembl data, and as Ensembl is likely to
>  > stick with this for the foreseeable future we will have to add a 'fix'
>  > on our side. Unfortunately this cannot be as simple as 'distinct', as
>  > that can have grave consequences for performance on large datasets,
>  > in particular with certain combinations of filters. We will, however,
>  > be able to provide an alteration to the mart structure such that it
>  > artificially provides such annotation at a higher level, which will
>  > be equivalent to 'distinct' but without the performance hit. We are
>  > now in the process of implementing this in MBuilder, so future Ensembl
>  > mart releases should have this 'fix'. This will of course work with
>  > other 'non-Ensembl' data as well
>  > 
>  > 
>  > a.
>  > 
>  > 
>  > 
>  > >
>  > >
>  > > Arek Kasprzyk writes:
>  > >>
>  > >> On 26 Jan 2007, at 14:43, David Croft wrote:
>  > >>
>  > >>> Hi Arek,
>  > >>>
>  > >>>> Sounds like a good suggestion, we can consider that. At the moment
>  > >>>> you can only ask for the first 10, 20, 50 .... 200 or the whole lot
>  > >>>> but not the pagination that (I think) you have in mind
>  > >>>
>  > >>> Yes, that's right - similar to what you get when you go to Google.
>  > >>> It would be kind of cool if the page displayed the total count of
>  > >>> results and told you that you are on page 3 of 28 (or whatever)
>  > >>> and gave you buttons to go back a page or forward a page.
>  > >>>
>  > >>
>  > >> not sure about this :) we could certainly add 'next' 'back' or
>  > >> equivalents, but the total count would be a bit more problematic.
>  > >> We do not have all the results during preview yet, and the total
>  > >> count tends to be expensive, so we do not do it by default. Let us
>  > >> try to think about at least some of it
>  > >>
>  > >> a.
>  > >>
>  > >>
>  > >>> Cheers,
>  > >>>
>  > >>> David.
>  > >>>
>  > >>>
>  > >>
>  > >>
>  > >> ----------------------------------------------------------------------
>  > >> Arek Kasprzyk
>  > >> EMBL-European Bioinformatics Institute.
>  > >> Wellcome Trust Genome Campus, Hinxton,
>  > >> Cambridge CB10 1SD, UK.
>  > >> Tel: +44-(0)1223-494606
>  > >> Fax: +44-(0)1223-494468
>  > >> ----------------------------------------------------------------------
>  > >>
>  > >>
>  > >>
>  > >
>  > > -- 
>  > > Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
>  > > Division of Cell and Molecular Biology | Imperial College London |
>  > > Phone +442075941945 | Email [EMAIL PROTECTED]
>  > >
>  > 
>  > 
>  > 
>  > 
>  > 
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFujSn4C5LeMEKA/QRAqeIAJ0ZvkC2tPuRXM8omjiEYZLSGvENFACeL1Aa
+XaHUGnZQuwsSp3qSeaJdaY=
=DUwz
-----END PGP SIGNATURE-----
