Re: [mart-dev] BioMart output

Arek Kasprzyk Tue, 27 Feb 2007 06:43:19 -0800


On 27 Feb 2007, at 14:11, Ewan Birney wrote:

On 27 Feb 2007, at 14:04, Arek Kasprzyk wrote:
On 22 Feb 2007, at 15:48, Rosienne wrote:
Hi,
a few weeks ago I was attending an Open Door Workshop at the Sanger. I had occasion to speak to one of your team and mention a couple of problems we regularly encounter when using biomart. I was advised to post to this address.
I, and my colleagues, use biomart to output gene related information for lists of microarray feature IDs. Even though we untick the ensembl transcript ID box we still get an output for each transcript. In some cases, where genes have 9 documented transcripts we get 9 perfectly replicated entries. When dealing with lists of over a thousand genes each time this gets very confusing and generally makes excel stop responding!
We wonder if in future re-works of the tool a gene specific rather than a transcript specific output can be made available. We are aware that for people working on only one, or a handful of genes, getting all the transcript specific information is essential. However, it would make life a lot easier for scientists like us who handle large gene lists if we could specifically select to obtain only gene specific outputs, 1 gene = 1 row of output.
Dear Rosienne,
This particular problem is really specific to Ensembl data. Ensembl annotates on a per transcript rather than on a per gene basis while most people 'outside' seem to want the latter :) The ideal solution would be if 'per gene' annotation was provided at the source of Ensembl annotation but failing that we are now looking for ways of simply altering the output such that it will artificially introduce 'per gene' annotation so that users like yourself would be able to avoid the annoying repetitions. You must be aware however that such an approach has the potential of introducing conflicting annotation as it will be totally artificial. The correct 'per gene' annotation can only be corrected at the source.
Woah guys - it can't be "corrected" at source - that's not the case. The "correct" annotation is at the transcript level, which is what Ensembl provides and what gets Martified. Many people want:
(a) when results columns from Mart only have gene attributes, not to provide the entirely redundant rows of things being duplicated. For this I think we have come up with a solution
(b) options to have transcript-level information concatenated into gene level reports as (perhaps) comma separated lists
Both, I think (having thought about this more), are better handled generically in the Mart View/output layer than in Ensembl. Ensembl has the _correct_ annotation structure (transcript orientated); it is just that most people want a _gene_ level view. This should be in the BioMart area.
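To make option (b) concrete, here is a minimal sketch in Python of collapsing transcript-level rows into one row per gene, with the distinct values of each column joined into a comma separated list. It is only an illustration and not BioMart code: the function name collapse_to_gene, the column names and the IDs (G1, T1, ...) are all made up for the example, and rows are assumed to arrive as plain dicts of strings.

```python
from collections import OrderedDict


def collapse_to_gene(rows, gene_key="gene_id", sep=","):
    """Merge transcript-level rows into one row per gene.

    rows is an iterable of dicts of strings (a hypothetical stand-in for
    result rows). For every column, distinct values are kept in first-seen
    order and joined with sep.
    """
    merged = OrderedDict()  # gene id -> {column -> list of distinct values}
    for row in rows:
        cols = merged.setdefault(row[gene_key], OrderedDict())
        for col, value in row.items():
            values = cols.setdefault(col, [])
            if value not in values:
                values.append(value)
    return [
        {col: sep.join(values) for col, values in cols.items()}
        for cols in merged.values()
    ]


# Made-up example data: gene G1 has two transcripts, G2 has one.
rows = [
    {"gene_id": "G1", "transcript_id": "T1", "symbol": "ABC"},
    {"gene_id": "G1", "transcript_id": "T2", "symbol": "ABC"},
    {"gene_id": "G2", "transcript_id": "T3", "symbol": "XYZ"},
]
gene_rows = collapse_to_gene(rows)
```

In this sketch gene-level columns that repeat identically (symbol) stay as a single value, while transcript-level columns (transcript_id) become a list, so a gene with conflicting values in a column shows all of them rather than hiding the conflict.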
And - Arek - please stop characterising this as an Ensembl error.

Plus the fact that the majority of the BioMart team is part of the Ensembl group, most users don't care about which side of BioMart or Ensembl this problem lies on, they just want it solved!

Yes, 'correct' was not the right word here :) Apologies for my English.

Ensembl is correct :) and per transcript annotation is correct, as is any other data which goes into any mart. It is just not what users want in this particular instance - that's what I was trying to say.

Apologies to all the data providers, users etc.  No offense intended :)

Anyway, the problem remains the same: how to make the per transcript annotation appear to be on a per gene basis. As I said before, we are looking into this right now and considering all the options. We should be coming up with a solution soon which would work for Ensembl just the same as for any other mart trying to 'unify' data at a higher level.

a.

Let's
(a) put in the simple sort by gene id, don't print rows redundant to the previous (should be easy, right?)
(b) discuss how to think about concatenation, ideally in the software, not the denormalisation
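For what it is worth, step (a) can be sketched in a few lines of Python. This is an assumption-laden illustration, not the BioMart implementation: rows are taken to be tuples of strings with the gene id in the first column, the function name dedupe_sorted and the IDs are invented, and the sort is done on the whole row (gene id first) so that identical rows are guaranteed to end up next to each other.

```python
def dedupe_sorted(rows):
    """Sort rows (gene id in column 0) and drop any row equal to the previous one."""
    out = []
    previous = None
    # Sorting whole tuples orders by gene id first, then by the other columns,
    # which places exact duplicates on adjacent lines.
    for row in sorted(rows):
        if row != previous:
            out.append(row)
        previous = row
    return out


# Made-up rows with only gene-level attributes selected, so each transcript
# of a gene contributes an identical row.
rows = [
    ("G2", "XYZ"),
    ("G1", "ABC"),
    ("G1", "ABC"),
    ("G2", "XYZ"),
    ("G1", "ABC"),
]
unique_rows = dedupe_sorted(rows)
```

Only one row has to be remembered at a time, so the same comparison could be applied while streaming already sorted output.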
a.
Our second major problem stems from the fact that sometimes there is no information linked to particular microarray feature IDs. The count tab tells you how many out of your list were found but there is no information whatsoever about the ones that were not found. Manually finding which 50 out of a list of 1000 were not found is not easy. An output list of features not found, or inclusion of the not found items within the output with a short 'not found' comment next to them would be very useful.
In summary, for us the ideal situation would be if we could input a list of 1000 feature IDs and as output get a list of 1000 rows, 1 gene per row, in the same sequence as the input list, with either empty cells or a not found comment against those not found.
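The output described here could look roughly like the following Python sketch, which is an illustration only and not an existing BioMart feature. It assumes the results are available as a dict from feature ID to gene; the function names and the probe and gene IDs are invented for the example.

```python
def report_in_input_order(input_ids, hits, missing="not found"):
    """Return one (feature_id, gene) row per input ID, in input order.

    hits is a dict mapping feature ID to gene (a hypothetical lookup result);
    IDs without a hit get the missing marker instead of being dropped.
    """
    return [(fid, hits.get(fid, missing)) for fid in input_ids]


def not_found(input_ids, hits):
    """List the input IDs that have no hit, preserving input order."""
    return [fid for fid in input_ids if fid not in hits]


# Made-up input list and lookup result: probe_2 has no linked gene.
input_ids = ["probe_3", "probe_1", "probe_2"]
hits = {"probe_1": "G1", "probe_3": "G2"}

report = report_in_input_order(input_ids, hits)
missing_ids = not_found(input_ids, hits)
```

The report always has as many rows as the input list, so it can be pasted alongside the original list without realigning anything.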
Besides this particular feature, biomart is great and has made data mining of large data sets so much more accessible!
Thank you.


Regards
Rosienne
_______________________________________________________
Rosienne Farrugia
Division of Transfusion Medicine
Department of Haematology
University of Cambridge
Long Road
Cambridge
CB2 2PT

Tele: 01223 548008
Fax:  01223 548136
-------------------------------------------------------------------------------
Arek Kasprzyk
EMBL-European Bioinformatics Institute.
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, UK.
Tel: +44-(0)1223-494606
Fax: +44-(0)1223-494468
-------------------------------------------------------------------------------

