Hi Rosienne, I'm a user, not a developer, but we've used biomart for a similar purpose and can sympathise with you.
The developers will almost certainly reply that biomart is a tool for 1001 purposes, of which microarray annotation is just one. Remember that you're referring to the Ensembl biomarts, which are again just some of many marts. Some may be gene based, and some may even be microarray reporter based if you are lucky (then you'll get what you were asking for automatically).

However, I think you can work around these problems. You should be able to maintain a master sheet of 1000 rows in Excel: open up another sheet and load in the mart output (with the microarray feature ID in the left-most column), then use VLOOKUP with exact matching (FALSE as the last argument) to pull the data you need across into the master sheet. It will put #N/A where there is no match, which also tells you which features were not found. Duplicate lines in the mart output will probably not be a problem, but if they are you can remove them with the Unix sort tool and its -u (unique) option.

You're always going to run into problems with multiple outputs for single genes, however, e.g. InterPro domains or GO terms. It will be difficult to get Excel to concatenate these on one line, maybe impossible; I'm not sure, as I'm not an Excel expert. In the end you'll need to decide exactly what you want and write a small script to do it, and that's a specialist job that biomart certainly isn't made to do.

cheers,
Bob.

Rosienne writes:
> Hi,
>
> a few weeks ago I was attending an Open Door Workshop at the Sanger. I
> had occasion to speak to one of your team and mention a couple of
> problems we regularly encounter when using biomart. I was advised to
> post to this address.
>
> I, and my colleagues, use biomart to output gene related information
> for lists of microarray feature IDs. Even though we untick the ensembl
> transcript ID box we still get an output for each transcript. In some
> cases, where genes have 9 documented transcripts we get 9 perfectly
> replicated entries. When dealing with lists of over a thousand genes
> each time this gets very confusing and generally makes excel stop
> responding!
>
> We wonder if in future re-works of the tool a gene specific rather than
> a transcript specific output can be made available. We are aware that
> for people working on only one, or a handful of genes, getting all the
> transcript specific information is essential. However, it would make
> life a lot easier for scientists like us who handle large gene lists
> if we could specifically select to obtain only gene specific outputs, 1
> gene = 1 row of output.
>
> Our second major problem stems from the fact that sometimes there is no
> information linked to particular microarray feature IDs. The count tab
> tells you how many out of your list were found but there is no
> information whatsoever about the ones that were not found. Manually
> finding which 50 out of a list of 1000 were not found is not easy. An
> output list of features not found, or inclusion of the not found items
> within the output with a short 'not found' comment next to them would
> be very useful.
>
> In summary, for us the ideal situation would be if we could input a
> list of 1000 feature IDs and as output get a list of 1000 rows, 1 gene
> per row, in the same sequence as the input list, with either empty
> cells or a not found comment against those not found.
>
> Besides this particular feature, biomart is great and has made data
> mining of large data sets so much more accessible!
> Thank you.
>
> Regards
> Rosienne
> _______________________________________________________
> Rosienne Farrugia
> Division of Transfusion Medicine
> Department of Haematology
> University of Cambridge
> Long Road
> Cambridge
> CB2 2PT
>
> Tele: 01223 548008
> Fax: 01223 548136
>

--
Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups | Division of Cell and Molecular Biology | Imperial College London | Phone +442075941945 | Email [EMAIL PROTECTED]
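
For what it's worth, the kind of small script mentioned in the reply above might look roughly like the Python sketch below. It is only an illustration under some assumptions: the mart output has been saved as tab-separated text with a header row, and the feature ID column is called feature_id. That column name, the function name collapse_mart_output and the sample probe and gene names are all invented here; none of them come from biomart itself. The script gives one row per input ID, in the input order, joins multiple values (e.g. several GO terms) with a semicolon, and writes "not found" for IDs with no match.

```python
import csv
import io


def collapse_mart_output(feature_ids, mart_tsv, key="feature_id", sep=";"):
    """Return a header row plus one row per input feature ID, in input order.

    mart_tsv is an open file (or any iterable of lines) holding
    tab-separated mart output with a header row. `key` is the assumed
    name of the feature ID column, not a real biomart attribute name.
    """
    reader = csv.DictReader(mart_tsv, delimiter="\t")
    columns = [c for c in reader.fieldnames if c != key]

    # feature ID -> column -> distinct values, in order of first appearance
    seen = {}
    for row in reader:
        per_column = seen.setdefault(row[key], {c: [] for c in columns})
        for c in columns:
            value = row[c]
            if value and value not in per_column[c]:
                per_column[c].append(value)

    result = [[key] + columns]
    for fid in feature_ids:
        if fid in seen:
            # Several values for one feature are joined onto a single line.
            values = [sep.join(seen[fid][c]) for c in columns]
        else:
            values = ["not found"] * len(columns)
        result.append([fid] + values)
    return result


if __name__ == "__main__":
    # Invented example data, not real mart output.
    mart = io.StringIO(
        "feature_id\tgene\tgo_term\n"
        "probe_1\tGENE_A\tGO:0000001\n"
        "probe_1\tGENE_A\tGO:0000002\n"
        "probe_1\tGENE_A\tGO:0000001\n"
        "probe_3\tGENE_C\t\n"
    )
    for line in collapse_mart_output(["probe_1", "probe_2", "probe_3"], mart):
        print("\t".join(line))
```

Writing the result back out as tab-separated text gives a file that Excel opens with exactly as many data rows as there were input IDs, so no VLOOKUP step is needed afterwards.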
