Hi

My two cents:

On 04/06/15 19:50, James W. MacDonald wrote:
> In other words, for me it is a common practice to do something like this:
>
> fit <- lmFit(eset, design)
> fit2 <- eBayes(fit)
> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
> gns <- gns[!duplicated(gns[,1]),]
> fit2$genes <- gns
>
> I add in the step where dups are removed because I already know they are
> there. But a naive user might instead do
>
> fit2$genes <- select(<chippackage>, featureNames(eset),
>                      c("ENTREZID","SYMBOL"))

I'm not even that happy with James' first solution, as it relies on the row order still being correct after the duplicates are removed. I'd feel safer using 'match' to ensure that. (What if an Entrez ID is not found in the annotation DB? Will we get a line with NA, or will the line simply be missing? The latter would break James' code.)
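
To make the 'match' idea concrete, here is a rough sketch. The data frame 'ann' is a made-up stand-in for what 'select' might return, and the probe IDs and annotations are invented for illustration; I have not checked this against a real chip package.

```r
# Hypothetical stand-in for the result of select(): probe "p2" has two
# records, and probe "p3" has no row at all.
ann <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p4"),
  ENTREZID = c("10", "20", "21", "40"),
  SYMBOL   = c("A", "B", "B2", "D"),
  stringsAsFactors = FALSE
)
keys <- c("p1", "p2", "p3", "p4")   # stands in for featureNames(eset)

# match() gives the index of the first hit for each key, or NA if there is
# none, so the result has exactly one row per key, in the order of 'keys'.
gns <- ann[match(keys, ann$PROBEID), ]
gns$PROBEID <- keys      # restore the key for rows that had no hit
rownames(gns) <- keys    # assumes 'keys' contains no duplicates
```

With this, a missing key ends up as a row of NAs rather than silently shifting all later rows.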

What users really want here is a way to get the "preferred" symbol for an Entrez ID, and for lack of this, they simply accept a random one or the first one (in some unspecified order). So we should have a function, maybe 'select1', that selects one and only one hit for each query value.

  select1(x, keys, columns, keytype, requireUnique=FALSE, ... )

This would query the AnnotationDbi object 'x' as 'select' does, but return a data frame with the columns specified in 'columns' and with the vector that was passed as 'keys' as row names, thus guaranteeing that each line in the data frame corresponds to one query key. If there are multiple records for a key, the first one is used, unless 'requireUnique' is set, in which case an error is raised. And if no record is present for a key, the data frame contains a row of NAs for this key.
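
A rough sketch of how this could look, not a finished implementation: 'select1' does not exist in AnnotationDbi, and the helper 'oneRowPerKey' is a name I made up here. The sketch assumes that 'keys' has no duplicates and that the result of 'select' has a column named after 'keytype'.

```r
# Hypothetical helper: reduce a select()-style data frame 'res' to exactly
# one row per entry of 'keys', using the first record found for each key.
oneRowPerKey <- function(res, keys, keycol, requireUnique = FALSE) {
  res <- unique(res)                       # drop exact duplicate records
  hits <- res[[keycol]][res[[keycol]] %in% keys]
  if (requireUnique && anyDuplicated(hits) > 0)
    stop("multiple records found for at least one key")
  # match() picks the first record per key and yields NA for missing keys
  out <- res[match(keys, res[[keycol]]), setdiff(names(res), keycol),
             drop = FALSE]
  rownames(out) <- keys                    # assumes 'keys' are unique
  out
}

# Proposed wrapper (an assumption, not an existing function); it only
# needs AnnotationDbi when it is actually called.
select1 <- function(x, keys, columns, keytype, requireUnique = FALSE, ...) {
  res <- AnnotationDbi::select(x, keys = keys, columns = columns,
                               keytype = keytype, ...)
  oneRowPerKey(res, keys, keytype, requireUnique)
}

# Small demonstration on invented data, without any annotation package
res <- data.frame(PROBEID = c("p1", "p2", "p2"),
                  SYMBOL  = c("A", "B", "B2"),
                  stringsAsFactors = FALSE)
one <- oneRowPerKey(res, c("p1", "p2", "p3"), "PROBEID")
```

In the demonstration, "p2" gets its first record and "p3" gets a row of NAs; with requireUnique=TRUE the call would stop because "p2" has two records.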

This would be quite convenient for any kind of ID conversion issues.

  Simon

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
