Hi Jim,

I think we should choose the biomaRt model, that is, duplicated keys are
allowed but silently ignored.

Note that this is also the SQL model. When you do

  SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)

duplicated keys don't generate duplicates in the output.
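
As a rough base R analogy (not SQL itself, and not what select() does
internally), a set-membership filter on a made-up lookup table behaves
the same way. The names 'tbl' and 'keys' and the data are invented for
illustration:

```r
# Toy lookup table standing in for a database table (made-up data).
tbl <- data.frame(key = c("k1", "k2", "k3"),
                  value = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Keys with a duplicate, as a user might supply them.
keys <- c("k2", "k1", "k2")

# Set-membership filter, analogous to WHERE key IN ('k2', 'k1', 'k2'):
# each table row is either kept or dropped, so "k2" shows up only once,
# and rows come back in table order, not in the order of 'keys'.
res <- tbl[tbl$key %in% keys, ]
print(res)
```

Here the rows come back in table order because of how R subsetting
works; with a real SELECT the order is unspecified unless ORDER BY is
used.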

Also note that, as with SELECT, even if the keys supplied to
biomaRt::getBM() (via the 'values' arg) don't contain duplicates
and all the mappings are 1-to-1, biomaRt::getBM() is not
guaranteed to preserve their order.

Generally speaking, having duplicates in the input produce duplicates
in the output is useful in vectorized operations when the output
is expected to be parallel to the input. Vectorized operations also
need to propagate NAs and to preserve order. However, like SELECT
and biomaRt::getBM(), select() cannot produce an output that is
parallel to the input *in general*.
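
To make the distinction concrete, here is a minimal base R sketch on
made-up tables ('one2one', 'one2many' and the key names are all
hypothetical). It only illustrates the idea of a parallel output and
is not how select() is implemented:

```r
# Made-up 1:1 and 1:many mapping tables.
one2one  <- data.frame(key = c("k1", "k2"), value = c("a", "b"),
                       stringsAsFactors = FALSE)
one2many <- data.frame(key = c("k1", "k1", "k2"), value = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

# Input with a duplicate, an NA and an unknown key.
keys <- c("k2", NA, "k1", "k2", "k9")

# 1:1 case: match() yields exactly one result per input key, keeping
# order and duplicates, and returning NA for NA or unknown keys.
parallel <- one2one$value[match(keys, one2one$key)]

# 1:many case: a key can hit several rows, so no flat vector of
# length(keys) can hold all the values. A list keeps the output
# parallel to the input, but it is no longer a flat table.
# unname() drops the names that split() derives from the keys.
by_key <- split(one2many$value, one2many$key)
parallel_list <- unname(by_key[match(keys, names(by_key))])
```

In the 1:1 case 'parallel' has one element per input key. In the
1:many case only the list form stays parallel, with NULL entries for
NA or unknown keys.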

It seems that the current philosophy for select() is to emit a note
or a warning every time the output is not parallel to the input.
Personally I find this too noisy and not that useful.


On 11/20/2015 02:30 PM, James W. MacDonald wrote:
There is an inconsistency in how select() works in AnnotationDbi when a
user passes in duplicated keys to be mapped, depending on whether the
mapping is 1:1 or 1:many. It's easiest to show using an example.

select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
'select()' returned many:1 mapping between keys and columns
1        1   A1BG
2        1   A1BG
3        1   A1BG

select(org.Hs.eg.db, rep("1", 3), "GO")
'select()' returned many:many mapping between keys and columns
1        1 GO:0003674       ND       MF
2        1 GO:0003674       ND       MF
3        1 GO:0003674       ND       MF

This is obviously a bug. A single query for that ID results in this:

select(org.Hs.eg.db, "1", "GO")
'select()' returned 1:many mapping between keys and columns
1        1 GO:0003674       ND       MF
2        1 GO:0005576      IDA       CC
3        1 GO:0005615      IDA       CC
4        1 GO:0008150       ND       BP
5        1 GO:0070062      IDA       CC
6        1 GO:0072562      IDA       CC

So the returned results are completely borked.

However, the question I have is what should be returned? To be consistent
with the first example, it should be the output expected for a single key,
repeated three times, which I have patched AnnotationDbi to do:

select(org.Hs.eg.db, rep("1", 3), "GO")
'select()' returned many:many mapping between keys and columns
1         1 GO:0003674       ND       MF
2         1 GO:0005576      IDA       CC
3         1 GO:0005615      IDA       CC
4         1 GO:0008150       ND       BP
5         1 GO:0070062      IDA       CC
6         1 GO:0072562      IDA       CC
7         1 GO:0003674       ND       MF
8         1 GO:0005576      IDA       CC
9         1 GO:0005615      IDA       CC
10        1 GO:0008150       ND       BP
11        1 GO:0070062      IDA       CC
12        1 GO:0072562      IDA       CC
13        1 GO:0003674       ND       MF
14        1 GO:0005576      IDA       CC
15        1 GO:0005615      IDA       CC
16        1 GO:0008150       ND       BP
17        1 GO:0070062      IDA       CC
18        1 GO:0072562      IDA       CC

So, two questions.

    1. Should duplicate keys be allowed, or should duplicates be removed
    before querying the database, preferably with a message saying that dups
    were removed?
    2. If duplicate keys should be allowed, then to be consistent, I will
    just commit the patch I have made to both devel and release.
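
For reference, the two candidate behaviours can be sketched in base R
on a toy table. The table 'go' and its GO-like IDs are placeholders
invented for illustration, and neither snippet is the actual
AnnotationDbi code or the patch mentioned above:

```r
# Toy 1:many table; the IDs are placeholders, not real annotations.
go <- data.frame(key = c("1", "1", "2"),
                 go  = c("GO:A", "GO:B", "GO:C"),
                 stringsAsFactors = FALSE)

keys <- rep("1", 3)

# Option A: duplicated keys are ignored, so each matching row is
# returned once no matter how often its key was supplied.
dedup <- go[go$key %in% unique(keys), ]

# Option B: look up each supplied key in turn and stack the results,
# so the full per-key output is repeated once per duplicate.
expanded <- do.call(rbind, lapply(keys, function(k) go[go$key == k, ]))
rownames(expanded) <- NULL
```

With three copies of a key that maps to two rows, option A returns 2
rows and option B returns 6.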



