Hi all,

In Avro, map keys are restricted to strings:
http://avro.apache.org/docs/current/spec.html#Maps

This has caused me some pain, and I have also found several emails on
the mailing list about it, as well as some tickets (e.g. AVRO-1147), one
of which is a feature proposal:
[AVRO-680](https://issues.apache.org/jira/browse/AVRO-680).

In my use case there are Thrift objects that have to be converted to
Avro, and these Thrift objects use different types as map keys (which,
by the way, is fine for Thrift, C++, C#, Java, JS, Perl, PHP, Python &
Ruby). With automatic Thrift->Avro conversion, the converter simply
throws an exception:
`main/java/org/apache/avro/thrift/ThriftData.java:222`. Assuming the
Thrift objects cannot be changed, building a workaround seems really
wrong and ugly, at least until it is clear what the reasons for this
restriction are...

I am really curious to find out why it was done this way... (and also
to make it better).

So, I looked this up and found
[AVRO-9](https://issues.apache.org/jira/browse/AVRO-9). I interpret the
reasons for this restriction as:
1. Ease of integration with the standard map data structure of "many
scripting languages".
2. Simpler implementation, with maps treated as dynamic records where
the key name is mapped to a field name from instance to instance.

I have also found an unanswered email about reason 1:
http://search-hadoop.com/m/J08Te2HvNbT1

So, I have real doubts about the "many scripting languages" argument,
especially if we narrow them down to the subset that Avro actually
supports after several years of the project's life (and plans to support
in the future).

I checked the following languages using repl.it, http://codepad.org/ and
http://hyperpolyglot.org/scripting,
and found that it is possible to use at least int and float as a map
key in them:
* ruby
* php
* python
* js
* perl
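
As a quick illustration for one of these languages (Python only; the
variable names are made up for this example), a dictionary keeps int and
float keys as they are:

```python
# Python dicts accept any hashable value as a key, including int and float.
counts = {1: "one", 2.5: "two and a half", "three": 3}

# The keys keep their original types; nothing is coerced to str.
key_types = sorted(type(k).__name__ for k in counts)
print(key_types)

# Lookups use the original int/float values directly.
print(counts[1], counts[2.5])
```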

So, this no longer looks like a valid argument, while the absence of
this feature still causes pain for me and for some other people, judging
by the emails and Jira tickets.

Also, it looks like there was a similar limitation in Cassandra, and they
[got rid of it](https://issues.apache.org/jira/browse/CASSANDRA-767).

I have worked with Thrift for some time and have not experienced any
problems with integers/shorts as map keys (except for Thrift->Avro
conversion). And the benefit of saving some bytes per record is
considerable, because it scales linearly with the number of records.
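
To put a rough number on the bytes saved, here is a sketch under an
assumption: a hypothetical int map key would be written the way Avro's
binary encoding already writes ints (zig-zag, then variable-length),
while today's string key is a length prefix plus the UTF-8 bytes of the
decimal text. The function names are made up for this example.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as zig-zag followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag mapping for values in the 64-bit range
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def int_key_bytes(n: int) -> bytes:
    # Hypothetical: the key stored as an Avro int.
    return zigzag_varint(n)


def string_key_bytes(n: int) -> bytes:
    # Today's option: the key stored as a string holding the decimal text.
    data = str(n).encode("utf-8")
    return zigzag_varint(len(data)) + data


for key in (7, 1000, 123456):
    print(key, len(int_key_bytes(key)), len(string_key_bytes(key)))
```
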
Also, in protobuf, AFAIK, there are no dictionaries at all; lists of
pairs are used instead, and it is possible to use any type as the key
(http://stackoverflow.com/questions/4194845/dictionary-in-protocol-buffers?rq=1).
This is also one of the workarounds for this restriction in Avro, but it
doesn't solve the case of Thrift->Avro conversion.
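
A minimal sketch of that list-of-pairs workaround, for illustration
only: the schema is ordinary Avro JSON (an array of records), but the
record name `IntStringPair`, the field names and the two helper
functions are my own assumptions, not something Avro provides.

```python
import json

# Avro schema for a "map" with int keys, expressed as an array of
# key/value records. Record and field names are illustrative.
PAIRS_SCHEMA = {
    "type": "array",
    "items": {
        "type": "record",
        "name": "IntStringPair",
        "fields": [
            {"name": "key", "type": "int"},
            {"name": "value", "type": "string"},
        ],
    },
}


def map_to_pairs(m):
    """Flatten a dict into a list of {"key", "value"} records."""
    return [{"key": k, "value": v} for k, v in m.items()]


def pairs_to_map(pairs):
    """Rebuild the dict from a list of {"key", "value"} records."""
    return {p["key"]: p["value"] for p in pairs}


original = {1: "one", 2: "two"}
pairs = map_to_pairs(original)
print(json.dumps(PAIRS_SCHEMA, indent=2))
print(pairs)
```

The int keys survive the round trip, but every reader and writer has to
know about the convention, which is exactly the kind of extra work an
automatic converter cannot guess.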

So, regarding reason 1, I have serious doubts. I am really interested in
the opinion of Doug Cutting and the community.

Regarding reason 2, my concern is that there may be some algorithmic
limitations behind the restriction, or other parts of the system that
rely heavily on it (MapReduce, Pig, etc.). But my brief research did not
turn up any reasoning for why the key type should be restricted to
String. I also admit that it may be somewhat more complex to implement
than the string-only solution, but it would definitely eliminate all the
workarounds and pain that Avro users have with it (and would generally
lead to less complexity overall).
So, in this case, IMHO, more is less :)

I am really looking forward to feedback from the community, so that we
can discuss and rethink this restriction.

Best regards,
Michael Pershyn
