I have a data structure that is a variable-length array of strings. Call it a StringList. I am using StringLists as Hadoop keys. These objects sort lexicographically (e.g. ["apple", "banana"] < ["apple", "banana", "pear"] < ["apple", "pear"] < ["zucchini"]) and are equivalent if and only if all of their elements are equal. What is the best way to implement this object for Hadoop?
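To make the ordering concrete, here's a minimal plain-Java sketch of the element-wise comparison I have in mind (no Hadoop dependencies; the class and method names are just illustrative):

```java
public class StringListOrder {

    // Compare two string lists element by element; if one is a
    // prefix of the other, the shorter list sorts first.
    static int compare(String[] a, String[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // The ordering from above:
        // ["apple","banana"] < ["apple","banana","pear"] < ["apple","pear"] < ["zucchini"]
        assert compare(new String[]{"apple", "banana"},
                       new String[]{"apple", "banana", "pear"}) < 0;
        assert compare(new String[]{"apple", "banana", "pear"},
                       new String[]{"apple", "pear"}) < 0;
        assert compare(new String[]{"apple", "pear"},
                       new String[]{"zucchini"}) < 0;
        // Equal if and only if all elements are equal.
        assert compare(new String[]{"apple"}, new String[]{"apple"}) == 0;
    }
}
```

This is just the comparison logic in isolation; the real class would wrap it in a Hadoop `compareTo`.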
Currently I have implemented StringList as a class that extends ArrayWritable with Text as the value class. Its compareTo method just compares the string representations of the two StringList objects, since those representations preserve the ordering I want. This works, but I'm uncertain how it will perform at scale. To get the best performance, would I still have to write a raw comparator for this class, or does ArrayWritable provide one for me? In lieu of writing a raw comparator myself, should I just implement StringList as an Avro object instead? I believe Avro gives you raw comparators for free, but I haven't dug into this.
