I am trying to sort some data. The data contains names, and I was trying
to sort them in the following manner.
*ORIGINAL DATA*      *SORTED DATA*
Rahul                shekhar
rahul                Sameer
RAHUL                rahul
shekar               Rahul
hans                 RAHul
kasper               kasper
Sameer               hans
This is a slightly customized sort: I want to sort the names in
lexicographic order first and then also take capitalization into
consideration.
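To make the intended ordering concrete, here is a minimal plain-Java sketch (no Crunch involved). It assumes the rule is "compare ignoring case first, break ties on the exact spelling, then reverse for descending order"; the class name NameOrder and its members are hypothetical and only serve this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NameOrder {
    // Assumed rule: case-insensitive comparison first; the exact
    // (case-sensitive) comparison only breaks ties such as rahul/Rahul/RAHUL.
    static final Comparator<String> ASCENDING =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // The sorted column in the example runs from "s" down to "h", so reverse it.
    static final Comparator<String> DESCENDING = ASCENDING.reversed();

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortNames(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(DESCENDING);
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "Rahul", "rahul", "RAHUL", "shekar", "hans", "kasper", "Sameer");
        // prints [shekar, Sameer, rahul, Rahul, RAHUL, kasper, hans]
        System.out.println(sortNames(input));
    }
}
```

The tie-break relies on lowercase letters sorting after uppercase ones in Java's natural String order, which is why "rahul" comes before "Rahul" once the comparator is reversed.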
Initially I tried the Sort API but was unsuccessful with it.
I then tried a couple of other approaches, explained below:
In the first solution, I emitted each name keyed by its starting
character into a /PTable/, and then collected all the values for each
key.
After that I took the collected values and used a /Comparator/ to
sort the data in each collection.
PTable<String, String> classifiedData = count.parallelDo(
    new NamesClassification(),
    Writables.tableOf(Writables.strings(), Writables.strings()));
PTable<String, Collection<String>> collectedValues =
    classifiedData.collectValues();
PCollection<Collection<String>> names = collectedValues.values();
PCollection<Collection<String>> sortedNames = names.parallelDo(
    "names Sorting", new NamesSorting(),
    Writables.collections(Writables.strings()));
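For readers who want to see the shape of this first approach without a cluster, the following is a rough in-memory analogue in plain Java. It is not Crunch code: the class GroupThenSort is hypothetical, and it assumes the key is the lower-cased first character and that each group is sorted with a case-insensitive, then case-sensitive, descending comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class GroupThenSort {
    // Assumed per-group ordering: ignore case first, break ties on exact case, descending.
    static final Comparator<String> WITHIN_GROUP =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder()).reversed();

    static Map<String, List<String>> groupAndSort(List<String> names) {
        // Keys are kept in descending order so that walking the map gives a total order.
        Map<String, List<String>> groups = new TreeMap<>(Comparator.<String>reverseOrder());
        for (String name : names) {
            if (name.isEmpty()) {
                continue; // nothing to classify
            }
            // Step 1: classify each name by its (lower-cased) starting character.
            String key = name.substring(0, 1).toLowerCase(Locale.ROOT);
            // Step 2: collect all values that share a key.
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        // Step 3: sort each collected group with the comparator.
        for (List<String> group : groups.values()) {
            group.sort(WITHIN_GROUP);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "Rahul", "rahul", "RAHUL", "shekar", "hans", "kasper", "Sameer");
        System.out.println(groupAndSort(input));
    }
}
```

One thing the sketch makes visible: sorting inside each group only yields a fully sorted result if the groups themselves are visited in key order. The TreeMap guarantees that here, but a distributed collection of groups may not come back in any particular order.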
I was not completely convinced by this approach, so I spent some more
time on the problem and found another way of doing the same thing.
In the second solution, I created my own writable type that implements
WritableComparable. I also implemented the mapping functions for it,
so that it can be used with Crunch WritableTypes.
class NamesComparable implements WritableComparable<NamesComparable> { ...... }
MapFn<String, NamesComparable> string_to_names = .........
MapFn<NamesComparable, String> names_to_string = .........
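Since the body of the comparable type is elided above, here is a hedged sketch of what such a type could look like. To stay runnable without Hadoop on the classpath it only implements java.lang.Comparable and mirrors the write/readFields method shapes of Hadoop's Writable; the class NameKey, its fields, and the roundTrip helper are hypothetical and are not the code from this post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for NamesComparable. With Hadoop available, the
// declaration would instead implement WritableComparable<NameKey>.
class NameKey implements Comparable<NameKey> {
    private String name = "";

    NameKey() {}                                  // writables need a no-arg constructor
    NameKey(String name) { this.name = name; }

    String get() { return name; }

    // Same method shape as Writable.write: serialize the wrapped name.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
    }

    // Same method shape as Writable.readFields: restore the wrapped name.
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
    }

    // Assumed ordering: ignore case first, then break ties on the exact spelling.
    @Override
    public int compareTo(NameKey other) {
        int byLetters = name.compareToIgnoreCase(other.name);
        return byLetters != 0 ? byLetters : name.compareTo(other.name);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof NameKey && name.equals(((NameKey) o).name);
    }

    @Override
    public int hashCode() { return name.hashCode(); }

    // Demo helper: serialize and deserialize a key through an in-memory buffer.
    static NameKey roundTrip(NameKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            NameKey copy = new NameKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NameKey copy = roundTrip(new NameKey("RAHUL"));
        System.out.println(copy.get() + " " + new NameKey("rahul").compareTo(new NameKey("Rahul")));
    }
}
```

In this sketch the two mapping functions would reduce to wrapping a String in a NameKey and calling get() to unwrap it. Because compareTo is written in ascending order, a descending sort over such keys would put "rahul" before "Rahul" before "RAHUL".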
I then used this type when converting the data that was read, and
sorted the result.
PCollection<String> readLines = pipeline.readTextFile(fileLoc);
PCollection<String> lines = readLines.parallelDo(
    new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input);
      }
    },
    stringToNames());
PCollection<String> sortedData = Sort.sort(lines, Order.DESCENDING);
I found both of these methods quite tricky; they give the feeling of
beating around the bush. Is there a better way of accomplishing the
same thing? Have I missed something?
If not, then I believe there is scope for a Sorting API that supports
some customization.
regards
Rahul