1st thing: Since I am converting vectorization to use sequence files, I was going to change the Lucene driver to write the dictionary to a sequence file instead of a tab-separated text file. I will also change the cluster dumper to read the dictionary from the sequence file.
I can go about this in one of these ways:
Stick to only the SequenceFile format for the dictionary and remove the tab-separated format from the system,
OR
add an Iterator<Writable, Writable> interface where the SequenceFile reader/writer is one implementation and the tab-separated file reader/writer is another.

2nd thing: Lucene seems too slow for querying the dictionary during vectorization: 1 x 6 hour m/r as opposed to 2 x 1.5 hour on Wikipedia, i.e. a double read of the Wikipedia dump with a hashmap is faster than a single read using a Lucene index.

3rd thing: I am planning to convert the launcher code to use ToolRunner. Would anyone volunteer to help me with that?

4th thing: Any thoughts on how we can integrate the output of the n-gram map/reduce to generate vectors from a dataset?

5th thing, the release: Shall we fix a date for the 0.3 release? We should look to improve quality in this release, i.e. each of us runs the parts of the code we haven't tested ourselves and provides feedback so that whoever wrote it can help tweak it. For example, I have run Bayes and FP-Growth many times, so I will focus on running the clustering algorithms and try out various options to see if there are any issues. Maybe we could also time the code when we run it and put the numbers on the wiki? Can we set a sprint week when we will be doing this?

Comments awaited.

Robin
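
P.S. To make the second dictionary option a bit more concrete, here is a rough sketch of what a format-independent dictionary reader might look like. It is plain Java with no Hadoop types so that it stands alone. All names (DictionarySource, TabSeparatedDictionarySource, DictionaryLoader) are hypothetical, and the "term<TAB>index" line layout is an assumption for illustration, not necessarily what the Lucene driver writes today. A SequenceFile-backed implementation would sit behind the same interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical abstraction: any dictionary source yields (term, index) pairs. */
interface DictionarySource extends Iterable<Map.Entry<String, Integer>> {
}

/** One implementation: parses lines assumed to look like "term<TAB>index". */
final class TabSeparatedDictionarySource implements DictionarySource {
  private final Map<String, Integer> entries = new LinkedHashMap<>();

  private TabSeparatedDictionarySource() {
  }

  /** Parses the whole text eagerly; blank lines are skipped. */
  static TabSeparatedDictionarySource parse(String text) {
    TabSeparatedDictionarySource source = new TabSeparatedDictionarySource();
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
          continue;
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
          throw new IllegalArgumentException("Expected term<TAB>index, got: " + line);
        }
        String term = line.substring(0, tab);
        int index = Integer.parseInt(line.substring(tab + 1).trim());
        source.entries.put(term, index);
      }
    } catch (IOException e) {
      // Not expected for an in-memory reader; rethrown unchecked to keep callers simple.
      throw new UncheckedIOException(e);
    }
    return source;
  }

  @Override
  public Iterator<Map.Entry<String, Integer>> iterator() {
    return Collections.unmodifiableMap(entries).entrySet().iterator();
  }
}

/** Consumer side: depends only on the interface, not on the file format. */
final class DictionaryLoader {
  private DictionaryLoader() {
  }

  /** Builds the index -> term lookup that something like the cluster dumper would need. */
  static Map<Integer, String> invert(DictionarySource source) {
    Map<Integer, String> termsByIndex = new HashMap<>();
    for (Map.Entry<String, Integer> entry : source) {
      termsByIndex.put(entry.getValue(), entry.getKey());
    }
    return termsByIndex;
  }
}
```

With something like this, the consumer would call DictionaryLoader.invert(...) on whichever source it is handed, and dropping the tab-separated format later would mean deleting one class rather than touching the callers.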