Mahout 0.3 Plan and other changes

Robin Anil Thu, 04 Feb 2010 04:29:44 -0800

1st Thing:

Since I am converting vectorization to sequence files, I am going to
change the Lucene Driver to write the dictionary to a sequence file instead
of a tab-separated text file. I will also change the cluster dumper to read
the dictionary from the sequence file.


I can go about this in one of the following ways:

Stick to only the SequenceFile format for the dictionary and remove the
tab-separated format from the system

OR

Add an Iterator<Writable, Writable> interface, where a SequenceFile
reader/writer is one implementation and a tab-separated file reader/writer
is another
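To make the second option concrete, here is a rough sketch of what such an abstraction could look like. It is only an illustration under assumptions: the names DictionarySource, TabSeparatedDictionarySource and DictionaryLoader are hypothetical, plain String/Integer stand in for the Writable types, and the text format is assumed to be one "term<TAB>index" pair per line. A SequenceFile-backed implementation would sit behind the same interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical abstraction: any dictionary source yields (term, index) pairs. */
interface DictionarySource extends Iterable<Map.Entry<String, Integer>> {
}

/** One implementation: reads lines assumed to look like "term<TAB>index". */
final class TabSeparatedDictionarySource implements DictionarySource {
  private final Map<String, Integer> entries;

  private TabSeparatedDictionarySource(Map<String, Integer> entries) {
    this.entries = Collections.unmodifiableMap(entries);
  }

  /** Reads the whole dictionary eagerly; I/O failures surface as unchecked exceptions. */
  static TabSeparatedDictionarySource fromReader(Reader in) {
    Map<String, Integer> parsed = new LinkedHashMap<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
          continue; // skip blank lines
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
          throw new IllegalArgumentException("Expected term<TAB>index, got: " + line);
        }
        String term = line.substring(0, tab);
        int index = Integer.parseInt(line.substring(tab + 1).trim());
        parsed.put(term, index);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new TabSeparatedDictionarySource(parsed);
  }

  @Override
  public Iterator<Map.Entry<String, Integer>> iterator() {
    return entries.entrySet().iterator();
  }
}

/** A consumer such as the cluster dumper depends only on the interface. */
final class DictionaryLoader {
  private DictionaryLoader() {
  }

  /** Builds the index -> term lookup needed to print readable cluster terms. */
  static Map<Integer, String> invert(DictionarySource source) {
    Map<Integer, String> byIndex = new HashMap<>();
    for (Map.Entry<String, Integer> entry : source) {
      byIndex.put(entry.getValue(), entry.getKey());
    }
    return byIndex;
  }
}
```

With this shape, code that reads the dictionary never needs to know which file format backs it, which is the main argument for keeping both formats behind one interface.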


2nd Thing:
Lucene seems too slow for dictionary lookups during vectorization:
one 6-hour m/r pass as opposed to two 1.5-hour passes on Wikipedia. That is,
reading the Wikipedia dump twice with a HashMap is faster than reading it
once using a Lucene index.
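For anyone unfamiliar with the two-pass approach, here is a minimal single-process sketch of the idea. It is not the actual m/r code: the class name TwoPassVectorizer, the whitespace tokenizer and the in-memory list of documents are assumptions made for illustration. The first pass builds the dictionary in a HashMap, and the second pass looks each term up in that map to produce sparse term-frequency vectors.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of dictionary-based vectorization in two passes. */
final class TwoPassVectorizer {
  private TwoPassVectorizer() {
  }

  /** Pass 1: assign each distinct term a dense index in first-seen order. */
  static Map<String, Integer> buildDictionary(List<String> docs) {
    Map<String, Integer> dictionary = new HashMap<>();
    for (String doc : docs) {
      for (String term : tokenize(doc)) {
        // dictionary.size() is the next free index when the term is new
        dictionary.putIfAbsent(term, dictionary.size());
      }
    }
    return dictionary;
  }

  /** Pass 2: map a document to a sparse vector of index -> term count. */
  static Map<Integer, Integer> vectorize(String doc, Map<String, Integer> dictionary) {
    Map<Integer, Integer> vector = new TreeMap<>();
    for (String term : tokenize(doc)) {
      Integer index = dictionary.get(term);
      if (index != null) {
        vector.merge(index, 1, Integer::sum);
      }
    }
    return vector;
  }

  /** Simplistic tokenizer (assumption): lower-case, split on whitespace. */
  static String[] tokenize(String doc) {
    String trimmed = doc.trim().toLowerCase(Locale.ROOT);
    return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }
}
```

The data is read twice, but every lookup in the second pass is an in-memory hash probe rather than an index query.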


3rd Thing:
I am planning to convert the launcher code to implement ToolRunner. Would
anyone volunteer to help me with that?

4th Thing:
Any thoughts on how we can integrate the output of the n-gram map/reduce
into generating vectors from a dataset?

5th Thing: The release
Should we fix a date for the 0.3 release? We should look to improve quality
in this release, i.e. each of us runs the parts of the code we haven't
tested (for example, I have run Bayes and FP-Growth many times, so I will
focus on running the clustering algorithms and trying out various options to
see if there are any issues) and provides feedback so that whoever wrote the
code can help tweak it.

Maybe we could time the code when we run it and put the results on the wiki?

Can we set a sprint week when we will be doing this?



Comments awaited
Robin
