On Thu, Feb 4, 2010 at 7:28 AM, Robin Anil <robin.a...@gmail.com> wrote:

> Since I was converting vectorization to sequence files, I was going to
> change the Lucene Driver to write the dictionary to a sequence file instead of a
> tab-separated text file. I will also change the cluster dumper to read the
> dictionary from the sequence file.

Sounds good.

> An Iterator<Writable, Writable> interface, where a SequenceFile reader/writer is
> one implementation and a tab-separated file reader/writer is another

I like this, but how about also providing a utility to convert the
tab-delimited dict format to the SequenceFile format? That way there's a
migration path for old datasets.
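
Something like this minimal sketch is what I have in mind, assuming the old
dictionary is laid out as term<TAB>index (the class name and column layout
here are hypothetical, not what the current Driver actually writes):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Hypothetical one-shot converter: tab-delimited dictionary -> SequenceFile. */
public final class DictionaryToSequenceFile {

  public static void main(String[] args) throws IOException {
    String dictPath = args[0];        // old tab-delimited dictionary
    Path output = new Path(args[1]);  // SequenceFile to create

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    BufferedReader reader = new BufferedReader(new FileReader(dictPath));
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, IntWritable.class);
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        // Assumed layout: term <TAB> index; adjust the split if the real
        // dictionary carries extra columns (e.g. doc freq).
        String[] fields = line.split("\t");
        writer.append(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}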

> 2nd thing:
> Lucene seems too slow for querying the dictionary during vectorization:
> 1 x 6-hour m/r as opposed to 2 x 1.5-hour on Wikipedia, i.e. a double read of the
> Wikipedia dump with a hashmap is faster than a single read using a Lucene
> index

Is the double-read approach described in one of the previous threads
discussing this issue? Just curious how it works.

> 3rd thing:
> I am planning to convert the launcher code to implement ToolRunner. Anyone
> volunteer to help me with that?

Sure, I can help out. What classes need to be updated? I've patched
the clustering code in the past, so that's probably a natural start.
Sean, I'll take a look at AbstractJob and what would be involved in
re-using it in the clustering code.

With ToolRunner, we get GenericOptionsParser for free, and the
launcher classes must implement Tool and Configurable, right?
ToolRunner is specific to the 0.20 API, isn't it?
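
Roughly the pattern I'm picturing for each launcher, as a sketch (the driver
name is made up; extending Configured supplies the Configurable half of Tool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Hypothetical launcher skeleton using the standard Tool/ToolRunner pattern. */
public class ExampleClusteringDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already has -D, -fs, -jt, etc. applied by GenericOptionsParser.
    Configuration conf = getConf();
    // ... build and submit the job(s) here ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options and passes the rest on to run().
    System.exit(ToolRunner.run(new Configuration(), new ExampleClusteringDriver(), args));
  }
}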

I did notice that Eclipse was complaining about GenericOptionsParser
last night because commons-cli 1.x wasn't available. I had to remove
its exclusion in the parent pom to get things to work properly. Has anyone
else run into this, or is it something funky in my environment?

> 4th thing:
> Any thoughts on how we can integrate the output of the n-gram map/reduce to
> generate vectors from a dataset?

So are you speaking of n-grams in general, or the output of the colloc
work? I suppose I should wrap up the process of writing the top
collocations to a file that can be read into a bloom filter and
integrated into the tokenization phase of the document vectorization
process. The document vectorization code could then use the shingle
filter to produce n-grams and emit only those that pass the bloom
filter.
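
Roughly what I mean by the bloom-filter gate, as a sketch using Hadoop's
BloomFilter (the class name and sizing numbers are made up; the tokenization
phase would call accept() on each shingle the shingle filter produces):

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

/** Hypothetical gate: keep only n-grams that appear in the top-collocations set. */
public final class CollocationGate {

  private final BloomFilter filter;

  public CollocationGate(Iterable<String> topCollocations) {
    // Sizing is a placeholder; vectorSize/nbHash would be tuned to the real set.
    filter = new BloomFilter(1 << 20, 5, Hash.MURMUR_HASH);
    for (String ngram : topCollocations) {
      filter.add(new Key(ngram.getBytes()));
    }
  }

  /** Called on each shingle produced during tokenization; true means emit it. */
  public boolean accept(String ngram) {
    return filter.membershipTest(new Key(ngram.getBytes()));
  }
}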

I'm looking for some feedback on MAHOUT-242 related to this; in
particular, on the best way to produce the set of top collocations. Any
input there would be helpful.

Robin, have you considered adding a step to the document vectorization
process that would produce output that's a token stream instead of a
vector?

Instead of:

Document Directory -> Document Sequence File
Document Sequence File -> Document Vectors + Dictionary

how about:

Document Directory -> Document Sequence File
Document Sequence File -> Document Token Streams
Document Token Streams -> Document Vectors + Dictionary

This way, something like the colloc/n-gram process would read the
output of the second pass (Document Token Streams file) instead of
having to re-tokenize everything simply to obtain token streams.
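
As a rough sketch of that extra pass, assuming Mahout's StringTuple fits as
the value type and with whitespace splitting standing in for whatever Lucene
Analyzer the real job would use (the mapper name is hypothetical):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.common.StringTuple;

/** Hypothetical tokenizing pass: (docId, raw text) -> (docId, token stream). */
public class TokenStreamMapper extends Mapper<Text, Text, Text, StringTuple> {

  @Override
  protected void map(Text docId, Text rawText, Context context)
      throws IOException, InterruptedException {
    StringTuple tokens = new StringTuple();
    for (String token : rawText.toString().toLowerCase().split("\\s+")) {
      if (token.length() > 0) {
        tokens.add(token);
      }
    }
    // Downstream jobs (vectorization, colloc counting) read this SequenceFile of
    // token streams instead of re-tokenizing the raw documents.
    context.write(docId, tokens);
  }
}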

> 5th thing, the release:
> Shall we fix a date for the 0.3 release? We should look to improve quality in this
> release, i.e. by running the parts of the code each of us hasn't tested
> (for example, I have run Bayes and FP-Growth many times, so I will focus on
> running the clustering algorithms and trying out various options to see if there
> is any issue) and providing feedback so that whoever wrote it can help tweak it.

It is probably time to resurrect Sean's thread from last week and see
how we stand on the issues listed there.

Drew
