On Thu, Feb 4, 2010 at 8:13 PM, Drew Farris <drew.far...@gmail.com> wrote:

> On Thu, Feb 4, 2010 at 7:28 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Since I was converting vectorization to sequence files, I was going to
> > change the Lucene driver to write the dictionary to a sequence file
> > instead of a tab-separated text file. I will also change the cluster
> > dumper to read the dictionary from the sequence file.
>
> Sounds good.
>
> > An Iterator<Writable, Writable> interface, where a SequenceFile
> > reader/writer is one implementation and a tab-separated file
> > reader/writer is another.
>
> I like this, but how about also providing a utility to convert from the
> tab-delimited dict format to the SequenceFile format? That way there's a
> migration path for old datasets.
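
A small standalone converter would probably do; something like this sketch,
assuming the old dictionary has one term<TAB>id entry per line (the real
layout may carry more columns, so adjust the parsing accordingly):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DictionaryMigrator {

  // args[0] = local tab-delimited dictionary, args[1] = output SequenceFile path
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, IntWritable.class);
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t");
        // key = term, value = integer id (assumed to be the last column)
        writer.append(new Text(fields[0]),
            new IntWritable(Integer.parseInt(fields[fields.length - 1])));
      }
    } finally {
      writer.close();
      reader.close();
    }
  }
}

It could also sit behind the Iterator<Writable, Writable> interface above as
the tab-separated implementation.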
>
> > 2nd thing:
> > Lucene seems too slow for querying the dictionary during vectorization:
> > 1 x 6-hour m/r as opposed to 2 x 1.5-hour on Wikipedia, i.e. a double
> > read of the Wikipedia dump with a hashmap is faster than a single read
> > using a Lucene index.
>
> Is the double-read approach described in one of the previous threads
> discussing this issue? Just curious how it works.
>
> > 3rd thing:
> > I am planning to convert the launcher code to use ToolRunner. Anyone
> > volunteer to help me with that?
>
> Sure, I can help out. What classes need to be updated? I've patched
> the clustering code in the past, so that's probably a natural start.
> Sean, I'll take a look at AbstractJob and what would be involved in
> re-using it in the clustering code.
>
> With ToolRunner, we get GenericOptionsParser for free, and the
> launcher classes must implement Tool and Configurable, right?
> ToolRunner is specific to the 0.20 api, isn't it?
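
For reference, the minimal shape would be roughly the following (ExampleDriver
is just a placeholder name; extending Configured covers the Configurable part,
and ToolRunner applies GenericOptionsParser before calling run()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // set up and submit the job here; -D options from the command line
    // have already been folded into conf by GenericOptionsParser
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new ExampleDriver(), args));
  }
}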
>
> I did notice that Eclipse was complaining about GenericOptionsParser
> last night because commons-cli 1.x wasn't available. I had to remove
> its exclusion in the parent pom to get things to work properly. Has
> anyone else run into this, or is this something funky in my environment?
>
> > 4th thing:
> > Any thoughts on how we can integrate the output of the n-gram map/reduce
> > to generate vectors from the dataset?
>
> So are you speaking of n-grams in general, or the output of the colloc
> work? I suppose I should wrap up the process of writing the top
> collocations to a file that can be read into a bloom filter, which can
> then be integrated into the phase of the document vectorization process
> that performs tokenization. The document vectorization code could use the
> shingle filter to produce n-grams and emit those that pass the bloom
> filter.
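
As a rough sketch of that last step, the tokenization phase could keep only
the n-grams that hit the bloom filter, something like this (bigrams are built
by hand here purely for illustration; the real code would presumably use the
shingle filter, and the filter is assumed to be loaded from the
top-collocations output):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class NGramEmitter {

  private final BloomFilter topCollocations;

  public NGramEmitter(BloomFilter topCollocations) {
    this.topCollocations = topCollocations;
  }

  /** Returns only those bigrams of the document that are known top collocations. */
  public List<String> keepTopBigrams(List<String> tokens) {
    List<String> kept = new ArrayList<String>();
    for (int i = 0; i < tokens.size() - 1; i++) {
      String bigram = tokens.get(i) + ' ' + tokens.get(i + 1);
      if (topCollocations.membershipTest(new Key(bigram.getBytes()))) {
        kept.add(bigram);
      }
    }
    return kept;
  }
}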
>
> There's some feedback I'm looking for on MAHOUT-242 related to this that
> would be helpful: questions about the best way to produce the set of top
> collocations.
>
> Robin, have you considered adding a step to the document vectorization
> process that would produce output that's a token stream instead of a
> vector?
>
> Instead of:
> Document Directory -> Document Sequence File
> Document Sequence File -> Document Vectors + Dictionary
>
> ...do:
> Document Directory -> Document Sequence File
> Document Sequence File -> Document Token Streams
> Document Token Streams -> Document Vectors + Dictionary
>
OK, I will work on this job.
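
As a first rough cut, that pass could be a simple map-only job along these
lines (the lowercase/whitespace split is only a stand-in for whatever Lucene
Analyzer is configured, and the output value type is still open; a joined
Text is used here just to keep the sketch short):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenStreamMapper extends Mapper<Text, Text, Text, Text> {

  @Override
  protected void map(Text docId, Text rawText, Context context)
      throws IOException, InterruptedException {
    // stand-in tokenization: lowercase and split on whitespace
    String[] tokens = rawText.toString().toLowerCase().split("\\s+");
    StringBuilder joined = new StringBuilder();
    for (String token : tokens) {
      if (joined.length() > 0) {
        joined.append(' ');
      }
      joined.append(token);
    }
    context.write(docId, new Text(joined.toString()));
  }
}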
Also, the partial vector merger could be reused by colloc when creating
ngram-only vectors, but we need to keep adding to the dictionary file. If you
can work on a dictionary merger + chunker, that would be great. I think we can
do this integration quickly.
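
For the merger + chunker, roughly this is what I have in mind, purely as a
sketch (the class name, the fixed chunk size, and the IntWritable ids are
placeholders, and it assumes the partial dictionaries are keyed by Text terms):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DictionaryMergerSketch {

  private static final int CHUNK_SIZE = 100000; // terms per chunk, arbitrary

  public static void merge(Configuration conf, Path[] partialDicts, Path outputDir)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Map<String, Integer> ids = new HashMap<String, Integer>();
    int nextId = 0;
    int chunk = 0;
    SequenceFile.Writer writer = newChunk(fs, conf, outputDir, chunk);

    for (Path dict : partialDicts) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, dict, conf);
      Text term = new Text();
      while (reader.next(term)) {          // keys only; partial values are ignored
        if (!ids.containsKey(term.toString())) {
          ids.put(term.toString(), nextId);
          writer.append(term, new IntWritable(nextId));
          nextId++;
          if (nextId % CHUNK_SIZE == 0) {  // roll over to the next chunk file
            writer.close();
            writer = newChunk(fs, conf, outputDir, ++chunk);
          }
        }
      }
      reader.close();
    }
    writer.close();
  }

  private static SequenceFile.Writer newChunk(FileSystem fs, Configuration conf,
      Path dir, int chunk) throws Exception {
    return SequenceFile.createWriter(fs, conf,
        new Path(dir, "dictionary.chunk-" + chunk), Text.class, IntWritable.class);
  }
}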



> This way, something like the colloc/n-gram process would read the
> output of the second pass (Document Token Streams file) instead of
> having to re-tokenize everything simply to obtain token streams.
>
> > 5th thing, the release:
> > Shall we fix a date for the 0.3 release? We should look to improve
> > quality in this release, i.e. in terms of running the parts of the code
> > each of us hasn't tested (for example, I have run Bayes and FP-Growth
> > many times, so I will focus on running the clustering algorithms and
> > trying out various options to see if there are any issues) and providing
> > feedback so that the one who wrote it can help tweak it.
>
> It is probably time to resurrect Sean's thread from last week and see
> how we stand on the issues listed there.
>
> Drew
>
