Re: Clustering Demo

Karl Wettin Sat, 24 May 2008 09:11:25 -0700


24 maj 2008 kl. 13.13 skrev Grant Ingersoll:

These are interesting. Perhaps you want to commit LUCENE-725?

If I end up using it for this, then I will. I've never tried it out and there are no test cases, so I have no clue how well it works. Nor are there any demonstrations of the features in the patch, but I suppose our demo could be used to produce that.

I'll train it with the last few paragraphs on a per-author basis to see how well it works.

We might want to wash out stuff like "24 maj 2008 kl. 13.13 skrev Grant Ingersoll" too. That should not be too hard to figure out using the headers, if the data is stored in a way that allows for navigation in the thread.
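A minimal sketch of that idea, assuming the participant names have already been collected from the From headers of the thread. The class and method names are hypothetical, and the heuristic (a short line that ends with a colon and mentions a known participant) is only an illustration, not something taken from LUCENE-725 or the demo:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: removes quote-attribution lines from a message body.
final class AttributionStripper {

    // Returns the body without the lines that look like quote attributions.
    static String strip(String body, List<String> participants) {
        List<String> kept = new ArrayList<>();
        for (String line : body.split("\n", -1)) {
            if (!isAttribution(line.trim(), participants)) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }

    // Assumed heuristic: a short line ending in ':' that names a participant.
    static boolean isAttribution(String line, List<String> participants) {
        if (!line.endsWith(":") || line.length() > 120) {
            return false;
        }
        for (String name : participants) {
            if (line.contains(name)) {
                return true;
            }
        }
        return false;
    }
}
```

Matching on names taken from the headers avoids having to recognise every mail client's localised wording ("skrev", "wrote", and so on), at the cost of missing attributions that abbreviate or omit the name.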

But I'm honestly not sure whether these are preemptive overkill solutions. Perhaps the algorithms automatically penalise unrelated text when given enough semiotic data. Perhaps attribute selection does the same job in a shorter time.

I was wondering whether we should consider asking Lucene to put up an Analyzer-only jar (i.e. a separate jar that combines the Analyzer/TokenStream definitions with the contrib Analyzers package). Of course, we may have uses for the rest of Lucene as well, so maybe not.



To me that just sounds like more work for both projects.

It'd be great if we managed to put all future text analysis improvements as patches in Lucene rather than Mahout, but in the long run I think we'll be branching quite a bit of the Lucene analysis code to avoid spending time writing backwards compatible code to support Lucene users rather than Mahout users. See LUCENE-889.



     karl
