Also, it seems like we could make the Wikipedia example a bit more generic by not restricting it to countries, right? The code appears to just check whether a Category contains some entry in a list (in the current case, countries) and then label the doc that way, but there really isn't anything special about countries. For instance, I could have categories by subject, e.g. History, Math, etc., right?
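
A minimal sketch of that idea, assuming the labeling really is just a substring match of each category against a list of candidate labels. The function name, the sample categories, and the label lists are all hypothetical and not taken from the Mahout code:

```python
def label_for_categories(categories, labels):
    """Return the first label found inside any category string, else None.

    Hypothetical helper: matching is a case-insensitive substring test,
    which is an assumption about how the example behaves.
    """
    for category in categories:
        lowered = category.lower()
        for label in labels:
            if label.lower() in lowered:
                return label
    return None


# Nothing here is specific to countries; any list of labels works.
countries = ["France", "Japan"]
subjects = ["History", "Mathematics"]

doc_categories = ["Category:History of France", "Category:19th century"]

by_country = label_for_categories(doc_categories, countries)
by_subject = label_for_categories(doc_categories, subjects)
print(by_country, by_subject)
```

The same document gets a country label or a subject label depending only on which list is passed in.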

-Grant

On Jul 15, 2009, at 5:01 PM, Robin Anil wrote:

Hi Grant,

For Bayes, the input is a tab-separated flat file with one document per line: the label is the first word, followed by a tab, followed by the flattened document. I will be travelling for the next 3 days, as I am relocating to my job location, so I hope I will be able to give you the documentation for this by Monday morning.
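
As a rough illustration of that line format (label, tab, flattened document), here is a sketch that writes and reads one such line. The helper names are hypothetical, and "flattened" is assumed to mean that all internal whitespace, including newlines and tabs, is collapsed to single spaces:

```python
def to_bayes_line(label, text):
    """Build one input line: label, a tab, then the flattened document.

    Assumption: the label is a single word with no whitespace, and
    flattening means collapsing every run of whitespace to one space.
    """
    if not label or any(ch.isspace() for ch in label):
        raise ValueError("label must be a single word without whitespace")
    flattened = " ".join(text.split())
    return f"{label}\t{flattened}"


def parse_bayes_line(line):
    """Split a line back into (label, document) at the first tab."""
    label, _, document = line.rstrip("\n").partition("\t")
    return label, document


line = to_bayes_line("history", "The  French\nRevolution\tbegan in 1789.")
label, document = parse_bayes_line(line)
print(repr(line))
```

Because the document is flattened before writing, the first tab on a line is always the separator between label and text.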

Robin

On Thu, Jul 16, 2009 at 1:02 AM, Grant Ingersoll <[email protected]> wrote:
Hi Robin,

I have been looking at the classification stuff a bit more and am wondering if we should switch to using Vectors now, since the name could be the label and the values can contain weights, similar to what we do for clustering.

Also, I was wondering if you could document the format currently used for the input files and the steps taken by the algorithms. I'm trying to better understand the Wikipedia examples and also the HBase.

-Grant





--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
