Also, it seems like we could make the Wikipedia example a bit more
generic by not restricting it to countries, right? The code appears
to just check whether a Category contains some entry in a list (in
the current case, countries) and then label the doc that way, but
there really isn't anything special about countries. For instance, I
could have categories by subject, e.g. History or Math, right?
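
Roughly what I have in mind, sketched in Python rather than taken from the
actual Mahout code. The function name label_for, the sample category
strings, and the case-insensitive substring match are all assumptions for
illustration, not a description of what the example really does:

```python
def label_for(doc_categories, labels):
    """Return the first label found inside any of the document's
    category strings, or None if no label matches.

    Assumption: matching is a case-insensitive substring test.
    """
    for category in doc_categories:
        lowered = category.lower()
        for label in labels:
            if label.lower() in lowered:
                return label
    return None


# Hypothetical label lists: the same function serves both.
countries = ["France", "India"]
subjects = ["History", "Math"]

# Hypothetical categories attached to one Wikipedia article.
cats = ["History of France", "19th-century conflicts"]

print(label_for(cats, countries))  # France
print(label_for(cats, subjects))   # History
```

Nothing in the matching step depends on the list holding countries, so
swapping in a subject list would be a configuration change only.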
-Grant
On Jul 15, 2009, at 5:01 PM, Robin Anil wrote:
Hi Grant,
For Bayes, the input is tab-separated flat files, with one document
per line: the label as the first word, followed by a tab, followed by
the flattened document.
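
A minimal sketch of that layout in Python, just to make it concrete. The
helper names to_bayes_line and parse_bayes_line are hypothetical, and
collapsing all whitespace to single spaces is my assumption of what
"flattened" means here:

```python
def to_bayes_line(label, text):
    """Build one input line: label, a tab, then the flattened document.

    Assumption: flattening collapses every run of whitespace (including
    newlines and tabs inside the document) into a single space, so the
    only tab left on the line is the separator after the label.
    """
    flattened = " ".join(text.split())
    return f"{label}\t{flattened}"


def parse_bayes_line(line):
    """Split one input line back into (label, document) at the first tab."""
    label, _, document = line.rstrip("\n").partition("\t")
    return label, document


line = to_bayes_line("History", "The French\nRevolution\tbegan in 1789.")
print(repr(line))  # 'History\tThe French Revolution began in 1789.'
print(parse_bayes_line(line))
```

The label is assumed to be a single word with no whitespace of its own.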
I will be travelling for the next 3 days, as I am relocating to my job
location, so I hope I will be able to give you that documentation by
Monday morning.
Robin
On Thu, Jul 16, 2009 at 1:02 AM, Grant Ingersoll wrote:
Hi Robin,
I have been looking at the classification stuff a bit more and am
wondering if we should switch to using Vectors now, since the name
could be the label and the values can contain weights, similar to
what we do for clustering.
Also, I was wondering if you could document the format used for the
input files now and the steps taken by the algorithms. I'm trying
to better understand the Wikipedia examples and also the HBase.
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search