Hi benglish,
> 1. When making the index file (according to my previous post), and running
> the code for the first time, I can see that in the lines:
>
>   ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("content"));
>   String classified = result.getAssignedClass().utf8ToString();
>
> "classified" is set to "write.block" and it causes the algorithm to find
> many non-matching pairs!!! Could you tell me what I can do to overcome this
> issue? I made the index for the second time and the issues got solved, but I
> want to know why it does not work by the first index file!!!!

I don't know why "write.block" is returned as the class.
Are you sure the first Lucene index was built correctly?
If you are not sure, why not use Solr to create the index?

> 2. As far as I have understood, your test dataset is just your training
> dataset, am I right? If not, should I make an index file for the test
> dataset, too?

Yes, your understanding is correct. But you don't need to index your
test set in order to categorize it. Once you have called train(), you can
call assignClass("your test content string here") on any string to get its
category tag.
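
To illustrate the "train once, then classify plain strings" workflow, here is a minimal sketch. It does not use Lucene's classes: ToyClassifier, its token-counting scoring, and the sample categories are hypothetical stand-ins that only mirror the train() / assignClass() calling pattern, so that the example runs with the JDK alone.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a trained classifier; this is NOT Lucene's API.
class ToyClassifier {
    // category -> (token -> count), filled in by train()
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // "Training" here only counts tokens per category.
    // Each row of labeledDocs is {category, content}.
    void train(String[][] labeledDocs) {
        for (String[] doc : labeledDocs) {
            Map<String, Integer> tokenCounts =
                counts.computeIfAbsent(doc[0], k -> new HashMap<>());
            for (String token : tokenize(doc[1])) {
                tokenCounts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Classifies a raw string; the text is never indexed or stored.
    // Returns null if train() has not been called. Ties are broken arbitrarily.
    String assignClass(String text) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int score = 0;
            for (String token : tokenize(text)) {
                score += e.getValue().getOrDefault(token, 0);
            }
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("\\W+");
    }

    public static void main(String[] args) {
        ToyClassifier classifier = new ToyClassifier();
        classifier.train(new String[][] {
            {"sports", "the team won the football match"},
            {"tech", "the new search library indexes documents"}
        });
        // The test content is passed as a string, not read from an index.
        System.out.println(classifier.assignClass("football team")); // prints "sports"
    }
}
```

With the real classifier the pattern is the same: only the training documents live in the index, and each test document is handed over as a string.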
Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html