I will look into this.

On Thu, Feb 18, 2010 at 3:42 PM, Loek Cleophas <[email protected]> wrote:

> Hi
>
> While playing around some more with the 20newsgroups example code for the
> Bayes classifiers, I ran into an oddity and what I presume is a bug.
>
> Instead of using (parts of) the 20 newsgroups data set, which is split
> nicely into one file per newsgroup, with the 'category, tab, tokens' line
> format, I generated such files from our company data set. What I did,
> though, was generate one file to train with and one to test with, so each
> file can contain lines with different categories, e.g.
>
> cars    Ferrari red ....
> animals cow cat dog ....
>
> In training, this works fine. In testing, it crashes TestClassifier with a
> null pointer exception. I presume that is because either the file name does
> not match category.txt for any category name, or because multiple
> categories are used inside a single file - but I also presume that
> neither should crash the thing :) It also raises the question: if the
> line format in the data files already contains the category, then why are
> the file names relevant at all? It seems redundant to me. Shouldn't
> TestClassifier merely take all .txt files in the input data directory and
> process their contents?
>
> Regards,
> Loek
>
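
For reference, the behaviour proposed above (take every .txt file in the input directory and trust the category on each line, not the file name) could be sketched roughly as follows. This is only an illustration and is not Mahout code: the class name LabeledLineReader and the method readDirectory are hypothetical, and the sketch assumes UTF-8 files in the 'category, tab, tokens' format described in the report. Lines without a category are skipped rather than causing an exception.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Mahout.
final class LabeledLineReader {

    private LabeledLineReader() {
    }

    /**
     * Reads every *.txt file directly inside dir and groups the token
     * strings by the category found at the start of each line. The file
     * name is never used to decide the category.
     */
    static Map<String, List<String>> readDirectory(Path dir) throws IOException {
        Map<String, List<String>> byCategory = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    int tab = line.indexOf('\t');
                    if (tab <= 0) {
                        // No tab, or an empty category: skip the line instead of failing.
                        continue;
                    }
                    String category = line.substring(0, tab);
                    String tokens = line.substring(tab + 1).trim();
                    byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(tokens);
                }
            }
        }
        return byCategory;
    }
}
```

With a single test file containing the two example lines from the report, readDirectory would return one entry for "cars" and one for "animals", regardless of what the file is called.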
