I am currently using TfidfVectorizer and SGDClassifier for document classification with ~3000 categories; (n_samples, n_features) = (14000, 400000).
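For reference, the setup is roughly the following (a minimal sketch on toy data; the specific vectorizer and SGD parameters here are placeholders, not necessarily what I run in production):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    # Toy stand-ins for the real corpus (14000 docs, ~3000 categories).
    documents = [
        "invoice for server hardware purchase",
        "invoice for office furniture purchase",
        "password reset request for email account",
        "account locked, needs password reset",
    ]
    labels = ["invoice", "invoice", "password_reset", "password_reset"]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(sublinear_tf=True)),
        # hinge loss == a linear SVM fit by SGD; alpha is the regularization strength
        ("sgd", SGDClassifier(loss="hinge", alpha=1e-5)),
    ])
    clf.fit(documents, labels)
    print(clf.predict(["please reset my account password"]))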
In my case, the decision that a particular document belongs to a particular category was originally based on human observation, and later on regexes. Each document is then post-processed for information based on the category selected, so the choice of category can affect how a few critical parts of the data are generated.

[Also note that the dataset is currently unbalanced: a few categories have 5 documents, whereas others go up to 60. I did not want to try to balance this until the number of categories had been evaluated.]

Upon moving to the machine learning approach, we realised that the issue is misclassification, and in some cases even duplication of categories. To get a rough estimate of the true number of categories, I was thinking of a clustering approach like k-means; however, since the number of distinct categories might be less than 3000, does this seem to be the correct approach? Or if there is a better solution, I would certainly appreciate pointers.

Regards,
Ark
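P.S. Here is a minimal sketch of the clustering pass I have in mind, again on toy data (the vectorizer settings and the range of k are assumptions; on the real 14000 x 400000 matrix I would use MiniBatchKMeans for speed and scan k over a few hundred values):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import silhouette_score

    documents = [
        "invoice for server hardware purchase",
        "invoice for office furniture purchase",
        "password reset request for email account",
        "account locked, needs password reset",
        "new laptop request for new hire",
        "laptop replacement request, screen broken",
    ]

    X = TfidfVectorizer(sublinear_tf=True).fit_transform(documents)

    # Scan a range of k and keep the silhouette score for each; on the
    # real data this range would be something like range(500, 3000, 100).
    for k in range(2, 6):
        km = MiniBatchKMeans(n_clusters=k, random_state=0)
        cluster_ids = km.fit_predict(X)
        # cosine distance is usually more meaningful than euclidean for tf-idf
        print(k, silhouette_score(X, cluster_ids, metric="cosine"))

The idea would be to take the k with the best silhouette score as a rough estimate of the number of genuinely distinct categories, and to inspect clusters that mix several of the current labels as candidates for duplicated categories.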