I am currently using TfidfVectorizer and SGDClassifier for document
classification with ~3000 categories.
(n_samples, n_features) = (14000, 400000)
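
For reference, my current setup looks roughly like this (a simplified
sketch; "documents" and "labels" stand in for my actual data, and the
parameter values are illustrative rather than tuned):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    # documents: list of raw text strings
    # labels: parallel list of category names (~3000 distinct values)
    clf = make_pipeline(
        TfidfVectorizer(sublinear_tf=True),
        SGDClassifier(loss="hinge", alpha=1e-5),
    )
    clf.fit(documents, labels)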

In my case, the decision that a particular document belonged to a
particular category was based on human observation (very initially)
and later on regexes. Each document is then post-processed for
information based on the category selected, so the choice of category
can affect how a few critical parts of the data are generated.
[Also note that the dataset is unbalanced as of now: a few categories
have 5 documents, whereas some go up to 60. I did not want to try to
balance this until the number of categories had been evaluated.]
Upon moving to the machine learning approach, we realised the issue
is misclassification, and in some cases even duplication of
categories. To get a rough estimate of the true number of categories,
I was thinking of a clustering approach like k-means. Since the number
of categories might be less than 3000, does this seem to be the
correct approach? Or if there is a better solution, I would certainly
appreciate pointers.
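
To make that concrete, this untested sketch is roughly what I had in
mind ("documents" and the candidate k values are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import silhouette_score

    X = TfidfVectorizer(sublinear_tf=True).fit_transform(documents)

    # Try several candidate category counts below the current ~3000
    # and compare silhouette scores; scoring on a sample keeps the
    # evaluation tractable at this scale.
    for k in (500, 1000, 2000, 3000):
        km = MiniBatchKMeans(n_clusters=k, random_state=0).fit(X)
        print(k, silhouette_score(X, km.labels_, sample_size=2000,
                                  random_state=0))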
Regards,
Ark



