2012/9/22 Ark <[email protected]>:
> Hello,
> I am trying to classify a large document set with LinearSVC. I get good
> accuracy. However I was wondering how to optimize the interface to this
> classifier. For e.g.If I have an predict interface that accepts the raw
> document,

You can use the Pipeline class to build a compound classifier that binds a
text feature extractor to a classifier, so that you get a text document
classifier in the end.

> and uses a precomputed classifier object, the time to predict taken is non-
> trivial. In my case vectorizing the document took about 7s but predicting only
> about 0.46s.

7s is very long. How long is your text document in bytes? Maybe you could
consider only the first kilobytes of each document and ignore the remaining
text at test time (while using the complete documents at training time).

You should also probably profile your script to understand what's taking so
long. For instance you can use:

  http://www.vrplumber.com/programming/runsnakerun/

> Hence the question is how to efficiently scale this type of model.
> [e.g. Is there a way to store vocabulary, and whenever a new document predict
> request is made, the tfidf transform use this stored vocabulary?]

What do you mean by vocabulary? The vectorizer class already has a
`vocabulary_` attribute that maps the token strings to integer feature
indices, as explained in the documentation:

  http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

But the performance-critical part is probably tokenizing the text in the
first place. Also note that the master branch of scikit-learn has received
some performance optimizations on this class, if I am not mistaken.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
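
For reference, here is a minimal sketch of the Pipeline approach described in the reply above. It is only an illustration under assumptions, not the original poster's setup: the toy documents, the `MAX_CHARS` cutoff and the `predict_raw` helper are hypothetical names, TfidfVectorizer is assumed as the feature extractor, and truncation is done by characters as a rough stand-in for "the first kilobytes". Pickling the fitted pipeline keeps the vectorizer's `vocabulary_` together with the classifier, so nothing has to be refitted when a prediction request comes in.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data (hypothetical; stands in for the real document set).
train_docs = [
    "the striker scored a late goal in the match",
    "the team won the league after a tense match",
    "the new processor doubles the memory bandwidth",
    "the laptop ships with a faster processor and more memory",
]
train_labels = ["sports", "sports", "tech", "tech"]

# Bind the text feature extractor and the classifier into one estimator
# that accepts raw documents.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
text_clf.fit(train_docs, train_labels)

# The fitted vectorizer already stores the token -> feature index mapping.
vocabulary = text_clf.named_steps["tfidf"].vocabulary_

# Serialize the whole fitted pipeline (vocabulary included) and restore it,
# as a prediction service would do once at start-up.
restored = pickle.loads(pickle.dumps(text_clf))

MAX_CHARS = 2000  # hypothetical cutoff, applied at prediction time only


def predict_raw(document, model=restored, max_chars=MAX_CHARS):
    """Classify one raw document, looking only at its first max_chars characters."""
    return model.predict([document[:max_chars]])[0]


prediction = predict_raw("a faster processor with more memory")
print(prediction)
```

Whether truncating helps depends on where the time actually goes, so it is worth profiling the vectorizing step first, for instance with the standard library's cProfile module.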
