2012/9/22 Ark <[email protected]>:
> Hello,
> I am trying to classify a large document set with LinearSVC. I get good
> accuracy. However, I was wondering how to optimize the interface to this
> classifier. For example, if I have a predict interface that accepts the raw
> document,

You can use the Pipeline class to build a compound classifier that
binds a text feature extractor with a classifier to get a text
document classifier in the end.
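
As a minimal sketch of what I mean (the toy documents, labels and step
names below are invented for illustration, and TfidfVectorizer is just one
possible feature extractor):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
]
train_labels = ["animals", "animals", "finance", "finance"]

# The pipeline chains the feature extractor and the classifier, so the
# compound object accepts raw text in both fit() and predict().
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC()),
])
text_clf.fit(train_docs, train_labels)

# Prediction takes raw documents; vectorization happens inside the pipeline.
predicted = text_clf.predict(["my cat chased the dogs"])
print(predicted)
```

The fitted pipeline can then be stored and reused as a single object
instead of handling the vectorizer and the classifier separately.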

> and uses a precomputed classifier object, the time taken to predict is
> non-trivial. In my case vectorizing the document took about 7s but predicting
> only about 0.46s.

7s is very long. How long is your text document in bytes? Maybe you
could consider only the first few kilobytes of each document and ignore
the remaining text at test time (while still using the complete
documents at training time).
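
A rough sketch of the truncation idea, assuming a hypothetical budget of
4096 characters (the helper name and the budget are mine, and it counts
characters rather than bytes, so tune it for your data):

```python
# Hypothetical budget: how much of each document to keep at test time.
MAX_CHARS = 4096


def truncate(text, max_chars=MAX_CHARS):
    """Return at most the first max_chars characters of text."""
    return text[:max_chars]


# A long dummy document standing in for a real one.
long_doc = "word " * 100000
short_doc = truncate(long_doc)

print(len(long_doc), len(short_doc))
```

At prediction time you would pass truncate(raw_document) to the
classifier instead of the full text, while still fitting on the complete
documents. Whether accuracy holds up is something to check on your data.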

You should also probably profile your script to understand what's
taking so long. For instance you can use:

  http://www.vrplumber.com/programming/runsnakerun/
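
If you would rather start with the standard library, here is a small
sketch using cProfile and pstats. The vectorize function is a hypothetical
stand-in for your real vectorization step:

```python
import cProfile
import io
import pstats


def vectorize(docs):
    # Hypothetical stand-in for the real (slow) vectorization step.
    return [doc.lower().split() for doc in docs]


docs = ["Some Example Text to tokenize"] * 1000

# Profile only the call we are interested in.
profiler = cProfile.Profile()
profiler.enable()
result = vectorize(docs)
profiler.disable()

# Collect the 10 most expensive entries, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Running "python -m cProfile -o out.prof yourscript.py" writes the same
kind of statistics to a file, which RunSnakeRun can then display.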

> Hence the question is how to efficiently scale this type of model.
> [e.g. Is there a way to store the vocabulary so that, whenever a new document
> prediction request is made, the tfidf transform uses this stored vocabulary?]

What do you mean by vocabulary? The vectorizer class already has a
`vocabulary_` attribute that maps the token string names to feature
integer indices as explained in the documentation:

  http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

But the performance critical part is probably tokenizing the text in
the first place.
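
To illustrate, here is a small sketch (toy documents invented for the
example) that inspects `vocabulary_` on a fitted TfidfVectorizer and reuses
the fitted object for a new document. Using pickle for storage is my
assumption here, not something the vectorizer requires, and pickles are
generally tied to the scikit-learn version that produced them:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training documents, invented for illustration only.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

# Mapping from token string to feature (column) index.
vocab = vectorizer.vocabulary_
print(sorted(vocab))

# Store the fitted vectorizer (vocabulary and IDF weights) and load it back.
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

# transform() reuses the stored vocabulary; nothing is refitted here.
X_new = restored.transform(["a new cat document"])
print(X_new.shape)
```

Note that transform() still has to tokenize the incoming text, so storing
the vocabulary does not by itself remove that cost.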

Also note that the master branch of scikit-learn has received some
performance optimizations for this class, if I am not mistaken.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
