2012/9/24 Ark <[email protected]>:
> Olivier Grisel <olivier.grisel@...> writes:
>
>> You can use the Pipeline class to build a compound classifier that
>> binds a text feature extractor with a classifier to get a text
>> document classifier in the end.
>>
>  Done!
>
>>
>> 7s is very long. How long is your text document in bytes ?
> The text documents are around 50kB.

It should not take 7s to extract TF-IDF features for a single 50kB
document. There must be a bug. Can you please put a minimal code
snippet + example document that reproduces the issue in a gist?
http://gist.github.com
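
A reproduction script could look roughly like the sketch below. It is
only an illustration: the documents are synthetic, and the helper
make_doc and the vocabulary are made up for this example. The real
script should load the actual documents that show the 7s delay.

```python
import random
import time

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synthetic vocabulary standing in for the real corpus.
VOCAB = ["word%d" % i for i in range(2000)]


def make_doc(n_bytes, seed):
    """Build a random ASCII document of roughly n_bytes characters."""
    rng = random.Random(seed)
    words, size = [], 0
    while size < n_bytes:
        word = rng.choice(VOCAB)
        words.append(word)
        size += len(word) + 1  # +1 for the separating space
    return " ".join(words)


# Fit once on a small training set, as would happen at training time.
train_docs = [make_doc(50000, seed) for seed in range(20)]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

# Time the feature extraction for one new document of about 50kB.
new_doc = make_doc(50000, seed=999)
t0 = time.perf_counter()
X = vectorizer.transform([new_doc])
elapsed = time.perf_counter() - t0
print("transform of %d characters took %.3fs" % (len(new_doc), elapsed))
```

Timing transform separately from fit (and from any file loading) helps
to tell which step is actually slow.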

>> Maybe you
>> could only consider the first kilobytes of the documents and ignore
>> the remaining text at test time (while using the complete documents
>> at training time).
>>
>
> Er, I think I am missing something here. If I consider only the first
> few kilobytes, wouldn't that mean that I lose the features in the rest
> of the document, which in turn might lead to false matches?

Yes, it's a trade-off between processing speed and accuracy. It has to
be evaluated empirically to know what size threshold should be used
for your problem in practice. If you lose 0.01 in prediction accuracy
but gain a 10x processing speedup, it might very well be worth doing.
But for small-ish 50kB documents it should not be useful. It is
probably useful when the documents are larger than 1MB each.
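
As a rough sketch of how such an evaluation could be set up: train a
Pipeline on the complete documents, then compare the accuracy on full
and on truncated test documents. Everything below is illustrative. The
data is synthetic, make_doc and truncate are made-up helpers, MAX_CHARS
is an arbitrary threshold to be tuned, and MultinomialNB is just one
possible classifier. Note that the slice counts characters, not bytes.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

MAX_CHARS = 2000  # hypothetical threshold, to be tuned empirically


def make_doc(label, n_words, rng):
    """Toy document: mostly shared words plus some class-specific ones."""
    own = ["c%d_w%d" % (label, i) for i in range(50)]
    shared = ["common%d" % i for i in range(200)]
    return " ".join(
        rng.choice(own) if rng.random() < 0.2 else rng.choice(shared)
        for _ in range(n_words)
    )


def truncate(docs, max_chars):
    """Keep only the first max_chars characters of each document."""
    return [doc[:max_chars] for doc in docs]


rng = random.Random(0)
train_labels = [i % 2 for i in range(40)]
train_docs = [make_doc(y, 2000, rng) for y in train_labels]
test_labels = [i % 2 for i in range(20)]
test_docs = [make_doc(y, 2000, rng) for y in test_labels]

# Text feature extractor + classifier bound into one document classifier.
clf = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
clf.fit(train_docs, train_labels)  # complete documents at training time

pred_full = clf.predict(test_docs)
pred_trunc = clf.predict(truncate(test_docs, MAX_CHARS))
acc_full = accuracy_score(test_labels, pred_full)
acc_trunc = accuracy_score(test_labels, pred_trunc)
print("accuracy full: %.3f, truncated: %.3f" % (acc_full, acc_trunc))
```

Repeating this for several values of MAX_CHARS, and timing predict for
each, gives the speed versus accuracy curve for a given dataset.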

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
