2012/9/24 Ark <[email protected]>:
> Olivier Grisel <olivier.grisel@...> writes:
>
>> You can use the Pipeline class to build a compound classifier that
>> binds a text feature extractor with a classifier to get a text
>> document classifier in the end.
>>
> Done!
>
>>
>> 7s is very long. How long is your text document in bytes ?
> The text documents are around 50kB.

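
For reference, here is a minimal sketch of the kind of Pipeline described in the quoted text above. The toy documents, the labels, and the choice of MultinomialNB as the classifier are assumptions made only for illustration; any other scikit-learn classifier could take its place, and this is not necessarily what your code looks like.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, made up for illustration only.
train_docs = [
    "buy cheap pills now",
    "cheap pills special offer",
    "meeting agenda for tomorrow",
    "please review the meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Bind the text feature extractor and the classifier into a single
# estimator that accepts raw text documents directly.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),  # assumed classifier, swap in your own
])
text_clf.fit(train_docs, train_labels)

# The same object handles feature extraction and prediction at test time.
predicted = text_clf.predict(["cheap pills offer", "notes from the meeting"])
print(predicted)
```

Calling predict on the pipeline runs the TF-IDF transform and the classifier in one step, so timing that single call on one document is a simple way to check where the 7s goes.
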
That should not take 7s to extract TF-IDF features for a single 50kB
document. There must be a bug. Can you please put a minimal code
snippet and an example document that reproduce the issue in a gist?

http://gist.github.com

>> Maybe you could only consider the first kilobytes of the documents
>> and ignore the remaining text at testing time (while using the
>> complete documents at training time).
>>
>
> Er, I think I am missing something here. If I consider only the first few
> kilobytes, wouldn't that mean that I lose the features in the rest of the
> document, which in turn might lead to a false match?

Yes, it's a trade-off between processing speed and accuracy. The size
threshold that works for your problem has to be evaluated empirically.
If you lose 0.01 in prediction accuracy but gain a 10x processing
speed-up, it might very well be worth doing.

But for small-ish 50kB documents it should not be useful. It is probably
useful when the documents are larger than 1MB each.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
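
PS: the speed versus accuracy trade-off of truncating documents at test time can be measured with a small harness along these lines. This is only a sketch: `truncate_text` and `evaluate` are hypothetical helper names, and `predict` stands for any callable that maps a list of documents to a list of labels (for instance the `predict` method of a fitted pipeline).

```python
import time


def truncate_text(text, max_bytes):
    """Keep about the first max_bytes bytes of text (as UTF-8).

    A multi-byte character cut in half at the boundary is dropped.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def evaluate(predict, docs, labels, max_bytes=None):
    """Return (accuracy, elapsed seconds) for predict on docs.

    If max_bytes is given, each document is truncated before prediction.
    """
    if max_bytes is not None:
        docs = [truncate_text(doc, max_bytes) for doc in docs]
    start = time.perf_counter()
    predicted = predict(docs)
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels), elapsed


# Stand-in for a real classifier, for illustration only.
def toy_predict(docs):
    return ["spam" if "pills" in doc else "ham" for doc in docs]


docs = ["buy now ... pills", "hello"]
labels = ["spam", "ham"]
acc_full, time_full = evaluate(toy_predict, docs, labels)
acc_trunc, time_trunc = evaluate(toy_predict, docs, labels, max_bytes=7)
print(acc_full, acc_trunc)
```

Running `evaluate` on a held-out set for a few values of `max_bytes` gives the accuracy and timing numbers needed to pick a threshold.
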
