Hi All

I have been working on a news classification project using documents
indexed in Elasticsearch as my training set. The documents are analyzed
with Lucene analyzers, so I have access to their term vectors (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-termvectors.html
).

I have written my own transformer/vectorizer that takes a batch of term
vectors from the search index and converts them into a tf-idf weighted
training set, which I then use to train a NearestCentroid classifier. I
could not use TfidfVectorizer/TfidfTransformer, since those compute the
idf from the given corpus. In my case, I have a very large document
corpus with only a small labelled set of exemplar documents, and I want
the idf to be based on document frequency counts computed over the
larger corpus rather than over the smaller set of exemplars, which is
not a representative sample.

Would it be useful to contribute a transformer/vectorizer that takes a
set of term vectors as input? If so, I would love to work with you
further and contribute this code to sklearn.

I imagine this transformer would be useful to others who use Lucene for
text analysis: they already have access to term vectors and part of the
pipeline, but might still want the various weighting schemes available
in TfidfVectorizer (e.g. norm, smooth_idf, sublinear_tf).

Thanks !
Geetu
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general