2014-07-01 23:44 GMT+02:00 Joel Nothman <[email protected]>:
> Calculating TfIdf really isn't that hard. It's much easier for you to do so
> while transforming that into DictVectorizer input than for the library to be
> everything to everyone.

Indeed. I just indexed 20news in ES, then did

    $ curl -XGET 'http://localhost:9200/20news/post/1/_termvector?pretty=true' -d '{
        "fields" : ["text"],
        "offsets" : true,
        "payloads" : true,
        "positions" : true,
        "term_statistics" : true,
        "field_statistics" : true
      }' > 1.json

etc. for three documents. If I'm not mistaken, the following is tf-idf from ES
in 10 lines of code, not counting imports:

    import json
    from math import log
    from sklearn.feature_extraction import DictVectorizer

    n_docs = 11314  # num. docs in 20news corpus
    hits = [json.load(open(f)) for f in ["1.json", "2.json", "3.json"]]

    def tfidf(tf, df):
        idf = log(n_docs / float(df))
        return tf * idf

    def terms_from_es_json(doc):
        terms = doc["term_vectors"]["text"]["terms"]
        return {k: tfidf(v["term_freq"], v["doc_freq"]) for k, v in terms.items()}

    v = DictVectorizer()
    X_tfidf = v.fit_transform(terms_from_es_json(hit) for hit in hits)

(This can obviously be improved by doing everything in Python, but I can't
currently figure out how to get term vectors from the ES Python client. It
tells me I'm passing 4 arguments where I should have been passing 4, which is
of course a stupid mistake but I don't know what the right value of 4 is
today.)

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
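
For readers without an ES instance at hand, the conversion step in the message can be tried on a hand-written stand-in for one term-vector response. This is only a sketch: `sample_hit` and its counts are made up, it contains just the keys the message's code reads (`term_vectors`, `text`, `terms`, `term_freq`, `doc_freq`), and the `n_docs` and `field` parameters are additions for illustration, not part of the original code.

```python
from math import log

N_DOCS = 11314  # number of documents in the 20news corpus, as in the message

# Hypothetical stand-in for one parsed _termvector response. Only the keys
# read below are included, and the counts are invented for illustration.
sample_hit = {
    "term_vectors": {
        "text": {
            "terms": {
                "graphics": {"term_freq": 3, "doc_freq": 500},
                "the": {"term_freq": 12, "doc_freq": 11314},
            }
        }
    }
}


def tfidf(tf, df, n_docs=N_DOCS):
    """Unsmoothed tf-idf: term frequency times log(n_docs / df)."""
    return tf * log(n_docs / df)


def terms_from_es_json(doc, field="text"):
    """Map each term of one document to its tf-idf weight."""
    terms = doc["term_vectors"][field]["terms"]
    return {t: tfidf(s["term_freq"], s["doc_freq"]) for t, s in terms.items()}


weights = terms_from_es_json(sample_hit)
```

With this unsmoothed idf, a term that occurs in every document (here "the") gets weight 0, while rarer terms get positive weights. A list of such dicts, one per document, is the kind of input `DictVectorizer.fit_transform` accepts.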
