2014-07-01 23:44 GMT+02:00 Joel Nothman <[email protected]>:
> Calculating TfIdf really isn't that hard. It's much easier for you to do so
> while transforming that into DictVectorizer input than for the library to be
> everything to everyone.

Indeed. I just indexed 20news in ES, then did


$ curl -XGET 'http://localhost:9200/20news/post/1/_termvector?pretty=true' -d '{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}' > 1.json


etc. for three documents. If I'm not mistaken, the following is tf-idf
from ES in 10 lines of code, not counting imports:


import json
from math import log
from sklearn.feature_extraction import DictVectorizer

n_docs = 11314    # num. docs in 20news corpus
# One term-vector response per document, as saved by curl above.
hits = [json.load(open(f)) for f in ["1.json", "2.json", "3.json"]]

def tfidf(tf, df):
    # Plain, unsmoothed idf: log(N / df).
    idf = log(n_docs / float(df))
    return tf * idf

def terms_from_es_json(doc):
    # Map each term of the "text" field to its tf-idf weight.
    terms = doc["term_vectors"]["text"]["terms"]
    return {k: tfidf(v["term_freq"], v["doc_freq"]) for k, v in terms.items()}

# DictVectorizer turns the {term: weight} dicts into a sparse matrix.
v = DictVectorizer()
X_tfidf = v.fit_transform(terms_from_es_json(hit) for hit in hits)
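
To see what the weighting does without a running ES instance, here is a
small sketch on a hand-made response. The terms and counts are invented
for illustration; only the nesting (term_vectors / text / terms, with
term_freq and doc_freq) follows the code above.

```python
from math import log

n_docs = 11314  # same corpus size as above

# Hypothetical stand-in for one ES term-vector response.
# The terms and numbers are made up, not taken from 20news.
fake_hit = {
    "term_vectors": {
        "text": {
            "terms": {
                "the": {"term_freq": 12, "doc_freq": 11314},
                "atheism": {"term_freq": 3, "doc_freq": 100},
            }
        }
    }
}

def tfidf(tf, df):
    # Unsmoothed idf, as in the snippet above.
    return tf * log(n_docs / float(df))

def terms_from_es_json(doc):
    terms = doc["term_vectors"]["text"]["terms"]
    return {k: tfidf(v["term_freq"], v["doc_freq"]) for k, v in terms.items()}

weights = terms_from_es_json(fake_hit)
```

With this unsmoothed idf, a term that occurs in every document gets
weight 0 no matter how often it appears, while the rarer term keeps a
positive weight.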


(This can obviously be improved by doing everything in Python, but I
can't currently figure out how to get term vectors from the ES Python
client. It tells me I'm passing 4 arguments where I should have been
passing 4, which is of course a stupid mistake but I don't know what
the right value of 4 is today.)
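
One possible way around the client, as a rough sketch only: send the
same request as the curl command above using the standard library. This
assumes Python 3 and the same local ES instance, index and type as in
the curl call. The names ES_URL, build_termvector_request and
fetch_termvector are made up here, and the sketch has not been checked
against any particular ES version.

```python
import json
from urllib.request import Request, urlopen

# Same endpoint as the curl command; {} is filled with the document id.
ES_URL = "http://localhost:9200/20news/post/{}/_termvector"

def build_termvector_request(doc_id):
    # Only the options needed for tf-idf are kept; offsets, payloads
    # and positions from the curl call are left out.
    body = {
        "fields": ["text"],
        "term_statistics": True,
        "field_statistics": True,
    }
    return Request(
        ES_URL.format(doc_id),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",  # mirrors curl -XGET with a request body
    )

def fetch_termvector(doc_id):
    # Needs a reachable ES instance; not called at import time.
    with urlopen(build_termvector_request(doc_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If it works against your ES, the dicts returned by fetch_termvector(1),
fetch_termvector(2), ... could be passed to terms_from_es_json in place
of the json.load results.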

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general