(the 4th one is typically a kwarg it didn't care about)

On Tuesday, July 1, 2014, Lars Buitinck <[email protected]> wrote:

> 2014-07-01 23:44 GMT+02:00 Joel Nothman <[email protected]>:
> > Calculating TfIdf really isn't that hard. It's much easier for you to do so
> > while transforming that into DictVectorizer input than for the library to be
> > everything to everyone.
>
> Indeed. I just indexed 20news in ES, then did
>
>
> $ curl -XGET 'http://localhost:9200/20news/post/1/_termvector?pretty=true' \
>     -d '{
>   "fields" : ["text"],
>   "offsets" : true,
>   "payloads" : true,
>   "positions" : true,
>   "term_statistics" : true,
>   "field_statistics" : true
> }' > 1.json
>
>
> etc. for three documents. If I'm not mistaken, the following is tf-idf
> from ES in 10 lines of code, not counting imports:
>
>
> import json
> from math import log
> from sklearn.feature_extraction import DictVectorizer
>
> n_docs = 11314    # num. docs in 20news corpus
> hits = [json.load(open(f)) for f in ["1.json", "2.json", "3.json"]]
>
> def tfidf(tf, df):
>     idf = log(n_docs / float(df))
>     return tf * idf
>
> def terms_from_es_json(doc):
>     terms = doc["term_vectors"]["text"]["terms"]
>     return {k: tfidf(v["term_freq"], v["doc_freq"])
>             for k, v in terms.items()}
>
> v = DictVectorizer()
> X_tfidf = v.fit_transform(terms_from_es_json(hit) for hit in hits)
>
>
> (This can obviously be improved by doing everything in Python, but I
> can't currently figure out how to get term vectors from the ES Python
> client. It tells me I'm passing 4 arguments where I should have been
> passing 4, which is of course a stupid mistake but I don't know what
> the right value of 4 is today.)
>
>
> ------------------------------------------------------------------------------
> Open source business process management suite built on Java and Eclipse
> Turn processes into business applications with Bonita BPM Community Edition
> Quickly connect people, data, and systems into organized workflows
> Winner of BOSSIE, CODIE, OW2 and Gartner awards
> http://p.sf.net/sfu/Bonitasoft
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
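
For anyone who wants to sanity-check the weighting in the quoted snippet without an ES instance or scikit-learn at hand, here is a minimal standard-library sketch. The `doc` dictionary is a hypothetical, hand-written fragment that only mimics the keys the quoted code reads (`term_vectors`, `text`, `terms`, `term_freq`, `doc_freq`); the terms and counts are made up, and only `n_docs` is taken from the message.

```python
from math import log

n_docs = 11314  # corpus size quoted in the message above

# Hypothetical response fragment: same key layout as the quoted code expects,
# but the terms and their counts are invented for illustration.
doc = {
    "term_vectors": {
        "text": {
            "terms": {
                "space": {"term_freq": 3, "doc_freq": 11314},
                "orbit": {"term_freq": 2, "doc_freq": 100},
            }
        }
    }
}

def tfidf(tf, df):
    # Unsmoothed idf, as in the quoted snippet: a term that occurs in
    # every document gets idf = log(1) = 0.
    return tf * log(n_docs / float(df))

def terms_from_es_json(doc):
    # Map each term to its tf-idf weight.
    terms = doc["term_vectors"]["text"]["terms"]
    return {k: tfidf(v["term_freq"], v["doc_freq"]) for k, v in terms.items()}

weights = terms_from_es_json(doc)
```

Each such dict is one row of DictVectorizer input. Note that with this unsmoothed idf, a term appearing in all documents is weighted exactly zero, which may or may not be what you want.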