The term vector output from Elasticsearch looks like this (Solr's is similar):

{
    "_id": "1",
    "_index": "twitter",
    "_type": "tweet",
    "_version": 1,
    "found": true,
    "term_vectors": {
        "text": {
            "field_statistics": {
                "doc_count": 2,
                "sum_doc_freq": 6,
                "sum_ttf": 8
            },
            "terms": {
                "test": {
                    "doc_freq": 2,
                    "term_freq": 3,
                    "ttf": 4
                },
                "twitter": {
                    "doc_freq": 2,
                    "term_freq": 1,
                    "ttf": 2
                }
            }
        }
    }
}

So for each field we get the term frequency (term_freq) and document
frequency (doc_freq) of every term, plus the field-level doc_count. We
need some combination of DictVectorizer pipelined with a kind of
TfidfTransformer that can compute tf-idf from the given JSON data.
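
To make the idea concrete, here is a rough sketch of the weighting step
in plain Python. It is only an illustration under some assumptions: the
function names (tfidf_from_term_vectors, l2_normalize) are made up for
this example, doc_count is taken as the number of documents, and the idf
formulas are the ones scikit-learn documents for TfidfTransformer
(smooth_idf on or off), not anything Lucene computes itself.

```python
import json
import math

# Trimmed copy of the response above; only the parts used here are kept.
RESPONSE = json.loads("""
{
    "term_vectors": {
        "text": {
            "field_statistics": {"doc_count": 2, "sum_doc_freq": 6, "sum_ttf": 8},
            "terms": {
                "test": {"doc_freq": 2, "term_freq": 3, "ttf": 4},
                "twitter": {"doc_freq": 2, "term_freq": 1, "ttf": 2}
            }
        }
    }
}
""")


def tfidf_from_term_vectors(response, field, smooth_idf=True, sublinear_tf=False):
    """Return {term: tf * idf} for one field of a term vector response.

    Hypothetical helper: assumes doc_count can stand in for the corpus size.
    """
    field_data = response["term_vectors"][field]
    n_docs = field_data["field_statistics"]["doc_count"]
    weights = {}
    for term, stats in field_data["terms"].items():
        tf = float(stats["term_freq"])
        if sublinear_tf:
            tf = 1.0 + math.log(tf)
        df = stats["doc_freq"]
        if smooth_idf:
            # Smoothed idf as documented for scikit-learn's TfidfTransformer.
            idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
        else:
            idf = math.log(n_docs / df) + 1.0
        weights[term] = tf * idf
    return weights


def l2_normalize(weights):
    """Scale the weights so the vector has unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


weights = l2_normalize(tfidf_from_term_vectors(RESPONSE, "text"))
```

One such dict per document could then be handed to DictVectorizer to get
a sparse matrix, since it accepts mappings from feature names to float
values.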



On Tue, Jul 1, 2014 at 5:30 PM, Joel Nothman <[email protected]> wrote:

> Pulling the IDF out of Lucene is a little bit trickier, but otherwise
> DictVectorizer pipelined with TfidfTransformer should be able to do this.
>
>
> On 1 July 2014 16:40, Lars Buitinck <[email protected]> wrote:
>
>> 2014-07-01 21:03 GMT+02:00 Geetu Ambwani <[email protected]>:
>> > I imagine this transformer would be useful to others who use Lucene for
>> > text analysis and already have access to term vectors and have the
>> > partial pipeline but might still want access to the various weighting
>> > schemes available in TfidfVectorizer (e.g. norm, smooth_idf, sublinear_tf)
>>
>> Why? Can't DictVectorizer do this?
>>
>>
>> ------------------------------------------------------------------------------
>> Open source business process management suite built on Java and Eclipse
>> Turn processes into business applications with Bonita BPM Community
>> Edition
>> Quickly connect people, data, and systems into organized workflows
>> Winner of BOSSIE, CODIE, OW2 and Gartner awards
>> http://p.sf.net/sfu/Bonitasoft
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
