Re: Getting term vectors/computing cosine similarity

Andi Vajda Tue, 27 May 2014 23:11:26 -0700

> On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
> 
> *tl;dnr*: a next() method is defined for the Java class TVTermsEnum in
> Lucene 4.8.1, but it looks like there is no next() method available for an
> object that looks like it is an instance of the Python class TVTermsEnum in
> PyLucene 4.8.1.


If there is a next() method, there is a good chance the object is even iterable
(in the Python sense). You may need to cast it first, though, as the API that
returned it to you may not be declared to return TVTermsEnum:
  TVTermsEnum.cast_(obj)
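
A rough sketch of the cast-then-iterate idea, under a few assumptions: since
TVTermsEnum is private to CompressingTermVectorsReader, the cast here goes
through BytesRefIterator, the public Lucene 4.x interface that declares next().
The helper name term_freqs is made up for illustration, and the PyLucene calls
are left in comments because they need a running JVM and an open reader.

```python
def term_freqs(terms_enum, as_iterable):
    """Collect {term text: frequency} from a terms enum.

    terms_enum  -- object exposing totalTermFreq() for the current term
    as_iterable -- callable returning an iterable view of terms_enum whose
                   items have utf8ToString() (e.g. a cast_ function)
    """
    freqs = {}
    for term in as_iterable(terms_enum):
        # Assumption: for a term-vector enum, totalTermFreq() is the
        # frequency of the current term within that one document.
        freqs[term.utf8ToString()] = terms_enum.totalTermFreq()
    return freqs


# Intended use with PyLucene 4.x (not run here; assumes lucene.initVM()
# has been called and `reader` is an open IndexReader):
#
#   from org.apache.lucene.util import BytesRefIterator
#   terms = reader.getTermVector(doc_id_1, "cat_ids")
#   doc_1_freqs = term_freqs(terms.iterator(None), BytesRefIterator.cast_)
```

The enum itself is kept around alongside the cast view because the cast object
only exposes next(), while the frequency comes from the TermsEnum wrapper.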

A good place for PyLucene code examples is its suite of unit tests. It also has
a few samples - far fewer than in the 3.x releases because the APIs changed too
much. I'm pretty sure there is a test involving TermsEnum in the tests directory.

Andi..

> I have a set of documents that I would like to cluster. These documents
> share a vocabulary of only about 3,000 unique terms, but there are about
> 15,000,000 documents. One way I thought of doing this would be to index the
> documents using PyLucene (Python is the preferred programming language at
> work), obtain term vectors for the documents using PyLucene API functions,
> and calculate cosine similarities between pairs of term vectors in order to
> determine which documents are close to each other.
> 
> I found some sample Java code on the web that various people have posted
> showing ways to do this with older versions of Lucene. I downloaded
> PyLucene 4.8.1 and compared its API functions with the ones used in the
> code samples, and saw that this is an area of Lucene that has changed quite
> a bit. I can send an email to the lucene-user mailing group to ask what
> would be a good way of doing this using version 4.8.1, but the question I
> have for this mailing group has to do with some Java API functions that it
> looks like are not exposed in Python, unless I have to go about accessing
> them in a different way.
> 
> If I obtain the term vector for the field "cat_ids" in a document with id
> doc_id_1
> 
> doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
> 
> then doc_1_tfv is displayed as this object:
> 
> <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396>
> 
> In some of the sample code I looked at, the terms in doc_1_tfv could be
> obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
> member function of Terms or its subclasses any more. In another code
> sample, an iterator for the term vector is obtained via tfv_iter =
> doc_1_tfv.iterator(None) and then the terms are obtained one by one with
> calls to tfv_iter.next(). This is where I get stuck. tfv_iter has this
> value:
> 
> <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369>
> 
> and there is a next() function defined for the TVTermsEnum class, but this
> object doesn't list next() as one of its member functions and an exception
> is raised if it is called. It looks like the object only supports the
> member functions defined for the TermsEnum class, and next() is not one of
> them. Is this the case, or is there a way to have it support all of the
> TVTermsEnum member functions, including next()? TVTermsEnum is a private
> class in CompressingTermVectorsReader.java.
> 
> So I am wondering if there is a way to obtain term vectors in this way and
> that I am just not treating doc_1_tfv and tfv_iter in the right way, or if
> there is a different, better way to get term vectors for documents in a
> PyLucene index, or if this isn't something that Lucene should be used for.
> Thank you very much for any help you can provide.
> Mike
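
For the similarity step described in the question, here is a minimal sketch in
plain Python. It assumes each document's term vector has already been turned
into a {term: frequency} dict; the function name cosine_similarity and the use
of raw frequencies as weights are illustrative choices, not something PyLucene
provides.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product,
    # so loop over the smaller dict and look terms up in the larger one.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(w * vec_b.get(t, 0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_a * norm_b)
```

Other weights (tf-idf, for instance) could be substituted for the raw counts
without changing the function.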
