Re: Getting term vectors/computing cosine similarity

Michael O'Leary Wed, 28 May 2014 00:04:09 -0700

Hi Andi,
Thanks for the help. I just tried to import TVTermsEnum so I could try
casting my iter, but I don't see how to do it, since TVTermsEnum is a
private class with the fully qualified name
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.
I tried


    from org.apache.lucene.codecs.compressing import CompressingTermVectorsReader$TVTermsEnum
    from org.apache.lucene.codecs.compressing import TVTermsEnum

and

    import org.apache.lucene.codecs.compressing

but none of them provided access to TVTermsEnum (the first two raised
exceptions). After running import org.apache.lucene.codecs.compressing, I
could do dir(org.apache.lucene.codecs.compressing) and see the contents of
that module. CompressingTermVectorsReader was listed, but TVTermsEnum
wasn't. TVTermsEnum also wasn't listed in the output of
dir(org.apache.lucene.codecs.compressing.CompressingTermVectorsReader). So
it looks like my first problem is how to get access to TVTermsEnum.
Mike


On Tue, May 27, 2014 at 11:10 PM, Andi Vajda <va...@apache.org> wrote:

>
> > On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
> >
> > *tl;dr*: a next() method is defined for the Java class TVTermsEnum in
> > Lucene 4.8.1, but it looks like there is no next() method available for
> > an object that looks like it is an instance of the Python class
> > TVTermsEnum in PyLucene 4.8.1.
>
> If there is a next() method, there is a good chance the object is even
> iterable (in the Python sense). You may need to cast it first, though, as
> the API that returned it to you may not be declared to return TVTermsEnum:
>   TVTermsEnum.cast_(obj)
>
> A good place for PyLucene code examples is its suite of unit tests. It
> also has a few samples, far fewer than in the 3.x releases because the APIs
> changed too much.
> I'm pretty sure there is a test involving TermsEnum in the tests directory.
>
> Andi..
>
> > I have a set of documents that I would like to cluster. These documents
> > share a vocabulary of only about 3,000 unique terms, but there are about
> > 15,000,000 documents. One way I thought of doing this would be to index
> > the documents using PyLucene (Python is the preferred programming
> > language at work), obtain term vectors for the documents using PyLucene
> > API functions, and calculate cosine similarities between pairs of term
> > vectors in order to determine which documents are close to each other.
> >
> > I found some sample Java code on the web that various people have posted
> > showing ways to do this with older versions of Lucene. I downloaded
> > PyLucene 4.8.1 and compared its API functions with the ones used in the
> > code samples, and saw that this is an area of Lucene that has changed
> > quite a bit. I can send an email to the lucene-user mailing list to ask
> > what would be a good way of doing this using version 4.8.1, but the
> > question I have for this list concerns some Java API functions that do
> > not appear to be exposed in Python, unless I have to go about accessing
> > them in a different way.
> >
> > If I obtain the term vector for the field "cat_ids" in a document with id
> > doc_id_1
> >
> > doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
> >
> > then doc_1_tfv is displayed as this object:
> >
> > <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396
> >
> > In some of the sample code I looked at, the terms in doc_1_tfv could be
> > obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
> > member function of Terms or its subclasses any more. In another code
> > sample, an iterator for the term vector is obtained via tfv_iter =
> > doc_1_tfv.iterator(None) and then the terms are obtained one by one with
> > calls to tfv_iter.next(). This is where I get stuck. tfv_iter has this
> > value:
> >
> > <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369
> >
> > and there is a next() function defined for the TVTermsEnum class, but
> > this object doesn't list next() as one of its member functions and an
> > exception is raised if it is called. It looks like the object only
> > supports the member functions defined for the TermsEnum class, and
> > next() is not one of them. Is this the case, or is there a way to have
> > it support all of the TVTermsEnum member functions, including next()?
> > TVTermsEnum is a private class in CompressingTermVectorsReader.java.
> >
> > So I am wondering if there is a way to obtain term vectors in this way
> > and that I am just not treating doc_1_tfv and tfv_iter in the right way,
> > or if there is a different, better way to get term vectors for documents
> > in a PyLucene index, or if this isn't something that Lucene should be
> > used for.
> > Thank you very much for any help you can provide.
> > Mike
>
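For the cosine-similarity step described in the original question, here is a minimal pure-Python sketch. It assumes each document's term vector has already been read out of the index into a dict mapping term to frequency. The function name cosine_similarity and the sample terms are illustrative only and are not part of PyLucene or of the poster's code.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term-frequency vectors.

    Each vector is a dict mapping a term (str) to its frequency (number).
    Returns 0.0 if either vector is empty or has zero length.
    """
    # Iterate over the smaller dict so the dot product touches fewer keys.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a

    # Only terms present in both vectors contribute to the dot product.
    dot = sum(freq * vec_b.get(term, 0) for term, freq in vec_a.items())

    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term vectors for two documents (made-up terms and counts).
doc_1 = {"cat_12": 2, "cat_40": 1}
doc_2 = {"cat_12": 1, "cat_77": 3}
print(cosine_similarity(doc_1, doc_2))
```

Comparing every pair among 15,000,000 documents would mean on the order of 10^14 similarity computations, so a function like this would probably serve only as the inner step of a clustering method that avoids the full pairwise comparison.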
