Hi Andi,

Thanks for the help. I just tried to import TVTermsEnum so I could try casting my iter, and I don't see how to do it, since TVTermsEnum is a private class with the fully qualified name org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum. I tried

    from org.apache.lucene.codecs.compressing import CompressingTermVectorsReader$TVTermsEnum
    from org.apache.lucene.codecs.compressing import TVTermsEnum

and

    import org.apache.lucene.codecs.compressing

but none of them provided access to TVTermsEnum (the first two raised exceptions). After running import org.apache.lucene.codecs.compressing, I could call dir(org.apache.lucene.codecs.compressing) and see the contents of that module. CompressingTermVectorsReader was listed, but TVTermsEnum wasn't. TVTermsEnum also wasn't listed in the output of dir(org.apache.lucene.codecs.compressing.CompressingTermVectorsReader). So it looks like my first problem is how to get access to TVTermsEnum.

Mike

On Tue, May 27, 2014 at 11:10 PM, Andi Vajda <va...@apache.org> wrote:

> On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
>
> > *tl;dr*: a next() method is defined for the Java class TVTermsEnum in
> > Lucene 4.8.1, but it looks like there is no next() method available for an
> > object that looks like it is an instance of the Python class TVTermsEnum in
> > PyLucene 4.8.1.
>
> If there is a next() method, there is a good chance the object is even
> iterable (in the Python sense). You may need to cast it first, though, as
> the API that returned it to you may not be defined to return TVTermsEnum:
> TVTermsEnum.cast_(obj)
>
> A good place for PyLucene code examples is its suite of unit tests. It
> also has a few samples - way fewer than in 3.x releases because the APIs
> changed too much.
> I'm pretty sure there is a test involving TermsEnum in the tests directory.
>
> Andi..
>
> > I have a set of documents that I would like to cluster. These documents
> > share a vocabulary of only about 3,000 unique terms, but there are about
> > 15,000,000 documents.
> > One way I thought of doing this would be to index the
> > documents using PyLucene (Python is the preferred programming language at
> > work), obtain term vectors for the documents using PyLucene API functions,
> > and calculate cosine similarities between pairs of term vectors in order to
> > determine which documents are close to each other.
> >
> > I found some sample Java code on the web that various people have posted
> > showing ways to do this with older versions of Lucene. I downloaded
> > PyLucene 4.8.1 and compared its API functions with the ones used in the
> > code samples, and saw that this is an area of Lucene that has changed quite
> > a bit. I can send an email to the lucene-user mailing group to ask what
> > would be a good way of doing this using version 4.8.1, but the question I
> > have for this mailing group has to do with some Java API functions that it
> > looks like are not exposed in Python, unless I have to go about accessing
> > them in a different way.
> >
> > If I obtain the term vector for the field "cat_ids" in a document with id
> > doc_id_1
> >
> >     doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
> >
> > then doc_1_tfv is displayed as this object:
> >
> >     <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396>
> >
> > In some of the sample code I looked at, the terms in doc_1_tfv could be
> > obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
> > member function of Terms or its subclasses any more. In another code
> > sample, an iterator for the term vector is obtained via tfv_iter =
> > doc_1_tfv.iterator(None) and then the terms are obtained one by one with
> > calls to tfv_iter.next(). This is where I get stuck.
> > tfv_iter has this value:
> >
> >     <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369>
> >
> > and there is a next() function defined for the TVTermsEnum class, but this
> > object doesn't list next() as one of its member functions, and an exception
> > is raised if it is called. It looks like the object only supports the
> > member functions defined for the TermsEnum class, and next() is not one of
> > them. Is this the case, or is there a way to have it support all of the
> > TVTermsEnum member functions, including next()? TVTermsEnum is a private
> > class in CompressingTermVectorsReader.java.
> >
> > So I am wondering if there is a way to obtain term vectors in this way and
> > I am just not treating doc_1_tfv and tfv_iter in the right way, or if
> > there is a different, better way to get term vectors for documents in a
> > PyLucene index, or if this isn't something that Lucene should be used for.
> > Thank you very much for any help you can provide.
> > Mike
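
For reference, the cosine-similarity step described in the quoted question can be sketched in plain Python. This is only an illustration under assumptions, not PyLucene code: it assumes each document's term vector has already been extracted into a dict mapping term to frequency, and the function name cosine_similarity and the sample term values are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term-frequency vectors (term -> freq dicts)."""
    # Iterate over the smaller vector; terms absent from the other contribute 0.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(freq * vec_b.get(term, 0) for term, freq in vec_a.items())
    norm_a = math.sqrt(sum(f * f for f in vec_a.values()))
    norm_b = math.sqrt(sum(f * f for f in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term vectors standing in for two documents' "cat_ids" fields.
doc_1 = Counter(["12", "12", "40", "7"])
doc_2 = Counter(["12", "40", "40", "99"])
similarity = cosine_similarity(doc_1, doc_2)
print(round(similarity, 4))
```

Keeping the vectors sparse matters here because each document uses only a few of the roughly 3,000 vocabulary terms, so the dot product touches only the terms that are actually present.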