Hi,

I'd like to ask users about their experience with the fastest way to access the term dictionary.
What I want to do is implement some algorithms for finding phrases (e.g. mutual rank ratio [1]) and for other statistics on term distribution; generally, corpus-related work. The idea is to compute the statistics on numbers (i.e. a long per term, taken from the term dictionary) rather than on the terms themselves, to minimize memory usage.

I tried this with TermsEnum plus the ordinal numbers of the terms, which are easy to retrieve, but getting a term by ord then throws UnsupportedOperationException [2]. I see there is also a blocktreeords codec [3].

Before diving deeper into this (i.e. changing the codec for my indexes), I'd like to ask whether a workflow like the one described above is considered at least semi-smart, or whether I'm on the wrong track and there is a smarter way to avoid calculating collocations on actual strings or BytesRefs.

Any pointers are much appreciated.

Kind regards,
Jürgen

[1] http://www.google.ch/patents/US20100250238
[2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
[3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java

Jürgen Jakobitsch
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz

PERSONAL INFORMATION
| web : http://www.turnguard.com
| foaf : http://www.turnguard.com/turnguard
| g+ : https://plus.google.com/111233759991616358206/posts
| skype : jakobitsch-punkt
| xmlns:tg = "http://www.turnguard.com/turnguard#"
| blockchain : https://onename.com/turnguard
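
A minimal sketch, in plain Java, of the kind of number-based counting described above. It does not use the Lucene API at all: the class OrdinalBigramCounter and its methods are hypothetical names made up for illustration, and the int ids it assigns only stand in for the term ordinals one would ideally get from the term dictionary. Each adjacent term pair is packed into a single long, so the counts are keyed on numbers and strings are needed only when reporting results.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts adjacent term pairs
// using numeric ids instead of strings.
final class OrdinalBigramCounter {
    // Stand-in for the term dictionary: term -> id and id -> term.
    private final Map<String, Integer> termToId = new HashMap<>();
    private final List<String> idToTerm = new ArrayList<>();
    // Pair counts keyed on two ids packed into one long.
    private final Map<Long, Integer> bigramCounts = new HashMap<>();

    // Returns the id of a term, assigning the next free id on first sight.
    public int idOf(String term) {
        Integer id = termToId.get(term);
        if (id == null) {
            id = idToTerm.size();
            termToId.put(term, id);
            idToTerm.add(term);
        }
        return id;
    }

    // Packs two non-negative int ids into one long (left id in the high 32 bits).
    public static long pack(int left, int right) {
        return ((long) left << 32) | (right & 0xFFFFFFFFL);
    }

    // Counts every pair of adjacent tokens in one token sequence.
    public void addSentence(String[] tokens) {
        for (int i = 0; i + 1 < tokens.length; i++) {
            long key = pack(idOf(tokens[i]), idOf(tokens[i + 1]));
            bigramCounts.merge(key, 1, Integer::sum);
        }
    }

    // Returns how often 'left' was immediately followed by 'right'.
    public int count(String left, String right) {
        Integer l = termToId.get(left);
        Integer r = termToId.get(right);
        if (l == null || r == null) {
            return 0;
        }
        return bigramCounts.getOrDefault(pack(l, r), 0);
    }

    // Turns a packed key back into readable text, for reporting only.
    public String describe(long key) {
        return idToTerm.get((int) (key >>> 32)) + " " + idToTerm.get((int) key);
    }
}
```

With real index data the ids would have to come from the index itself (which is exactly the open question here), and a primitive long-to-int map would save more memory than the boxed HashMap used in this sketch.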