Hi,

You can replace the terms with their hashes directly in the analyzer chain. Just write a custom TermToBytesRefAttribute implementation that hashes each term to a constant-length byte[] (and plug it in using an AttributeFactory)! :-) This gives you all the features of hashed, constant-length terms, but you lose prefix and wildcard queries. In fact, NumericTokenStream uses the same mechanism for numeric terms!
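To make the hashing step concrete, here is a rough sketch using only the JDK. It is not Lucene code: the class name TermHasher, the 8-byte length, and the choice of a truncated SHA-256 digest are all assumptions made for illustration, and the wiring into a custom attribute implementation is left out. Note that a truncated digest is not a perfect hash, so collisions are unlikely but possible.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper (not part of Lucene): maps a term to a fixed-length byte[].
final class TermHasher {
    // Assumed hash length in bytes; every term maps to exactly this many bytes.
    static final int HASH_LENGTH = 8;

    static byte[] hash(String term) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(term.getBytes(StandardCharsets.UTF_8));
            // Keep only the first HASH_LENGTH bytes of the 32-byte digest.
            return Arrays.copyOf(digest, HASH_LENGTH);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every conforming Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] full = hash("lucene");
        byte[] prefix = hash("luc");
        // Both hashes have the same constant length, whatever the term length.
        System.out.println(full.length + " " + prefix.length); // prints "8 8"
        // The same term always produces the same bytes, so exact-match lookups still work.
        System.out.println(Arrays.equals(full, hash("lucene"))); // prints "true"
        // The bytes for "luc" bear no prefix relation to the bytes for "lucene",
        // which is why prefix and wildcard queries stop working.
        System.out.println(Arrays.toString(full));
        System.out.println(Arrays.toString(prefix));
    }
}
```

In the analyzer chain, the custom attribute implementation would hand these bytes to the indexer in place of the UTF-8 bytes of the term, and the same hashing has to be applied to query terms so that exact-match lookups find them.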
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Adrien Grand [mailto:jpou...@gmail.com]
> Sent: Tuesday, July 09, 2013 11:25 PM
> To: java-user@lucene.apache.org
> Subject: Re: posting list strings
>
> Hi,
>
> Lucene stores the string because it may need it to run prefix or range
> queries. We don't have a hash-based terms dictionary right now but I know
> some people wrote one since they don't need support for these queries, see
> for instance the Earlybird paper[1]. Then if you can find a perfect hashing
> function, you can just replace your terms by their hash.
>
> [1] http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org