Hi Florian,

You might run into issues using an n-gram. As I see it, you need tokenized URLs, and then need to run an exact search on the search string using a keyword tokenizer. You could try this; I am assuming it will work.

So something like en.wikipedia.org/wiki/production_code/test gets tokenized as
[en] [wikipedia] [org] [wiki] [production_code] [test]
so an exact search for any run of consecutive tokens (keeping their order) would get you the result.
And yes, you might want to take a closer look at your tokenizers.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw............

On Sun, Sep 20, 2009 at 3:30 AM, AHMET ARSLAN <iori...@yahoo.com> wrote:

> > Dear List,
> >
> > I'm working on a project where I have to check a blacklist
> > of URLs with Lucene (about 500,000).
> > Is it possible to search for a URL in a hierarchical
> > context?
> >
> > For example:
> > Blacklist entry: "en.wikipedia.org/wiki/production_code"
> >
> > "en.wikipedia.org/wiki/production_code/test" should match
> > "en.wikipedia.org/wiki/test" should not match
>
> If any leading substring (characters 0 to n) of your query matches a document
> completely, then that query should match, right? That's what I understand
> from your examples.
>
> You can achieve this by using two different analyzers for index and query
> time.
>
> Query analyzer:
>
> KeywordTokenizer
> EdgeNGramTokenFilter (side = EdgeNGramTokenFilter.Side.FRONT, minGram = 1,
> maxGram = 512)
>
> Index analyzer:
>
> KeywordTokenizer
>
> The index analyzer comes out of the box:
> org.apache.lucene.analysis.KeywordAnalyzer
> But you need to write the query analyzer yourself.
>
> If you want case-insensitive search, you can add LowerCaseFilter to both of
> your analyzers.
>
> With this setup, your blacklist URLs will be indexed verbatim (one token each).
>
> Your query "en.wikipedia.org/wiki/production_code/test"
> will be broken into these pieces, and one of them will match your document:
>
> e
> en
> en.
> en.w
> en.wi
> en.wik
> en.wiki
> en.wikip
> en.wikipe
> en.wikiped
> en.wikipedi
> en.wikipedia
> en.wikipedia.
> en.wikipedia.o
> en.wikipedia.or
> en.wikipedia.org
> en.wikipedia.org/
> en.wikipedia.org/w
> en.wikipedia.org/wi
> en.wikipedia.org/wik
> en.wikipedia.org/wiki
> en.wikipedia.org/wiki/
> en.wikipedia.org/wiki/p
> en.wikipedia.org/wiki/pr
> en.wikipedia.org/wiki/pro
> en.wikipedia.org/wiki/prod
> en.wikipedia.org/wiki/produ
> en.wikipedia.org/wiki/produc
> en.wikipedia.org/wiki/product
> en.wikipedia.org/wiki/producti
> en.wikipedia.org/wiki/productio
> en.wikipedia.org/wiki/production
> en.wikipedia.org/wiki/production_
> en.wikipedia.org/wiki/production_c
> en.wikipedia.org/wiki/production_co
> en.wikipedia.org/wiki/production_cod
> * en.wikipedia.org/wiki/production_code // this one matches your document
> en.wikipedia.org/wiki/production_code/
> en.wikipedia.org/wiki/production_code/t
> en.wikipedia.org/wiki/production_code/te
> en.wikipedia.org/wiki/production_code/tes
> en.wikipedia.org/wiki/production_code/test
>
> None of the pieces of the query "en.wikipedia.org/wiki/test" will
> match your document.
>
> Hope this helps.
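
To make the two ideas in this thread concrete, here is a rough plain-Java sketch of the matching logic only. It does not use Lucene, and the class and method names (UrlBlacklistSketch, tokenize, matchesByTokens, frontEdgeNGrams, matchesByEdgeNGrams) are made up for illustration. Splitting on '.' and '/' and anchoring the token comparison at the start of the URL are assumptions on my part, not something the thread specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; none of these names come from Lucene.
class UrlBlacklistSketch {

    // Split a URL into ordered tokens on '.' and '/' (assumed delimiters).
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.split("[./]")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Token idea: the blacklist entry's tokens must be a leading,
    // in-order run of the URL's tokens (anchoring is an assumption).
    static boolean matchesByTokens(String blacklistEntry, String url) {
        List<String> entry = tokenize(blacklistEntry);
        List<String> candidate = tokenize(url);
        return entry.size() <= candidate.size()
                && candidate.subList(0, entry.size()).equals(entry);
    }

    // Edge n-gram idea: every front-anchored prefix of the query string,
    // from minGram to maxGram characters long.
    static List<String> frontEdgeNGrams(String query, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, query.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(query.substring(0, len));
        }
        return grams;
    }

    // The URL matches if any of its prefixes equals a verbatim blacklist entry.
    static boolean matchesByEdgeNGrams(Set<String> blacklist, String url) {
        for (String gram : frontEdgeNGrams(url, 1, 512)) {
            if (blacklist.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String entry = "en.wikipedia.org/wiki/production_code";
        Set<String> blacklist = new HashSet<>(Arrays.asList(entry));
        for (String url : Arrays.asList(
                "en.wikipedia.org/wiki/production_code/test",
                "en.wikipedia.org/wiki/test")) {
            System.out.println(url
                    + " tokens=" + matchesByTokens(entry, url)
                    + " edgeNGrams=" + matchesByEdgeNGrams(blacklist, url));
        }
    }
}
```

One difference worth keeping in mind: the prefix variant compares characters, so a URL like "en.wikipedia.org/wiki/production_codes" would also hit the blacklist entry, while the token variant only matches on whole path segments.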