> Dear List, > > I'm working on a project where i have to check a Blacklist > of URL's with Lucene. (about 500.000) > Is it possible to search for a URL in a hierarchical > context? > > for Example: > Blacklist entry: "en.wikipedia.org/wiki/production_code" > > "en.wikipedia.org/wiki/production_code/test" should match > "en.wikipedia.org/wiki/test" should not match
If any substring (0 to n) of your query matches a document completely than that query should match, right? Thats what I understand from your examples. You can achieve this bu using two different analyzers for index and query time. query analyzer: KeywordTokenizer EdgeNGramTokenFilter (side = EdgeNGramTokenFilter.Side.FRONT , mingram = 1, maxgram=512) index analyzer: KeywordTokenizer The index analyzer comes out-of-the-box: org.apache.lucene.analysis.KeywordAnalyzer But you need to write query analyzer. If you want case-insensitive search you can add LowercaseFilter to both of your analyzers. By using this, your black list urls will be indexed verbatim. (one token) Your query "en.wikipedia.org/wiki/production_code/test" will be broken in to these pieces and one of them will match your document: e en en. en.w en.wi en.wik en.wiki en.wikip en.wikipe en.wikiped en.wikipedi en.wikipedia en.wikipedia. en.wikipedia.o en.wikipedia.or en.wikipedia.org en.wikipedia.org/ en.wikipedia.org/w en.wikipedia.org/wi en.wikipedia.org/wik en.wikipedia.org/wiki en.wikipedia.org/wiki/ en.wikipedia.org/wiki/p en.wikipedia.org/wiki/pr en.wikipedia.org/wiki/pro en.wikipedia.org/wiki/prod en.wikipedia.org/wiki/produ en.wikipedia.org/wiki/produc en.wikipedia.org/wiki/product en.wikipedia.org/wiki/producti en.wikipedia.org/wiki/productio en.wikipedia.org/wiki/production en.wikipedia.org/wiki/production_ en.wikipedia.org/wiki/production_c en.wikipedia.org/wiki/production_co en.wikipedia.org/wiki/production_cod * en.wikipedia.org/wiki/production_code // this is your document a match en.wikipedia.org/wiki/production_code/ en.wikipedia.org/wiki/production_code/t en.wikipedia.org/wiki/production_code/te en.wikipedia.org/wiki/production_code/tes en.wikipedia.org/wiki/production_code/test The none of the pieces of the query "en.wikipedia.org/wiki/test" will match your document. Hope this helps. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org