> Dear List,
>
> I'm working on a project where i have to check a Blacklist
> of URL's with Lucene. (about 500.000)
> Is it possible to search for a URL in a hierarchical
> context?
>
> for Example:
> Blacklist entry: "en.wikipedia.org/wiki/production_code"
>
> "en.wikipedia.org/wiki/production_code/test" should match
> "en.wikipedia.org/wiki/test" should not match
If any substring (0 to n) of your query matches a document completely than that
query should match, right? Thats what I understand from your examples.
You can achieve this bu using two different analyzers for index and query time.
query analyzer:
KeywordTokenizer
EdgeNGramTokenFilter (side = EdgeNGramTokenFilter.Side.FRONT , mingram = 1,
maxgram=512)
index analyzer:
KeywordTokenizer
The index analyzer comes out-of-the-box:
org.apache.lucene.analysis.KeywordAnalyzer
But you need to write query analyzer.
If you want case-insensitive search you can add LowercaseFilter to both of your
analyzers.
By using this, your black list urls will be indexed verbatim. (one token)
Your query "en.wikipedia.org/wiki/production_code/test"
will be broken in to these pieces and one of them will match your document:
e
en
en.
en.w
en.wi
en.wik
en.wiki
en.wikip
en.wikipe
en.wikiped
en.wikipedi
en.wikipedia
en.wikipedia.
en.wikipedia.o
en.wikipedia.or
en.wikipedia.org
en.wikipedia.org/
en.wikipedia.org/w
en.wikipedia.org/wi
en.wikipedia.org/wik
en.wikipedia.org/wiki
en.wikipedia.org/wiki/
en.wikipedia.org/wiki/p
en.wikipedia.org/wiki/pr
en.wikipedia.org/wiki/pro
en.wikipedia.org/wiki/prod
en.wikipedia.org/wiki/produ
en.wikipedia.org/wiki/produc
en.wikipedia.org/wiki/product
en.wikipedia.org/wiki/producti
en.wikipedia.org/wiki/productio
en.wikipedia.org/wiki/production
en.wikipedia.org/wiki/production_
en.wikipedia.org/wiki/production_c
en.wikipedia.org/wiki/production_co
en.wikipedia.org/wiki/production_cod
* en.wikipedia.org/wiki/production_code // this is your document a match
en.wikipedia.org/wiki/production_code/
en.wikipedia.org/wiki/production_code/t
en.wikipedia.org/wiki/production_code/te
en.wikipedia.org/wiki/production_code/tes
en.wikipedia.org/wiki/production_code/test
The none of the pieces of the query "en.wikipedia.org/wiki/test" will match
your document.
Hope this helps.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]