Re: Lucene search in URL

Anshum Sat, 19 Sep 2009 21:58:57 -0700

Hi Florian,
You might run into issues using an n-gram. As I see it, you need tokenized
URLs, and you need to run an exact search using a keyword tokenizer on the
search string.
You could try this; I am assuming it'll work.
So something like
en.wikipedia.org/wiki/production_code/test
gets tokenized as
[en] [wikipedia] [org] [wiki] [production_code] [test]


So an exact search for any run of consecutive tokens (keeping their order)
would get you the result. And yes, you might want to take a closer look at
your tokenizers.
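
A rough sketch of that idea in plain Java, with no Lucene calls at all. The
class and method names (UrlTokenMatch, tokenize, containsRun) are made up for
illustration, and splitting on '.' and '/' is my assumption about how the URL
would be tokenized:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class UrlTokenMatch {

    // Split a URL on '.' and '/' (assumed delimiters) and lowercase it.
    // '_' is not a delimiter, so "production_code" stays one token.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String t : url.toLowerCase(Locale.ROOT).split("[./]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the entry's tokens occur in the URL's tokens as one
    // contiguous run, in the same order.
    static boolean containsRun(List<String> urlTokens, List<String> entryTokens) {
        return !entryTokens.isEmpty()
                && Collections.indexOfSubList(urlTokens, entryTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> entry = tokenize("en.wikipedia.org/wiki/production_code");
        System.out.println(containsRun(
                tokenize("en.wikipedia.org/wiki/production_code/test"), entry));
        System.out.println(containsRun(
                tokenize("en.wikipedia.org/wiki/test"), entry));
    }
}
```

If only matches anchored at the start of the URL should count, checking that
Collections.indexOfSubList returns 0 instead of any non-negative index would
probably be closer to the hierarchical behaviour you describe.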

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Sun, Sep 20, 2009 at 3:30 AM, AHMET ARSLAN <iori...@yahoo.com> wrote:

> > Dear List,
> >
> > I'm working on a project where i have to check a Blacklist
> > of URL's with Lucene. (about 500.000)
> > Is it possible to search for a URL in a hierarchical
> > context?
> >
> > for Example:
> > Blacklist entry: "en.wikipedia.org/wiki/production_code"
> >
> > "en.wikipedia.org/wiki/production_code/test" should match
> > "en.wikipedia.org/wiki/test" should not match
>
> If any substring (0 to n) of your query matches a document completely, then
> that query should match, right? That's what I understand from your examples.
>
> You can achieve this by using two different analyzers for index and query
> time.
>
> query analyzer:
>
> KeywordTokenizer
> EdgeNGramTokenFilter (side = EdgeNGramTokenFilter.Side.FRONT , mingram = 1,
> maxgram=512)
>
> index analyzer:
>
> KeywordTokenizer
>
> The index analyzer comes out-of-the-box:
> org.apache.lucene.analysis.KeywordAnalyzer
> But you need to write the query analyzer yourself.
>
> If you want case-insensitive search, you can add LowerCaseFilter to both of
> your analyzers.
>
> By using this, your blacklist URLs will be indexed verbatim (one token each).
>
> Your query "en.wikipedia.org/wiki/production_code/test"
> will be broken into these pieces and one of them will match your document:
>
> e
> en
> en.
> en.w
> en.wi
> en.wik
> en.wiki
> en.wikip
> en.wikipe
> en.wikiped
> en.wikipedi
> en.wikipedia
> en.wikipedia.
> en.wikipedia.o
> en.wikipedia.or
> en.wikipedia.org
> en.wikipedia.org/
> en.wikipedia.org/w
> en.wikipedia.org/wi
> en.wikipedia.org/wik
> en.wikipedia.org/wiki
> en.wikipedia.org/wiki/
> en.wikipedia.org/wiki/p
> en.wikipedia.org/wiki/pr
> en.wikipedia.org/wiki/pro
> en.wikipedia.org/wiki/prod
> en.wikipedia.org/wiki/produ
> en.wikipedia.org/wiki/produc
> en.wikipedia.org/wiki/product
> en.wikipedia.org/wiki/producti
> en.wikipedia.org/wiki/productio
> en.wikipedia.org/wiki/production
> en.wikipedia.org/wiki/production_
> en.wikipedia.org/wiki/production_c
> en.wikipedia.org/wiki/production_co
> en.wikipedia.org/wiki/production_cod
> * en.wikipedia.org/wiki/production_code  // this one matches your document
> en.wikipedia.org/wiki/production_code/
> en.wikipedia.org/wiki/production_code/t
> en.wikipedia.org/wiki/production_code/te
> en.wikipedia.org/wiki/production_code/tes
> en.wikipedia.org/wiki/production_code/test
>
> None of the pieces of the query "en.wikipedia.org/wiki/test" will
> match your document.
>
> Hope this helps.
>
>
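
The front edge n-gram scheme quoted above can be sketched in plain Java as
well. This only simulates the two analyzers with a HashSet and substring; it
uses no Lucene classes, and the names (EdgeNGramBlacklist, frontEdgeNGrams,
isBlacklisted) are hypothetical. The 1 and 512 limits are the mingram and
maxgram values from the quoted message:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the quoted index/query analyzer pair.
final class EdgeNGramBlacklist {

    // All prefixes of text with length minGram..maxGram (front edge n-grams).
    static List<String> frontEdgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, text.length());
        for (int len = Math.max(1, minGram); len <= upper; len++) {
            grams.add(text.substring(0, len));
        }
        return grams;
    }

    // The blacklist plays the role of the index: each entry is stored
    // verbatim as a single token. The URL is expanded into prefixes and
    // matches as soon as one prefix equals an entry.
    static boolean isBlacklisted(Set<String> blacklist, String url) {
        for (String gram : frontEdgeNGrams(url, 1, 512)) {
            if (blacklist.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> blacklist = new HashSet<>();
        blacklist.add("en.wikipedia.org/wiki/production_code");
        System.out.println(isBlacklisted(blacklist,
                "en.wikipedia.org/wiki/production_code/test"));
        System.out.println(isBlacklisted(blacklist,
                "en.wikipedia.org/wiki/test"));
    }
}
```

Note that this is character-level prefix matching, so a URL such as
"en.wikipedia.org/wiki/production_codex" would also hit the same entry. If
that is unwanted, one option is to accept a prefix only when it ends at a "/"
boundary or at the end of the URL.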
