Hmm.... My Analyzer is a Dictionary-based Analyzer. And so, it only recognizes tokens in its dictionary. Adding every url (or domain) is not a viable solution.
So how could I include that to my analyzer? Lucene Filter? FilterReader? Thanks, -- Franz Allan Valencia See | Java Software Engineer franz....@gmail.com LinkedIn: http://www.linkedin.com/in/franzsee Twitter: http://www.twitter.com/franz_see On Sun, Jan 31, 2010 at 1:58 AM, Ian Lea <ian....@gmail.com> wrote: > Are you asking how to get lucene.apache.org out of > http://lucene.apache.org/ or how to get apache.org out of > lucene.apache.org? The getHost() method of java.net.URL will give you > the former. Or use a regexp. I don't know an easy way to do the > latter, but depending on your requirements you could split > lucene.apache.org into tokens "lucene.apache.org" and "apache.org" and > "org" and index all of them. You probably want to use an analyzer > that doesn't split on the . character. > > > -- > Ian. > > > On Sat, Jan 30, 2010 at 12:12 AM, Franz Allan Valencia See > <franz....@gmail.com> wrote: > > How should I go about identifying the domain? > > > > Thanks, > > > > -- > > Franz Allan Valencia See | Java Software Engineer > > franz....@gmail.com > > LinkedIn: http://www.linkedin.com/in/franzsee > > Twitter: http://www.twitter.com/franz_see > > > > On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea <ian....@gmail.com> wrote: > > > >> Instead of playing around with tf/idf, how about just indexing and > >> searching the domain. > >> > >> > >> -- > >> Ian. > >> > >> > >> On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See > >> <franz....@gmail.com> wrote: > >> > Good day, > >> > > >> > I am currently using lucene for my searches. And one of the problems > that > >> Im > >> > facing is when keyword is a url. The tokens such as http, https, ://, > >> index, > >> > html, etc seems to be messing up with our search results. The focus > was > >> > supposed to be only on the url domain. > >> > > >> > The idea that I have is modify the idf so that rare terms get boosted > >> much > >> > more than the default settings in lucene. Since there are probably a > lot > >> of > >> > http, https://, etc, then matches to these terms should be really > really > >> > low, while matches to the domain (which is rare) should be high. > >> > > >> > Would this work or am I totally misunderstanding lucene's tf/idf? :-) > >> > > >> > Thanks, > >> > > >> > -- > >> > Franz Allan Valencia See | Java Software Engineer > >> > franz....@gmail.com > >> > LinkedIn: http://www.linkedin.com/in/franzsee > >> > Twitter: http://www.twitter.com/franz_see > >> > > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > >> > >> > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >