Are you asking how to get lucene.apache.org out of
http://lucene.apache.org/, or how to get apache.org out of
lucene.apache.org?  The getHost() method of java.net.URL will give you
the former, or you could use a regexp.  I don't know an easy way to do
the latter, but depending on your requirements you could expand
lucene.apache.org into the tokens "lucene.apache.org", "apache.org" and
"org" and index all of them.  You will probably want to use an analyzer
that doesn't split on the . character.
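
A rough sketch of both steps, in case it helps.  The class and method
names (DomainTokens, hostOf, suffixesOf) are made up for illustration,
and this only shows one way to build the tokens, not how you would wire
them into your index.  It assumes the input is a well-formed absolute
URL.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
class DomainTokens {

    /** Returns the host part of a URL, e.g. "lucene.apache.org". */
    static String hostOf(String url) throws MalformedURLException {
        // URL.getHost() does the work; going through URI avoids the
        // URL(String) constructor, which newer JDKs mark as deprecated.
        return URI.create(url).toURL().getHost();
    }

    /** Expands a host into itself plus every parent suffix. */
    static List<String> suffixesOf(String host) {
        List<String> tokens = new ArrayList<>();
        String rest = host.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            tokens.add(rest);
            int dot = rest.indexOf('.');
            if (dot < 0) {
                break; // no more labels to strip
            }
            rest = rest.substring(dot + 1);
        }
        return tokens;
    }

    public static void main(String[] args) throws MalformedURLException {
        String host = hostOf("http://lucene.apache.org/index.html");
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(suffixesOf(host));
    }
}
```

Each token would then be indexed as-is, so a search for apache.org
matches documents from lucene.apache.org as well.  Note that the last
token is just "org", which will match a lot; drop it if that is not
what you want.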


--
Ian.


On Sat, Jan 30, 2010 at 12:12 AM, Franz Allan Valencia See
<franz....@gmail.com> wrote:
> How should I go about identifying the domain?
>
> Thanks,
>
> --
> Franz Allan Valencia See | Java Software Engineer
> franz....@gmail.com
> LinkedIn: http://www.linkedin.com/in/franzsee
> Twitter: http://www.twitter.com/franz_see
>
> On Fri, Jan 29, 2010 at 6:42 PM, Ian Lea <ian....@gmail.com> wrote:
>
>> Instead of playing around with tf/idf, how about just indexing and
>> searching the domain?
>>
>>
>> --
>> Ian.
>>
>>
>> On Fri, Jan 29, 2010 at 3:43 AM, Franz Allan Valencia See
>> <franz....@gmail.com> wrote:
>> > Good day,
>> >
>> > I am currently using lucene for my searches. One of the problems
>> > I'm facing is when the keyword is a url. Tokens such as http, https,
>> > ://, index, html, etc. seem to be messing up our search results. The
>> > focus was supposed to be only on the url domain.
>> >
>> > The idea that I have is to modify the idf so that rare terms get
>> > boosted much more than with the default settings in lucene. Since
>> > there are probably a lot of http, https://, etc., matches to these
>> > terms should score really low, while matches to the domain (which
>> > is rare) should score high.
>> >
>> > Would this work or am I totally misunderstanding lucene's tf/idf? :-)
>> >
>> > Thanks,
>> >
>> > --
>> > Franz Allan Valencia See | Java Software Engineer
>> > franz....@gmail.com
>> > LinkedIn: http://www.linkedin.com/in/franzsee
>> > Twitter: http://www.twitter.com/franz_see
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

