Re: Untokenized URL

blazingwolf7 Sun, 06 Jul 2008 22:39:43 -0700

I am trying to retrieve the URL and use it as a filter. The main problem is
that I don't want to use a reader to retrieve the URL again and again for
each document found.


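TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms(new Term(field, ""));
do {
   Term term = termEnum.term();
   // Stop once the enumeration runs past the requested field.
   if (term == null || !term.field().equals(field)) break;
   String token = term.text(); // one indexed token, not the whole URL
} while (termEnum.next());
termEnum.close();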

I am using this code to retrieve the field containing the URL, but the field
is tokenized, so I only get the individual tokens. Is there any way to
untokenize it, or is there a better way to do this?
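
For illustration, here is a minimal sketch in plain Java of why enumerating
the terms of a tokenized field yields only fragments, and why also indexing
the URL as one untokenized term (as suggested in the quoted reply) makes the
whole URL recoverable in a single pass. This is a toy model, not Lucene code:
the class ToyUrlIndex, its methods, and the splitting rule in tokenize() are
all assumptions made up for this example, and a real analyzer may split URLs
differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy stand-in for an inverted index (hypothetical; this is NOT the Lucene API). */
class ToyUrlIndex {
    // Term text -> ids of the documents containing it, kept sorted by term text.
    private final Map<String, Set<Integer>> tokenizedTerms = new TreeMap<>();
    private final Map<String, Set<Integer>> untokenizedTerms = new TreeMap<>();

    /** Assumed analyzer: lower-case, then split on anything that is not a letter or digit. */
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.toLowerCase().split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Indexes the URL twice: once as separate tokens, once as a single whole term. */
    void addDocument(int docId, String url) {
        for (String token : tokenize(url)) {
            post(tokenizedTerms, token, docId);
        }
        post(untokenizedTerms, url, docId);
    }

    private static void post(Map<String, Set<Integer>> terms, String text, int docId) {
        terms.computeIfAbsent(text, k -> new TreeSet<>()).add(docId);
    }

    /** What a term enumeration over the tokenized field would see: fragments only. */
    List<String> tokenizedTermTexts() {
        return new ArrayList<>(tokenizedTerms.keySet());
    }

    /** What a term enumeration over the untokenized field would see: whole URLs. */
    List<String> untokenizedTermTexts() {
        return new ArrayList<>(untokenizedTerms.keySet());
    }

    /** One pass over the untokenized terms builds a docId -> URL lookup table. */
    String[] urlsByDocId(int maxDoc) {
        String[] urls = new String[maxDoc];
        for (Map.Entry<String, Set<Integer>> entry : untokenizedTerms.entrySet()) {
            for (int docId : entry.getValue()) {
                urls[docId] = entry.getKey();
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        ToyUrlIndex index = new ToyUrlIndex();
        index.addDocument(0, "http://www.cnn.com");
        index.addDocument(1, "http://www.example.org/news");
        System.out.println(index.tokenizedTermTexts());
        System.out.println(index.untokenizedTermTexts());
        System.out.println(Arrays.toString(index.urlsByDocId(2)));
    }
}
```

In this toy model the tokenized terms carry no ordering or separator
information, so the original URL cannot be rebuilt from them, while the
lookup table built from the untokenized terms can be filled once and then
consulted per document without going back to the index each time.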


Shai Erera wrote:
> 
> I think that the simplest solution will be to index the URL field twice,
> once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
> un_tokenized term.
> If you have a document in hand and only want to fetch its URL, then add the
> URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
> COMPRESS and Index.NO.
> 
> Perhaps I don't understand the entire scenario. When do you need to fetch
> the contentLength and URL? To what purpose?
> 
> On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <[EMAIL PROTECTED]>
> wrote:
> 
>>
>> No, I didn't store the contentLength; I only added it to the index. I am
>> still scratching my head over it, as I can't think of another way to
>> retrieve it without continuously using the reader.
>>
>> As for the URL, I use
>> doc.add(new Field("url", url, Store.NO, Index.TOKENIZED)).
>> I would like to keep it this way, with the URL tokenized. I am looking for
>> a way to untokenize it. I retrieved it using a method that retrieves the
>> entire field and then extracts the information in it, but the problem is
>> that the URL is broken down. I am seeking a way to reconstruct it in its
>> original format. Can it be done?
>>
>>
>> Shai Erera wrote:
>> >
>> > Hi
>> >
>> > Regarding the contentLength, when you add it to the document, do you
>> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
>> >
>> > Regarding the URL, how do you add it to the document? For example, if you
>> > do doc.add(new Field("url", "http://www.cnn.com", Store.NO,
>> > Index.UN_TOKENIZED)), it would create a token like
>> > "url:http://www.cnn.com" without breaking it into its parts. Is that what
>> > you're looking for?
>> >
>> > Shai
>> >
>> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <[EMAIL PROTECTED]>
>> > wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> I am currently working on retrieving the URL and contentLength of each
>> >> document found during a search. I want to retrieve them while the score
>> >> is being calculated so that I can influence the score in some other way.
>> >>
>> >> I used the methods from TermDocs and TermEnum to get the information.
>> >> However, the URL I retrieve is, as most of you know, tokenized. It is
>> >> broken down into several parts and I would have to rejoin them. Can
>> >> anyone help me with this? I am stuck here wondering how to get back the
>> >> whole URL without using a Reader.
>> >>
>> >> Also, I tried to retrieve the contentLength, but the results returned
>> >> are null. Why is that? I opened the index using Luke and the
>> >> contentLength is there, but when I try to get it this way, the result
>> >> is null.
>> >>
>> >> Can anyone help me with both of these problems? Any help will be
>> >> appreciated. Thanks
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
>> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> >
>> > Shai Erera
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> 
> 
> -- 
> Regards,
> 
> Shai Erera
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Untokenized-URL-tp18275048p18310348.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.