I think the simplest solution is to index the URL twice: once as Index.TOKENIZED and once as Index.UN_TOKENIZED. You can then look up the un-tokenized term, which holds the whole URL as a single term. If you already have the document in hand and only want to fetch its URL, add the URL twice: once as Store.NO, Index.TOKENIZED (for searching) and once as Store.YES (or Store.COMPRESS), Index.NO (for retrieval).
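To illustrate why the second, stored copy is needed, here is a minimal plain-Java sketch. It is not the Lucene API: the class UrlFieldSketch, its methods, and the toy tokenizer are hypothetical stand-ins, and the tokenizer simply assumes that an analyzer lowercases the URL and splits it on non-alphanumeric characters. It only shows that the tokenized side keeps a dictionary of terms, while the stored side returns the original string unchanged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model, not Lucene: one tokenized (unstored) view and one
// stored (unindexed) view of the same URL.
public class UrlFieldSketch {
    // term -> ids of documents containing it; stands in for Store.NO, Index.TOKENIZED
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<String, SortedSet<Integer>>();
    // doc id -> original URL; stands in for Store.YES, Index.NO
    private final Map<Integer, String> stored = new HashMap<Integer, String>();

    // Assumed analyzer behaviour: lowercase, split on anything non-alphanumeric.
    public static List<String> tokenize(String url) {
        List<String> terms = new ArrayList<String>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Add the URL "twice": once as terms, once as the untouched original.
    public void add(int docId, String url) {
        for (String term : tokenize(url)) {
            SortedSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
        stored.put(docId, url);
    }

    // Search side: which documents contain this term?
    public SortedSet<Integer> search(String term) {
        SortedSet<Integer> docs = postings.get(term);
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    // Walk the term dictionary for one document. Terms come back in
    // dictionary order, and the separators are gone.
    public List<String> termsOf(int docId) {
        List<String> terms = new ArrayList<String>();
        for (Map.Entry<String, SortedSet<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) {
                terms.add(e.getKey());
            }
        }
        return terms;
    }

    // Retrieval side: the stored copy is returned exactly as it was added.
    public String storedUrl(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        UrlFieldSketch index = new UrlFieldSketch();
        index.add(1, "http://www.cnn.com");
        System.out.println(index.termsOf(1));   // prints [cnn, com, http, www]
        System.out.println(index.storedUrl(1)); // prints http://www.cnn.com
    }
}
```

In this toy model, rebuilding the URL from termsOf would mean guessing both the order and the separators, which is why keeping a second, stored copy is the simpler route.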
Perhaps I don't understand the entire scenario. When do you need to fetch the contentLength and URL? To what purpose?

On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:

>
> No, I didn't store the contentLength. Just adding it into the index. Which
> until now I am still scratching my head as I can't think of another way to
> retrieve it without continuously using the reader.
>
> As for the url, I use doc.add(new Field("url", Store.NO,Index.TOKENIZED). I
> will like to keep it this way, having the url being tokenized. I am finding
> a way to UNtokenized it, I retrieved it using a method that will retrieve
> the entire field then extract the information in it. But the problem is, the
> url are broken down. I am seeking a way to reconstruct it to its orgininal
> format. Can it be done?
>
>
> Shai Erera wrote:
> >
> > Hi
> >
> > Regarding the contentLength, when you add it to the document, do you use
> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
> >
> > Regarding the URL, how do you add it to the document? For example, if you do
> > doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> > Index.UN_TOKENIZED), it would create a token like "url: http://www.cnn.com"
> > without breaking it to its parts. Is that what you're looking for?
> >
> > Shai
> >
> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Hi,
> >>
> >> I am currently working on retrieving url and contentLength of each document
> >> found during the search. I want to retrieve it during the calculation of
> >> score so that I can influence the score in some other way.
> >>
> >> I used the methods from TermDocs and TermEnum to get the information.
> >> However, the url I retrieve as is know by most, is tokenized. It is broken
> >> down into several parts and I will have to rejoin them. Can anyone help me
> >> with this? I am stuck here wondering how to get back the whole url without
> >> using a Reader.
> >>
> >> Also, I try to retrieve the contentLength, but the results return are null.
> >> Why is that? I opened the index using Luke and the contentLength is there
> >> but when I try to get it using this way, the results is null.
> >>
> >> Can anyone help me with both of these problems? Any help will be
> >> appreciated. Thanks
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> > --
> > Regards,
> >
> > Shai Erera
> >
>
> --
> View this message in context:
> http://www.nabble.com/Untokenized-URL-tp18275048p18298055.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>

--
Regards,

Shai Erera