I am trying to retrieve the url and use it as a filter. The main problem is that I don't want to use a reader to retrieve the url again for every document found.

    TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms(new Term(field, ""));
    do {
        Term term = termEnum.term();
        // stop once the enumeration runs past the requested field
        if (term == null || !term.field().equals(field)) break;
    } while (termEnum.next());

I am using this code to retrieve the terms of the field containing the
url, but the url is tokenized. Is there any way to un-tokenize it, or is
there a better way to do this?


Shai Erera wrote:
>
> I think that the simplest solution will be to index the URL field twice,
> once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
> un_tokenized term.
> If you have a document in hand and only want to fetch its URL, then add
> the URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
> COMPRESS and Index.NO.
>
> Perhaps I don't understand the entire scenario. When do you need to fetch
> the contentLength and URL? To what purpose?
>
> On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
>
>> No, I didn't store the contentLength; I am just adding it to the index.
>> I am still scratching my head over it, as I can't think of another way
>> to retrieve it without continuously using the reader.
>>
>> As for the url, I use
>> doc.add(new Field("url", url, Store.NO, Index.TOKENIZED)).
>> I would like to keep it this way, with the url tokenized. I am looking
>> for a way to un-tokenize it. I retrieved it using a method that
>> retrieves the entire field and then extracts the information in it. But
>> the problem is that the urls are broken down. I am seeking a way to
>> reconstruct them in their original format. Can it be done?
>>
>>
>> Shai Erera wrote:
>> >
>> > Hi
>> >
>> > Regarding the contentLength, when you add it to the document, do you
>> > *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
>> >
>> > Regarding the URL, how do you add it to the document?
>> > For example, if you do
>> > doc.add(new Field("url", "http://www.cnn.com", Store.NO,
>> > Index.UN_TOKENIZED)), it would create a token like
>> > "url:http://www.cnn.com" without breaking it into its parts. Is that
>> > what you're looking for?
>> >
>> > Shai
>> >
>> > On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <[EMAIL PROTECTED]>
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> I am currently working on retrieving the url and contentLength of
>> >> each document found during the search. I want to retrieve them
>> >> during score calculation so that I can influence the score in some
>> >> other way.
>> >>
>> >> I used the methods from TermDocs and TermEnum to get the information.
>> >> However, the url I retrieve is, as most of you know, tokenized. It is
>> >> broken down into several parts and I will have to rejoin them. Can
>> >> anyone help me with this? I am stuck here wondering how to get back
>> >> the whole url without using a Reader.
>> >>
>> >> Also, I tried to retrieve the contentLength, but the results
>> >> returned are null. Why is that? I opened the index using Luke and
>> >> the contentLength is there, but when I try to get it this way, the
>> >> result is null.
>> >>
>> >> Can anyone help me with both of these problems? Any help will be
>> >> appreciated. Thanks
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Untokenized-URL-tp18275048p18275048.html
>> >> Sent from the Lucene - Java Developer mailing list archive at
>> >> Nabble.com.
>> >
>> > --
>> > Regards,
>> >
>> > Shai Erera
>
> --
> Regards,
>
> Shai Erera
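
As a rough illustration of Shai's suggestion quoted above (index the URL twice, once tokenized and once untokenized), here is a minimal sketch in plain Java. It is a toy model and does not use the Lucene API: the class UrlTwoFieldSketch, the tokenize helper, and the field name "url_exact" are all hypothetical, and the splitting rule only stands in for whatever analyzer the index really uses. What it tries to show is that the tokenized terms no longer carry the separators, so the whole url can only be read back from the untokenized copy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "index the URL twice"; plain Java, NOT the Lucene API. */
class UrlTwoFieldSketch {

    /**
     * Hypothetical stand-in for an analyzer (an assumption, not Lucene's
     * behavior): lower-cases the value and splits on non-alphanumerics.
     */
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<String>();
        for (String part : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Returns the terms one document would contribute, keyed by field name. */
    static Map<String, List<String>> indexUrl(String url) {
        Map<String, List<String>> fields = new LinkedHashMap<String, List<String>>();
        // Tokenized copy: searchable by its parts, but the separators are lost.
        fields.put("url", tokenize(url));
        // Untokenized copy ("url_exact" is a made-up name): one term, the whole URL.
        fields.put("url_exact", Collections.singletonList(url));
        return fields;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = indexUrl("http://www.cnn.com");
        System.out.println("url       -> " + fields.get("url"));
        System.out.println("url_exact -> " + fields.get("url_exact"));
    }
}
```

In Lucene terms this would presumably correspond to adding a second field with Index.UN_TOKENIZED, as Shai describes; which field options fit best depends on whether the url needs to be searched, stored, or both.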