Index url field untokenized
---------------------------

                 Key: NUTCH-541
                 URL: https://issues.apache.org/jira/browse/NUTCH-541
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, searcher
    Affects Versions: 1.0.0
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar
             Fix For: 1.0.0


The url field is currently indexed as Store.YES, Index.TOKENIZED. We also need 
an untokenized version of the url field in some contexts:
1. For deleting duplicates by url (at search time); see NUTCH-455.
2. For restricting the search to a certain url (this may be used for RSS 
search, where each entry in the feed is added as a distinct document with 
possibly the same url). 
   query-url extends FieldQueryFilter, so the url in the query is tokenized: 
    Query: url:http://www.apache.org/
    Parsed: url:"http http-www http-www-apache www www-apache apache org"
    Translated: +url:"http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org"
3. For accessing a document (or documents) in the search servers (using a 
query plugin).
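
The problem in item 2 can be illustrated with a small standalone sketch. It does not use Lucene or Nutch; the tokenize() helper below is a hypothetical, simplified stand-in for the real analyzer (which also emits the combined terms shown in the Parsed line above), and the class and method names are invented for illustration only. It shows that two different urls can produce the same token sequence, so only a comparison on the untokenized value tells them apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class UrlTokenDemo {

    // Hypothetical, simplified analyzer: lower-case the url and split it
    // on every run of non-alphanumeric characters.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String t : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Match on the tokenized form: behaves like a phrase match on tokens.
    static boolean tokenizedMatch(String indexedUrl, String queryUrl) {
        return tokenize(indexedUrl).equals(tokenize(queryUrl));
    }

    // Match on the untokenized form: the whole url is a single term.
    static boolean untokenizedMatch(String indexedUrl, String queryUrl) {
        return indexedUrl.equals(queryUrl);
    }

    public static void main(String[] args) {
        String indexed = "http://www.apache.org/";
        String other = "http://www-apache.org";
        System.out.println(tokenize(indexed));
        System.out.println(tokenizedMatch(indexed, other));
        System.out.println(untokenizedMatch(indexed, other));
    }
}
```

Under these assumptions the tokenized comparison treats the two urls as equal, while the untokenized comparison does not.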

I suggest we add the url field in index-basic as follows, and implement a 
query-url-untoken plugin:
doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
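
As a rough sketch of use case 1, deduplicating hits by exact url at search time could look like the following. This is plain Java rather than Nutch code: the hit representation (a String array of url and title) and the names UrlDedupDemo and dedupByUrl are assumptions made for illustration, not part of any existing API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UrlDedupDemo {

    // Each hit is modeled as {url, title}; this shape is hypothetical.
    // Keeps the first hit seen for every exact (untokenized) url and
    // preserves the original ranking order of the remaining hits.
    static List<String[]> dedupByUrl(List<String[]> hits) {
        Map<String, String[]> firstByUrl = new LinkedHashMap<>();
        for (String[] hit : hits) {
            firstByUrl.putIfAbsent(hit[0], hit);
        }
        return new ArrayList<>(firstByUrl.values());
    }

    public static void main(String[] args) {
        List<String[]> hits = new ArrayList<>();
        hits.add(new String[] {"http://www.apache.org/", "first copy"});
        hits.add(new String[] {"http://lucene.apache.org/", "other page"});
        hits.add(new String[] {"http://www.apache.org/", "second copy"});
        for (String[] hit : dedupByUrl(hits)) {
            System.out.println(hit[0] + " " + hit[1]);
        }
    }
}
```

The comparison key here is the whole url string, which is what an untokenized field would make available as a single term.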


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
