[ https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated NUTCH-541:
------------------------------------

    Fix Version/s:     (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

> Index url field untokenized
> ---------------------------
>
>                 Key: NUTCH-541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-541
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 1.0.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>
> The url field is indexed as Store.YES, Index.TOKENIZED. We also need an
> untokenized version of the url field in some contexts:
> 1. For deleting duplicates by url (at search time). See NUTCH-455.
> 2. For restricting the search to a certain url (may be used in the case of
> RSS search, where each entry in the RSS feed is added as a distinct document
> with (possibly) the same url).
> query-url extends FieldQueryFilter, so:
> Query: url:http://www.apache.org/
> Parsed: url:"http http-www http-www-apache www www-apache apache org"
> Translated: +url:"http-http-www http-www-http-www-apache
> http-www-apache-www www-www-apache www-apache apache org"
> 3. For accessing document(s) in the search servers (using a query plugin).
> I suggest we add url as follows in index-basic and implement a
> query-url-untoken plugin:
>     doc.add(new Field("url", url.toString(), Field.Store.YES,
>                       Field.Index.TOKENIZED));
>     doc.add(new Field("url_untoken", url.toString(), Field.Store.NO,
>                       Field.Index.UN_TOKENIZED));

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.