a url tokenizer implementation for tokenizing index fields : url and host
--------------------------------------------------------------------------
Key: NUTCH-389
URL: http://issues.apache.org/jira/browse/NUTCH-389
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
NutchAnalysis.jj tokenizes its input by treating & and _ as characters that
do not separate tokens, which is not appropriate for URLs. So I have written
a URL tokenizer that emits the tokens matching the regular expression
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html,
which describes the grammar for URIs, URLs can be tokenized with the above
expression.
See also: http://www.mail-archive.com/[email protected]/msg06247.html
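
The idea can be illustrated with a minimal sketch. It is not the attached
patch: the class name UrlTokenizerSketch and the method tokenize are
hypothetical, the code uses only java.util.regex rather than the Lucene/Nutch
analysis classes, and it assumes that a token is a maximal run of characters
matching [a-zA-Z0-9], so every other character (including & and _) acts as a
separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the tokenizer from the patch.
class UrlTokenizerSketch {

    // Assumption: a token is one or more consecutive ASCII letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // Returns the alphanumeric runs of the input in order of appearance.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(url);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // & and _ split tokens here, unlike in NutchAnalysis.jj.
        System.out.println(tokenize("http://www.example.com/a_b&c=1"));
    }
}
```

With this sketch, the input http://www.example.com/a_b&c=1 yields the tokens
http, www, example, com, a, b, c and 1. A real implementation for the url and
host fields would also have to decide on lowercasing and on percent-encoded
characters, which this sketch leaves out.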