[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]
Enis Soztutar updated NUTCH-389:
--------------------------------
Description:
NutchAnalysis.jj tokenizes the input without treating & and _ as token
separators, which is not appropriate for URLs. So I have written
a URL tokenizer that emits the tokens matching the regular expression
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html,
which describes the grammar for URIs, URLs can be tokenized with the above
expression.
NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the "url",
"site" and "host" fields.
see : http://www.mail-archive.com/[email protected]/msg06247.html
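The tokenization rule described above can be illustrated with a minimal, standalone sketch. It is not the code from urlTokenizer.diff and does not use the Lucene or Nutch APIs; the class name UrlTokenSketch, the method names, and the example URL are hypothetical, and the sketch assumes that a token is a maximal run of characters matching [a-zA-Z0-9].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the attached patch.
public class UrlTokenSketch {
    // Assumption: a token is a maximal run of ASCII letters and digits,
    // so every other character (including & and _) separates tokens.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // The index fields the issue says should use the URL tokenizer.
    private static final Set<String> URL_FIELDS =
        new HashSet<>(Arrays.asList("url", "site", "host"));

    // Returns the tokens of the given text in order of appearance.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Returns true if the named field should be tokenized as a URL.
    public static boolean usesUrlTokenizer(String fieldName) {
        return URL_FIELDS.contains(fieldName);
    }

    public static void main(String[] args) {
        // prints [http, www, example, com, my, page, a, 1, b, 2]
        System.out.println(tokenize("http://www.example.com/my_page?a=1&b=2"));
    }
}
```

With this rule, my_page yields the two tokens my and page, whereas a tokenizer that keeps _ inside tokens would yield the single token my_page.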
was:
NutchAnalysis.jj tokenizes the input without treating & and _ as token
separators, which is not appropriate for URLs. So I have written
a URL tokenizer that emits the tokens matching the regular expression
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html,
which describes the grammar for URIs, URLs can be tokenized with the above
expression.
see : http://www.mail-archive.com/[email protected]/msg06247.html
> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
> Key: NUTCH-389
> URL: http://issues.apache.org/jira/browse/NUTCH-389
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Priority: Minor
> Attachments: urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input without treating & and _ as token
> separators, which is not appropriate for URLs. So I have
> written a URL tokenizer that emits the tokens matching the regular expression
> [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html,
> which describes the grammar for URIs, URLs can be tokenized with the above
> expression.
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the
> "url", "site" and "host" fields.
> see : http://www.mail-archive.com/[email protected]/msg06247.html