[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]
Enis Soztutar updated NUTCH-389:
--------------------------------
Attachment: urlTokenizer-improved.diff
This patch improves on the previous URL tokenizer and fixes a minor bug.
This version first decodes characters that are percent-encoded (written in
hexadecimal form) in the URLs.
For example, the URL "file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html" is
first converted to "file:///tmp/foo baz bar/foo/baz~bar/index.html" by
replacing each %20 escape with a space.
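The decoding step might look roughly like the sketch below. It is not taken from the attached diff: the class name PercentDecodeSketch and the method decodePercent are hypothetical, and the sketch assumes the decoded bytes are UTF-8 and that malformed escapes should be left untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the code from urlTokenizer-improved.diff.
class PercentDecodeSketch {

    // Replaces each well-formed %XX escape with the byte it encodes.
    // Malformed escapes (e.g. "%zz" or a trailing "%") are copied as-is.
    // Assumption: the resulting byte sequence is interpreted as UTF-8.
    static String decodePercent(String url) {
        byte[] in = url.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) (in[i + 1] & 0xFF), 16);
                int lo = Character.digit((char) (in[i + 2] & 0xFF), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(in[i]);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(
            decodePercent("file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html"));
    }
}
```

A hand-written loop is used here instead of java.net.URLDecoder because URLDecoder also turns "+" into a space, which is form decoding rather than plain percent-decoding.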
A NullPointerException that occurred when the input reader returned null for
the URL is also fixed.
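As a rough illustration of the null guard combined with the [a-zA-Z0-9] token pattern from the issue description, consider the sketch below. The class UrlTokenSketch and its tokenize method are hypothetical and work on a plain String rather than on the Lucene tokenizer API used by the patch. The "+" quantifier is an assumption so that each run of alphanumerics becomes one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the UrlTokenizer from the attached diff.
class UrlTokenSketch {

    // Assumption: a token is a maximal run of ASCII letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // Returns the alphanumeric tokens of the URL, or an empty list when
    // the URL is null, instead of failing with a NullPointerException.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        if (url == null) {
            return tokens;
        }
        Matcher m = TOKEN.matcher(url);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '&' and '_' act as separators here, unlike in NutchAnalysis.jj.
        System.out.println(tokenize("http://foo_bar.example.org/a&b"));
        System.out.println(tokenize(null));
    }
}
```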
Further improvements to the URL tokenization can be discussed here.
> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
> Key: NUTCH-389
> URL: http://issues.apache.org/jira/browse/NUTCH-389
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Priority: Minor
> Attachments: urlTokenizer-improved.diff, urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator
> characters, which is not appropriate for URLs. So I have written a URL
> tokenizer which emits the tokens that match the regular expression
> [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html,
> which describes the grammar for URIs, URLs can be tokenized with the above
> expression.
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the
> "url", "site" and "host" fields.
> see : http://www.mail-archive.com/[email protected]/msg06247.html