Bug in TextParser with encoding
-------------------------------
Key: NUTCH-632
URL: https://issues.apache.org/jira/browse/NUTCH-632
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.9.0
Environment: Any
Reporter: Antony Bowesman

If a Content object is created with the following Content-Type:

    text/plain; charset="windows-1251"

the Content object discards the charset parameter. As a result, when
TextParser calls

    String encoding = StringUtil.parseCharacterEncoding(content.getContentType());

it always gets null, because the contentType stored in the Content object no
longer contains the charset string. The code has changed a lot since 0.9, so I
am not sure whether this is still a problem, but I made a fix that simply saves
the charset in Content with

    if (this.contentType.startsWith("text/"))
        this.charset = StringUtil.parseCharacterEncoding(contentType);

and has TextParser call

    String encoding = content.getCharset();
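
To make the idea concrete, here is a minimal, self-contained sketch of the fix. It is not the actual Nutch code: CharsetAwareContent is a hypothetical stand-in for Content, and its parseCharacterEncoding is a simplified stand-in for StringUtil.parseCharacterEncoding, written only so the example runs on its own. It assumes the stored contentType is reduced to the bare media type, as described above, and shows the charset being captured before that information is lost.

```java
import java.util.Locale;

/** Hypothetical stand-in for Nutch's Content class; not the real API. */
final class CharsetAwareContent {
    private final String contentType; // media type with parameters stripped
    private final String charset;     // null when absent or not a text/* type

    CharsetAwareContent(String rawContentType) {
        String raw = rawContentType == null ? "" : rawContentType.trim();
        int semi = raw.indexOf(';');
        // Keep only the media type, mirroring the behaviour the report describes.
        this.contentType = (semi < 0 ? raw : raw.substring(0, semi)).trim();
        // The fix: read the charset from the raw header before it is discarded.
        this.charset = this.contentType.startsWith("text/")
                ? parseCharacterEncoding(raw)
                : null;
    }

    String getContentType() { return contentType; }

    String getCharset() { return charset; }

    /** Simplified stand-in for StringUtil.parseCharacterEncoding (assumption). */
    static String parseCharacterEncoding(String contentType) {
        if (contentType == null) return null;
        int start = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (start < 0) return null;
        String value = contentType.substring(start + "charset=".length());
        int end = value.indexOf(';');
        if (end >= 0) value = value.substring(0, end);
        value = value.trim();
        // Strip the optional surrounding double quotes, as in charset="windows-1251".
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1).trim();
        }
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) {
        CharsetAwareContent c =
                new CharsetAwareContent("text/plain; charset=\"windows-1251\"");
        // A parser can now ask the content object directly instead of
        // re-parsing a contentType that no longer carries the parameter.
        System.out.println(c.getContentType() + " / " + c.getCharset());
    }
}
```

With this shape, a parser only needs getCharset() and can fall back to a default encoding when it returns null.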