Bug in TextParser with encoding
-------------------------------

                 Key: NUTCH-632
                 URL: https://issues.apache.org/jira/browse/NUTCH-632
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.9.0
         Environment: Any
            Reporter: Antony Bowesman


If a Content object is created with the Content-Type

    text/plain; charset="windows-1251"

the Content object discards the charset parameter.  As a result, when the 
TextParser calls

    String encoding = StringUtil.parseCharacterEncoding(content.getContentType());

it always gets null, because the contentType stored in the Content object no 
longer contains the charset parameter.  The code has changed a lot since 0.9, 
so I am not sure whether this is still a problem, but I made a fix that simply 
saves the charset in Content with

    if (this.contentType.startsWith("text/"))
        this.charset = StringUtil.parseCharacterEncoding(contentType);

and TextParser just calls

    String encoding = content.getCharset();
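
For illustration, the fix described above could look roughly like the 
following self-contained sketch.  It is not the actual Nutch source: 
ContentSketch and parseCharset are hypothetical names, and parseCharset is 
only a simplified stand-in for StringUtil.parseCharacterEncoding.  The idea 
is to capture the charset before the Content-Type parameters are discarded.

```java
import java.util.Locale;

// Hypothetical stand-in for the patched Content class (assumption, not Nutch code).
class ContentSketch {
    private final String contentType; // bare media type, parameters stripped
    private final String charset;     // charset parameter, or null if absent

    ContentSketch(String rawContentType) {
        String raw = rawContentType == null ? "" : rawContentType;
        int semi = raw.indexOf(';');
        this.contentType = (semi < 0 ? raw : raw.substring(0, semi))
                .trim().toLowerCase(Locale.ROOT);
        // Save the charset from the raw header before the parameters are lost.
        this.charset = this.contentType.startsWith("text/") ? parseCharset(raw) : null;
    }

    String getContentType() { return contentType; }

    String getCharset() { return charset; }

    // Simplified stand-in for StringUtil.parseCharacterEncoding (assumption).
    static String parseCharset(String contentType) {
        if (contentType == null) return null;
        int start = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (start < 0) return null;
        String value = contentType.substring(start + "charset=".length());
        int end = value.indexOf(';');
        if (end >= 0) value = value.substring(0, end);
        value = value.trim();
        // Strip optional surrounding double quotes, as in charset="windows-1251".
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1).trim();
        }
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) {
        ContentSketch c = new ContentSketch("text/plain; charset=\"windows-1251\"");
        System.out.println(c.getContentType() + " / " + c.getCharset());
    }
}
```

With something like this in place, a parser can ask the content object for 
the charset directly instead of re-parsing a Content-Type string that no 
longer carries it.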



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.