[ https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-161:
-----------------------------

    Fix Version/s: 1.0.0
         Assignee: Sami Siren
          Summary: Change Plain text parser to use parser.character.encoding.default
                   property for fall back encoding
                   (was: Plain text parser should use parser.character.encoding.default
                   property for fall back encoding)

> Change Plain text parser to use parser.character.encoding.default property 
> for fall back encoding
> -------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-161
>                 URL: https://issues.apache.org/jira/browse/NUTCH-161
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: any
>            Reporter: KuroSaka TeruHiko
>         Assigned To: Sami Siren
>            Priority: Minor
>             Fix For: 1.0.0
>
>
> The value of the property parser.character.encoding.default is used as a 
> fallback character encoding (charset) when the HTML parser cannot find 
> charset information in the HTTP Content-Type header or in a META HTTP-EQUIV 
> tag. The plain text parser behaves differently: it uses the system 
> encoding (the JVM's file.encoding property, which in turn derives from the OS 
> and the locale of the environment from which the JVM was spawned), so the 
> same document can be decoded differently on different machines. To guarantee 
> consistent behavior, the plain text parser should use the value of the same 
> property.
> Though not tested, these changes in 
> ./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java 
> should do it:
> Insert this statement in the class definition:
>   private static String defaultCharEncoding =
>     NutchConf.get().get("parser.character.encoding.default", "windows-1252");
> Replace this:
>       text = new String(content.getContent());    // use default encoding
> with this:
>       text = new String(content.getContent(), defaultCharEncoding);    // use default encoding
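
The replacement suggested above can be tried outside Nutch with a small
standalone sketch. Note that new String(byte[], String) throws the checked
UnsupportedEncodingException, so this sketch resolves a Charset first and
passes that instead. The class and method names (FallbackDecoding,
resolveCharset, decode) are hypothetical, DEFAULT_FALLBACK only stands in for
the configured property value, and falling back to ISO-8859-1 for an unusable
charset name is an assumption of the sketch, not Nutch behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FallbackDecoding {

    // Hypothetical stand-in for the configured value of
    // parser.character.encoding.default; it is not read from Nutch here.
    static final String DEFAULT_FALLBACK = "windows-1252";

    // Resolve a charset name. ISO-8859-1 is used (an assumption of this
    // sketch) when the name is null, malformed, or unsupported by the JVM.
    static Charset resolveCharset(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and
            // UnsupportedCharsetException, which are both subclasses.
            return StandardCharsets.ISO_8859_1;
        }
    }

    // Decode raw bytes with the named fallback charset instead of the
    // JVM's platform default.
    static String decode(byte[] content, String fallbackName) {
        return new String(content, resolveCharset(fallbackName));
    }

    public static void main(String[] args) {
        // 0xE9 is a single-byte "e with acute accent" in Latin-1 style charsets.
        byte[] raw = { 'c', 'a', 'f', (byte) 0xE9 };
        System.out.println(decode(raw, DEFAULT_FALLBACK));
    }
}
```

Passing an explicit charset makes the result independent of the locale the
JVM was started in, which is the consistency the issue asks for.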

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
