[ https://issues.apache.org/jira/browse/NUTCH-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinci updated NUTCH-625:
------------------------

    Description:
If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8
non-ASCII characters (i.e. characters in some other multi-byte encoding),
the content dumped by the readseg utility has garbled characters throughout
the non-UTF-8 non-ASCII text, and that text cannot be repaired by reloading
it with a different encoding.
The UTF-8 text is unaffected; only the non-UTF-8 text is broken.
Is there any way to repair the broken text?

  was:
If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8
non-ASCII characters (i.e. characters in some other multi-byte encoding),
the dumped content has garbled characters throughout the non-UTF-8
non-ASCII text, and that text cannot be repaired by reloading it with a
different encoding.
The UTF-8 text is unaffected; only the non-UTF-8 text is broken.
Is there any way to repair the broken text?

    Affects Version/s: 1.0.0

> Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)
> --------------------------------------------------------------------------------------
>
>                 Key: NUTCH-625
>                 URL: https://issues.apache.org/jira/browse/NUTCH-625
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Vinci
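
The symptom described above (text that stays garbled no matter which encoding the dump is reopened with) is what one would expect if the non-UTF-8 bytes were decoded as UTF-8 somewhere before being written out. That mechanism is an assumption for illustration only; the report does not confirm where or how readseg performs the conversion. The sketch below uses Big5 as a stand-in legacy encoding and a made-up two-character sample string, and shows that once invalid bytes have been replaced with U+FFFD the original bytes are gone, whereas a byte-preserving decode (Latin-1) remains reversible.

```python
# Hypothetical sample text; Big5 is only an assumed example of a
# non-UTF-8 multi-byte encoding, not something named in the report.
original = "中文"
big5_bytes = original.encode("big5")

# Assumed failure mode: the Big5 bytes are decoded as if they were UTF-8.
# Invalid sequences are replaced with U+FFFD, which discards the bytes.
garbled = big5_bytes.decode("utf-8", errors="replace")

# Writing the garbled text out as UTF-8 and reopening it as Big5
# ("encoding reload") cannot bring the original characters back.
dumped = garbled.encode("utf-8")
reloaded = dumped.decode("big5", errors="replace")

# Contrast: Latin-1 maps every byte to exactly one code point, so a
# Latin-1 decode keeps all bytes and can be undone later.
preserved = big5_bytes.decode("latin-1")
recovered = preserved.encode("latin-1").decode("big5")

print("replacement chars present:", "\ufffd" in garbled)
print("reload restored text:", reloaded == original)
print("byte-preserving path restored text:", recovered == original)
```

If this is indeed what happens, the already-dumped text cannot be repaired after the fact; the content would have to be dumped again from the raw bytes with the correct per-page charset applied.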