[ http://issues.apache.org/jira/browse/NUTCH-145?page=all ]

Sami Siren closed NUTCH-145.
----------------------------


> build of war file fails on Chinese (zh) .xml files due to UTF-8 BOM
> -------------------------------------------------------------------
>
>                 Key: NUTCH-145
>                 URL: http://issues.apache.org/jira/browse/NUTCH-145
>             Project: Nutch
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 0.8
>         Environment: Windows XP, Cygwin, Eclipse, JDK 1.4.1
>            Reporter: KuroSaka TeruHiko
>         Assigned To: Sami Siren
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: NUTCH-145-fix.zip
>
>
> When I ran the ant build from within Eclipse, it failed on 
> src/web/include/zh/header.xml and src/web/pages/zh/*.xml because "document 
> does not have a root element" (translated from the Japanese message).
> On closer inspection, these files begin with an invisible Unicode UTF-8 BOM 
> character, that is EF BB BF in hex, or \357\273\277 in octal.
> Perhaps the JDK 1.4.x UTF-8 converter does not handle the BOM in UTF-8 files. 
> (Note that the BOM was originally intended for the UTF-16 and UTF-32 
> encodings to self-identify their endianness.  But Microsoft started using a 
> UTF-8-ized BOM as a character encoding signature.)
> I also noticed that they use the MS-DOS style end-of-line sequence, CR followed 
> by LF, unlike the other ??/*.xml files, which use UNIX style EOL.
> Fixed files are available.
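
The cleanup described in the report (dropping the leading EF BB BF bytes and converting CR LF to LF) can be sketched as follows. This is only an illustration, not the content of the attached fix: the function name strip_bom_and_crlf is hypothetical, and it assumes the file is read as raw bytes.

```python
import codecs


def strip_bom_and_crlf(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM and with LF line endings.

    Hypothetical helper for illustration; not part of Nutch or the attached fix.
    """
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Convert MS-DOS style CR LF sequences to UNIX style LF.
    return data.replace(b"\r\n", b"\n")
```

Applied to the bytes of a file such as src/web/include/zh/header.xml, the result starts directly with the XML content, so a parser that does not skip the BOM would no longer see stray bytes before the root element.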

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
