David Gao created IO-331:
----------------------------
Summary: BOMInputStream wrongly detects UTF-32LE_BOM files as
UTF-16LE_BOM files in method getBOM()
Key: IO-331
URL: https://issues.apache.org/jira/browse/IO-331
Project: Commons IO
Issue Type: Bug
Components: Streams/Writers
Affects Versions: 2.3
Environment: OS: Win 7 x64
JDK: 1.7.03
Reporter: David Gao
Hi,
BOMInputStream detects the Byte Order Mark correctly for most UTF-encoded
files. However, if a file is UTF-32LE encoded with a BOM, the class reports it
as UTF-16LE instead. This is not the expected behavior.
The problem is in method getBOM(). The UTF-16LE BOM (FF FE) is identical to the
first two bytes of the UTF-32LE BOM (FF FE 00 00), which might be the root
cause of the problem.
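To illustrate the suspected cause, here is a minimal sketch that uses only the JDK. It assumes a detector that tests candidate BOMs in a fixed order and accepts the first one that is a prefix of the data; whether getBOM() actually works this way is an assumption. The class BomPrefixDemo, its BOMS table, and its method names are hypothetical and are not part of Commons IO.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BomPrefixDemo {
    // Hypothetical BOM table for illustration; insertion order is preserved.
    static final Map<String, byte[]> BOMS = new LinkedHashMap<>();
    static {
        BOMS.put("UTF-8", new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        BOMS.put("UTF-16BE", new byte[] {(byte) 0xFE, (byte) 0xFF});
        BOMS.put("UTF-16LE", new byte[] {(byte) 0xFF, (byte) 0xFE});
        BOMS.put("UTF-32BE", new byte[] {0, 0, (byte) 0xFE, (byte) 0xFF});
        BOMS.put("UTF-32LE", new byte[] {(byte) 0xFF, (byte) 0xFE, 0, 0});
    }

    // True if data begins with all bytes of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the name of the first candidate that is a prefix of data, or null.
    static String firstMatch(byte[] data, List<Map.Entry<String, byte[]>> candidates) {
        for (Map.Entry<String, byte[]> candidate : candidates) {
            if (startsWith(data, candidate.getValue())) {
                return candidate.getKey();
            }
        }
        return null;
    }

    // Assumed faulty strategy: candidates are tested in declaration order.
    static String detectInDeclaredOrder(byte[] data) {
        return firstMatch(data, new ArrayList<>(BOMS.entrySet()));
    }

    // Possible fix: test longer BOMs before shorter ones.
    static String detectLongestFirst(byte[] data) {
        List<Map.Entry<String, byte[]>> sorted = new ArrayList<>(BOMS.entrySet());
        sorted.sort(Comparator.comparingInt(
                (Map.Entry<String, byte[]> e) -> e.getValue().length).reversed());
        return firstMatch(data, sorted);
    }

    public static void main(String[] args) {
        // UTF-32LE BOM followed by the letter 't'.
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0, 0, 0x74, 0, 0, 0};
        System.out.println(detectInDeclaredOrder(utf32le)); // prints UTF-16LE
        System.out.println(detectLongestFirst(utf32le));    // prints UTF-32LE
    }
}
```

Testing the longest candidates first still identifies a real UTF-16LE file correctly, because its third and fourth bytes (74 00 in the example data) do not match the 00 00 that the UTF-32LE BOM requires.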
The table below lists the bytes of each UTF encoding for reference. The content
is a BOM followed by the letter 't'.
||Encoding||Byte 1||Byte 2||Byte 3||Byte 4||Byte 5||Byte 6||Byte 7||Byte 8||
|UTF-8|EF|BB|BF|74| | | | |
|UTF-16LE|FF|FE|74|00| | | | |
|UTF-16BE|FE|FF|00|74| | | | |
|UTF-32LE|FF|FE|00|00|74|00|00|00|
|UTF-32BE|00|00|FE|FF|00|00|00|74|
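The byte sequences above can be reproduced with the JDK alone by encoding U+FEFF followed by 't' in each charset. This is only a sketch: the class BomBytesDemo and its hex helper are hypothetical names, and it assumes the running JDK provides the UTF-32LE and UTF-32BE charsets, which the Java specification does not require.

```java
import java.nio.charset.Charset;

final class BomBytesDemo {
    // Encodes text in the named charset and returns the bytes as spaced hex pairs.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FEFF is the BOM character; the endian-specific encoders add no BOM themselves.
        String text = "\uFEFFt";
        String[] charsets = {"UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE"};
        for (String cs : charsets) {
            System.out.println(cs + ": " + hex(text, cs));
        }
    }
}
```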
For now I work around the problem with the following code, which compares the
longest possible prefix against the known BOMs first. I hope it helps.
{code}
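import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.io.ByteOrderMark;

// Reads the leading bytes of the stream and prints the longest matching BOM.
private void detectBOM(InputStream in) throws IOException {
    List<ByteOrderMark> all = availableBOMs();
    int max = 0;
    for (ByteOrderMark bom : all) {
        max = Math.max(max, bom.length());
    }
    // Read up to max bytes; stop at end of stream so -1 is never stored as 0xFF.
    byte[] firstBytes = new byte[max];
    int count = 0;
    while (count < max) {
        int b = in.read();
        if (b < 0) {
            break;
        }
        firstBytes[count++] = (byte) b;
        System.out.print(Integer.toHexString(b).toUpperCase() + " ");
    }
    // Try the longest prefix first so UTF-32LE (FF FE 00 00) wins over UTF-16LE (FF FE).
    boolean found = false;
    for (int j = count; j > 1 && !found; j--) {
        byte[] prefix = Arrays.copyOf(firstBytes, j);
        for (ByteOrderMark mark : all) {
            found = Arrays.equals(prefix, mark.getBytes());
            if (found) {
                System.out.println("\nBOM is: " + mark.getCharsetName());
                break;
            }
        }
    }
}

// All BOMs that the detection should consider.
private static List<ByteOrderMark> availableBOMs() {
    List<ByteOrderMark> all = new ArrayList<ByteOrderMark>();
    all.add(ByteOrderMark.UTF_8);
    all.add(ByteOrderMark.UTF_16BE);
    all.add(ByteOrderMark.UTF_16LE);
    all.add(ByteOrderMark.UTF_32BE);
    all.add(ByteOrderMark.UTF_32LE);
    return all;
}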
{code}