Liyu Yi created IO-354:
--------------------------
Summary: Commons IO Tailer does not respect UTF-8 Charset
Key: IO-354
URL: https://issues.apache.org/jira/browse/IO-354
Project: Commons IO
Issue Type: Bug
Components: Utilities
Affects Versions: 2.3
Environment: JDK 7
RHEL Linux
Apache Commons IO version 2.4
Reporter: Liyu Yi
I found a defect in the source code of
"org.apache.commons.io.input.Tailer.java". The current implementation
does not work for files in a multi-byte encoding such as UTF-8. See the
following snippet:
448     private long readLines(RandomAccessFile reader) throws IOException {
449         StringBuilder sb = new StringBuilder();
450
451         long pos = reader.getFilePointer();
452         long rePos = pos; // position to re-read
453
454         int num;
455         boolean seenCR = false;
456         while (run && ((num = reader.read(inbuf)) != -1)) {
457             for (int i = 0; i < num; i++) {
458                 byte ch = inbuf[i];
459                 switch (ch) {
460                 case '\n':
461                     seenCR = false; // swallow CR before LF
462                     listener.handle(sb.toString());
463                     sb.setLength(0);
464                     rePos = pos + i + 1;
465                     break;
466                 case '\r':
467                     if (seenCR) {
468                         sb.append('\r');
469                     }
470                     seenCR = true;
471                     break;
472                 default:
473                     if (seenCR) {
474                         seenCR = false; // swallow final CR
475                         listener.handle(sb.toString());
476                         sb.setLength(0);
477                         rePos = pos + i + 1;
478                     }
479                     sb.append((char) ch); // add character, not its ascii value
480                 }
481             }
482
483             pos = reader.getFilePointer();
484         }
485
486         reader.seek(rePos); // Ensure we can re-read if necessary
487         return rePos;
488     }
At line 479, each byte is cast to a char on its own, so the bytes of a
multi-byte sequence become separate, incorrect characters and the encoding
is broken.
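To illustrate the problem, here is a minimal sketch, not the actual Tailer
code and not a proposed patch. The class TailerCharsetSketch and its two
methods are hypothetical names made up for this example. castEachByte
mirrors the per-byte cast at line 479; decodeLines shows one possible
alternative, assuming the raw bytes of a line are buffered and decoded with
an explicit Charset only once the line is complete. Splitting on the bytes
'\n' and '\r' is safe for UTF-8 because those values never occur inside a
multi-byte sequence; that assumption does not hold for encodings such as
UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class TailerCharsetSketch {

    // Mirrors line 479: every byte is cast to a char individually,
    // so a multi-byte sequence is never decoded as one character.
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Hypothetical alternative: collect the raw bytes of each line and
    // decode them with the given charset when the line ends.
    static List<String> decodeLines(byte[] data, Charset cs) {
        List<String> lines = new ArrayList<String>();
        ByteArrayOutputStream lineBuf = new ByteArrayOutputStream();
        boolean seenCR = false;
        for (byte b : data) {
            switch (b) {
            case '\n':
                seenCR = false; // swallow CR before LF
                lines.add(new String(lineBuf.toByteArray(), cs));
                lineBuf.reset();
                break;
            case '\r':
                if (seenCR) {
                    lineBuf.write('\r');
                }
                seenCR = true;
                break;
            default:
                if (seenCR) {
                    seenCR = false; // a lone CR also ends the line
                    lines.add(new String(lineBuf.toByteArray(), cs));
                    lineBuf.reset();
                }
                lineBuf.write(b); // keep the raw byte, decode later
            }
        }
        // Bytes after the last line terminator are left undecoded,
        // as an incomplete line would be re-read on the next pass.
        return lines;
    }
}
```

For example, calling decodeLines on the UTF-8 bytes of a line containing
"é" (bytes 0xC3 0xA9) returns the line with that character intact, while
castEachByte turns the same two bytes into two unrelated chars.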