[
https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633510#comment-13633510
]
Sebb commented on IO-354:
-------------------------
Thanks for the patch.
There's a minor issue with the patch, which is that the conversion from bytes
to String relies on the default encoding.
This was probably true of the original implementation.
Perhaps the class needs to support methods which take an encoding parameter?
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
> Key: IO-354
> URL: https://issues.apache.org/jira/browse/IO-354
> Project: Commons IO
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 2.3
> Environment: JDK 7
> RHEL Linux
> Apache Commons IO version 2.4
> Reporter: Liyu Yi
> Labels: Charset, Encoding, Tailer
> Attachments: Tailer-commonsio-354.patch
>
>
> I just realized there is a defect in the source code of
> "org.apache.commons.io.input.Tailer.java". Basically, the current
> implementation does not work for multi-byte encoded files. See the following
> snippet,
> 448 private long readLines(RandomAccessFile reader) throws IOException {
> 449 StringBuilder sb = new StringBuilder();
> 450
> 451 long pos = reader.getFilePointer();
> 452 long rePos = pos; // position to re-read
> 453
> 454 int num;
> 455 boolean seenCR = false;
> 456 while (run && ((num = reader.read(inbuf)) != -1)) {
> 457 for (int i = 0; i < num; i++) {
> 458 byte ch = inbuf[i];
> 459 switch (ch) {
> 460 case '\n':
> 461 seenCR = false; // swallow CR before LF
> 462 listener.handle(sb.toString());
> 463 sb.setLength(0);
> 464 rePos = pos + i + 1;
> 465 break;
> 466 case '\r':
> 467 if (seenCR) {
> 468 sb.append('\r');
> 469 }
> 470 seenCR = true;
> 471 break;
> 472 default:
> 473 if (seenCR) {
> 474 seenCR = false; // swallow final CR
> 475 listener.handle(sb.toString());
> 476 sb.setLength(0);
> 477 rePos = pos + i + 1;
> 478 }
> 479 sb.append((char) ch); // add character, not its ascii
> value
> 480 }
> 481 }
> 482
> 483 pos = reader.getFilePointer();
> 484 }
> 485
> 486 reader.seek(rePos); // Ensure we can re-read if necessary
> 487 return rePos;
> 488 }
> At line 479, the conversion of byte to char type breaks the encoding.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira