[jira] [Created] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Liyu Yi (JIRA) Fri, 26 Oct 2012 17:09:17 -0700

Liyu Yi created IO-354:
--------------------------

             Summary: Commons IO Tailer does not respect UTF-8 Charset
                 Key: IO-354
                 URL: https://issues.apache.org/jira/browse/IO-354
             Project: Commons IO
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 2.3
         Environment: JDK 7 
RHEL Linux
Apache Commons IO version 2.4
            Reporter: Liyu Yi



I found a defect in the source code of 
"org.apache.commons.io.input.Tailer" (Tailer.java). The current 
implementation does not handle files in a multi-byte encoding such as UTF-8 
correctly. See the following snippet:

448    private long readLines(RandomAccessFile reader) throws IOException {
449        StringBuilder sb = new StringBuilder();
450
451        long pos = reader.getFilePointer();
452        long rePos = pos; // position to re-read
453
454        int num;
455        boolean seenCR = false;
456        while (run && ((num = reader.read(inbuf)) != -1)) {
457            for (int i = 0; i < num; i++) {
458                byte ch = inbuf[i];
459                switch (ch) {
460                case '\n':
461                    seenCR = false; // swallow CR before LF
462                    listener.handle(sb.toString());
463                    sb.setLength(0);
464                    rePos = pos + i + 1;
465                    break;
466                case '\r':
467                    if (seenCR) {
468                        sb.append('\r');
469                    }
470                    seenCR = true;
471                    break;
472                default:
473                    if (seenCR) {
474                        seenCR = false; // swallow final CR
475                        listener.handle(sb.toString());
476                        sb.setLength(0);
477                        rePos = pos + i + 1;
478                    }
479                    sb.append((char) ch); // add character, not its ascii value
480                }
481            }
482
483            pos = reader.getFilePointer();
484        }
485
486        reader.seek(rePos); // Ensure we can re-read if necessary
487        return rePos;
488    }

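At line 479, each byte is cast directly to a char. In a multi-byte encoding 
such as UTF-8, every byte of a multi-byte sequence therefore becomes a 
separate, wrong character, and the text passed to the listener is corrupted.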
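
To illustrate the problem, here is a minimal, self-contained sketch (not the 
Tailer source, and not necessarily how the project would fix it). The class 
name TailerCharsetSketch and both method names are hypothetical. The first 
method mirrors the per-byte cast at line 479; the second shows one possible 
approach: collect the raw bytes of each line and decode them in one step with 
an explicit Charset. CR handling is left out for brevity, and splitting on the 
LF byte assumes an ASCII-compatible encoding such as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Commons IO.
final class TailerCharsetSketch {

    /** Mirrors the reported behavior: every byte is cast to a char on its own. */
    static List<String> linesByCastingBytes(byte[] data) {
        List<String> lines = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (b == '\n') {
                lines.add(sb.toString());
                sb.setLength(0);
            } else {
                // Bytes >= 0x80 are negative in Java, so the cast yields a
                // char in the 0xFF80..0xFFFF range instead of decoded text.
                sb.append((char) b);
            }
        }
        return lines;
    }

    /** Possible fix: buffer the raw bytes of a line, then decode them together. */
    static List<String> linesByDecoding(byte[] data, Charset charset) {
        List<String> lines = new ArrayList<String>();
        ByteArrayOutputStream lineBuf = new ByteArrayOutputStream();
        for (byte b : data) {
            if (b == '\n') {
                lines.add(new String(lineBuf.toByteArray(), charset));
                lineBuf.reset();
            } else {
                lineBuf.write(b);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is 5 characters but 6 bytes in UTF-8 (e-acute takes 2 bytes).
        byte[] data = "h\u00e9llo\n".getBytes(StandardCharsets.UTF_8);

        String broken = linesByCastingBytes(data).get(0);
        String decoded = linesByDecoding(data, StandardCharsets.UTF_8).get(0);

        System.out.println(broken.length());             // prints 6
        System.out.println(decoded.length());            // prints 5
        System.out.println(decoded.equals("h\u00e9llo")); // prints true
    }
}
```

With this approach the caller would have to say which Charset the tailed file 
uses; falling back to the platform default when none is given is one possible 
choice.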

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
