[ 
https://issues.apache.org/jira/browse/FLUME-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489504#comment-13489504
 ] 

Nitin Verma commented on FLUME-1676:
------------------------------------

Hi Mike,

I tested constructing Java strings from ISO-8859-1 bytes. Because a Java String 
decodes the given bytes into UTF-16, the conversion is lossy when the charset is 
wrong. (When no charset is given, the platform default is used, UTF-8 here.)

For Flume, we should convert between bytes and strings using the configured 
charset on both ingest and egest, so that the channel gets the same bytes the 
user's source had, and likewise the sink.

string = new String(bytes, charset);
string.getBytes(charset);

TODO: I will run similar tests on streams.
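
As a starting point for the stream tests, here is a minimal sketch of the same round trip through a Reader and a Writer. The class name StreamCharsetSketch and the helper roundTrip are made up for illustration and are not part of Flume; the sketch only assumes the standard InputStreamReader and OutputStreamWriter constructors that take a Charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only (not Flume code).
final class StreamCharsetSketch {

    // Decode the input bytes through a Reader, then encode the characters
    // back through a Writer, using the same charset on both sides.
    static byte[] roundTrip(final byte[] input, final Charset charset) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), charset);
             Writer writer = new OutputStreamWriter(out, charset)) {
            final char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
        // The writer is closed (and flushed) by now, so all bytes are in 'out'.
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] bytes = {(byte) 0x40, (byte) 0xC2, (byte) 0xE6, (byte) 0x40};
        // Matching charset: the bytes should survive the round trip.
        System.out.println(Arrays.equals(bytes, roundTrip(bytes, StandardCharsets.ISO_8859_1)));
        // Wrong charset: malformed UTF-8 input is replaced, so the bytes change.
        System.out.println(Arrays.equals(bytes, roundTrip(bytes, StandardCharsets.UTF_8)));
    }
}
```

If streams behave like strings here, the first check should hold and the second should not, which would argue for passing the charset to the reader explicitly rather than relying on the default.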

Java Test Code
{code:java}
package edu.nitin.testcodes;

import java.nio.charset.Charset;
import org.testng.annotations.Test;

public class CharsetTest {

    @Test
    public void testCharset() {
        final byte[] bytes = new byte[]{(byte) 0x40, (byte) 0xC2,
                (byte) 0xE6, (byte) 0x40};
        final Charset charset = Charset.forName("ISO-8859-1");
        System.out.println("Input bytes");
        print(bytes);

        System.out.println("ingest using charset");
        {
            final String string = new String(bytes, charset);
            System.out.println(string);
            print(string.getBytes());
            print(string.getBytes(charset));
        }

        System.out.println("ingest without using charset");
        {
            final String string = new String(bytes);
            System.out.println(string);
            print(string.getBytes());
            print(string.getBytes(charset));
        }

    }

    private void print(final byte bytes[]) {
        for (byte b : bytes) {
            System.out.printf("  %02X", b);
        }
        System.out.println();
    }
}

{code}

Output
{code}
Input bytes
  40  C2  E6  40
ingest using charset
@Âæ@
  40  C3  82  C3  A6  40
  40  C2  E6  40
ingest without using charset
@��
  40  EF  BF  BD  EF  BF  BD
  40  3F  3F
{code}
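
To help read the second half of the output: EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character the decoder substitutes for malformed input, and 3F is '?', which String.getBytes substitutes when a character cannot be mapped to ISO-8859-1. A small sketch that shows this in isolation (the class name ReplacementBytesSketch and the hex helper are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ReplacementBytesSketch {

    // Format bytes the same way as the print() helper in the test code.
    static String hex(final byte[] bytes) {
        final StringBuilder sb = new StringBuilder();
        for (final byte b : bytes) {
            sb.append(String.format("  %02X", b));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final String replacement = "\uFFFD";
        // U+FFFD encoded as UTF-8: prints "  EF  BF  BD"
        System.out.println(hex(replacement.getBytes(StandardCharsets.UTF_8)));
        // U+FFFD has no ISO-8859-1 mapping, so '?' is used: prints "  3F"
        System.out.println(hex(replacement.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

So once the bytes have been decoded with the wrong charset, the original C2 and E6 are gone, and no choice of charset on the way out can recover them.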

                
> ExecSource should provide a configurable charset
> ------------------------------------------------
>
>                 Key: FLUME-1676
>                 URL: https://issues.apache.org/jira/browse/FLUME-1676
>             Project: Flume
>          Issue Type: Bug
>         Environment: :~/apache-flume-1.4.0-SNAPSHOT/conf# ../bin/flume-ng 
> version
> Flume 1.4.0-SNAPSHOT
> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
> Revision: 831a86fc5501a8624b184ea65e53749df31692b8
> Compiled by jenkins on Tue Oct 30 03:18:08 UTC 2012
> From source with checksum 98685e32b9e500a2305f538b4468faaa
>            Reporter: Suresh Saggar
>
> The character set is currently not configurable in the exec source - 
> http://flume.apache.org/FlumeUserGuide.html#exec-source
> File - 
> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/source/ExecSource.java
> Can somebody please expose the ability to specify character set in the exec 
> source?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
