[jira] [Commented] (FLUME-1676) ExecSource should provide a configurable charset

Nitin Verma (JIRA) Fri, 02 Nov 2012 01:53:18 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489312#comment-13489312
 ]


Nitin Verma commented on FLUME-1676:
------------------------------------

There are a few questions around this request.

Before that, I would like to explain a bit about the two charsets under 
consideration.

Suppose we need to write a²=¼b in ISO-8859-1 
(http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
1. a, b and = fall in the ASCII range, so you can type them directly.
2. ² = B2 and ¼ = BC in hex.

$ awk ' BEGIN { printf "a%s=%sb\n", "\xB2", "\xBC" } ' 
a�=�b

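Note: On a UTF-8 terminal the lone bytes B2 and BC are invalid and show up as 
replacement characters, as above. If this shows up as a²=¼b, then you are on 
ISO-8859-1.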
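
The same bytes can be checked without depending on the terminal charset. This is only a minimal sketch in Python (not Flume code; the variable names are illustrative), using the standard "iso-8859-1" and "utf-8" codecs:

```python
# a²=¼b, written with escapes so the result does not depend on the
# encoding of this source file.
text = "a\u00b2=\u00bcb"

# ISO-8859-1 is a single-byte charset: one byte per character.
latin1_bytes = text.encode("iso-8859-1")
print(latin1_bytes.hex(" "))  # 61 b2 3d bc 62

# Read as UTF-8, the lone bytes B2 and BC are invalid. errors="replace"
# substitutes U+FFFD for each one, which is what a UTF-8 terminal shows.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(as_utf8))  # 'a\ufffd=\ufffdb'
```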

Now let us encode the same in UTF-8 (http://en.wikipedia.org/wiki/UTF-8)

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   (four bytes is the maximum; UTF-8 stops at 0010 FFFF)

The code points of these characters are the same in Unicode (² = B2, ¼ = BC), 
but they have to be encoded, because UTF-8 is not a single-byte charset.

As B2 and BC are > 7F and < 0800, each is encoded in two bytes (110xxxxx 10xxxxxx):
B2 => 1011 0010 => 1100 0010 1011 0010 => C2 B2
BC => 1011 1100 => 1100 0010 1011 1100 => C2 BC
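
The bit shuffling for the two-byte row of the table can be sketched as below. encode_two_byte is a hypothetical helper written for this illustration only; real code would simply use the built-in UTF-8 codec.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0080-07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte range")
    lead = 0b11000000 | (code_point >> 6)     # 110 + top 5 of the 11 bits
    trail = 0b10000000 | (code_point & 0x3F)  # 10 + low 6 bits
    return bytes([lead, trail])


for cp in (0xB2, 0xBC):
    print(f"{cp:02X} => {encode_two_byte(cp).hex(' ').upper()}")
# B2 => C2 B2
# BC => C2 BC
```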

$ awk ' BEGIN { printf "a%s=%sb\n", "\xC2\xB2", "\xC2\xBC" } ' 
a²=¼b

Note: If this shows up as aÂ²=Â¼b, then you are on ISO-8859-1.
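
The reverse mismatch (UTF-8 bytes read as ISO-8859-1) can be reproduced the same way. Again this is just a small Python sketch with illustrative names, not something taken from Flume:

```python
# UTF-8 bytes for a²=¼b: the two non-ASCII characters take two bytes each.
utf8_bytes = "a\u00b2=\u00bcb".encode("utf-8")
print(utf8_bytes.hex(" "))  # 61 c2 b2 3d c2 bc 62

# ISO-8859-1 maps every byte to a character, so decoding never fails.
# Each C2 lead byte becomes a visible character (U+00C2) of its own.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))                              # 7 characters, not 5
print(misread == "a\u00c2\u00b2=\u00c2\u00bcb")  # True
```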

iconv translates the bytes so that text written in the from-charset displays 
the same way on a to-charset terminal.

Thus it adds the C2 lead bytes if I do the following.

$ awk ' BEGIN { printf "a%s=%sb\n", "\xB2", "\xBC" } ' | iconv -f "ISO-8859-1" -t "UTF-8"
a²=¼b
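
What that conversion does to the bytes can be sketched in Python as decode with the from-charset, then encode with the to-charset. This only mirrors the effect for this input; it is not a description of how iconv is implemented.

```python
# The ISO-8859-1 bytes produced by the awk command above.
latin1_bytes = b"a\xb2=\xbcb\n"

# Decode with the from-charset, then encode with the to-charset.
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(utf8_bytes.hex(" "))  # 61 c2 b2 3d c2 bc 62 0a
```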

Warning:
There are many charsets around, and not every charset supports every character, 
so translating between charsets is a lossy business. Example below:
$ awk ' BEGIN { print "\xE0\xA5\x90" } ' | iconv -f "UTF-8" -t "ISO-8859-1"
iconv: illegal input sequence at position 0
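
The same failure can be seen in Python. A minimal sketch: the bytes E0 A5 90 are the UTF-8 encoding of U+0950, a character that ISO-8859-1 has no byte for.

```python
# E0 A5 90 is a valid three-byte UTF-8 sequence for U+0950.
om = b"\xe0\xa5\x90".decode("utf-8")

try:
    om.encode("iso-8859-1")
    converted = True
except UnicodeEncodeError as err:
    # ISO-8859-1 only covers code points 00-FF, so the encode step fails.
    converted = False
    print(f"cannot encode U+{ord(om):04X}: {err.reason}")
```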

Considering all of the above, I feel that

Flume should concentrate on transferring bytes unchanged from one system to 
another, not on translating them. If the two systems use different charsets, then
source system: cat $file
sink system: cat $file | iconv -f source-charset -t sink-charset
should show the same visible output, as long as sink-charset defines all the 
characters defined in source-charset.

** The only guarantee Flume should give is that the bytes delivered to the sink 
are the same as the bytes received from the source **
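
To illustrate the difference between passing bytes through and translating them, here is a rough Python sketch. It is not how ExecSource works; CHILD_CODE is a hypothetical stand-in for the command an exec source would run, and UTF-8 is just an assumed wrong guess at the charset.

```python
import subprocess
import sys

# Hypothetical child command: writes the ISO-8859-1 bytes of a²=¼b to stdout.
CHILD_CODE = r"import sys; sys.stdout.buffer.write(b'a\xb2=\xbcb\n')"
expected = b"a\xb2=\xbcb\n"

# Pass-through: read the child's stdout as bytes and never decode it.
raw = subprocess.run(
    [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, check=True
).stdout

# Translating: decode with an assumed charset (UTF-8 here), then re-encode.
translated = raw.decode("utf-8", errors="replace").encode("utf-8")

print(raw == expected)         # True: pass-through keeps the bytes
print(translated == expected)  # False: the wrong guess changed them
```

With the bytes left untouched, the sink side is still free to run iconv with the right from-charset later.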




                
> ExecSource should provide a configurable charset
> ------------------------------------------------
>
>                 Key: FLUME-1676
>                 URL: https://issues.apache.org/jira/browse/FLUME-1676
>             Project: Flume
>          Issue Type: Bug
>         Environment: :~/apache-flume-1.4.0-SNAPSHOT/conf# ../bin/flume-ng 
> version
> Flume 1.4.0-SNAPSHOT
> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
> Revision: 831a86fc5501a8624b184ea65e53749df31692b8
> Compiled by jenkins on Tue Oct 30 03:18:08 UTC 2012
> From source with checksum 98685e32b9e500a2305f538b4468faaa
>            Reporter: Suresh Saggar
>
> The character set is currently not configurable in the exec source - 
> http://flume.apache.org/FlumeUserGuide.html#exec-source
> File - 
> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/source/ExecSource.java
> Can somebody please expose the ability to specify character set in the exec 
> source?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
