[
https://issues.apache.org/jira/browse/FLUME-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489312#comment-13489312
]
Nitin Verma commented on FLUME-1676:
------------------------------------
There are a few questions around this request.
Before getting to them, I would like to explain a bit about the two charsets
under consideration.
Suppose we need to write a²=¼b in ISO-8859-1
(http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
1. a, b and = fall in the ASCII range, so you can type them directly.
2. ² = B2 and ¼ = BC in hex.
$ awk ' BEGIN { printf "a%s=%sb\n", "\xB2", "\xBC" } '
a�=�b
Note: If this shows up as a²=¼b, then you are on ISO-8859-1.
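The result above depends on the terminal's charset. As a cross-check that does not depend on it, the same bytes can be inspected in Python. This is only a minimal sketch; the variable names are illustrative.

```python
# Encode the example string in ISO-8859-1: one byte per character.
text = "a\u00b2=\u00bcb"  # the string a²=¼b
latin1_bytes = text.encode("iso-8859-1")
print(latin1_bytes.hex(" "))  # 61 b2 3d bc 62

# A UTF-8 decoder cannot interpret the lone bytes B2 and BC,
# so each of them becomes the replacement character U+FFFD.
as_seen_by_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(as_seen_by_utf8)
```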
Now let us encode the same in UTF-8 (http://en.wikipedia.org/wiki/UTF-8)
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
(Code points above 10FFFF are not valid in UTF-8, so the table ends here.)
The hex values of the characters are the same in UTF-8 (² = B2, ¼ = BC), but
they have to be encoded, because UTF-8 is not a single-byte charset.
As B2 and BC are > 7F and < 0800, each is encoded in two bytes (110xxxxx 10xxxxxx)
B2 => 1011 0010 => 1100 0010 1011 0010 => C2 B2
BC => 1011 1100 => 1100 0010 1011 1100 => C2 BC
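The two-byte rule can be checked with a few lines of Python. This is a sketch under the assumption that only the 0080-07FF row of the table is needed; `encode_two_byte` is a hypothetical helper written for this illustration, not a library function.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80-0x7FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)            # upper 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)   # lower 6 bits
    return bytes([lead, trail])


for cp in (0xB2, 0xBC):
    encoded = encode_two_byte(cp)
    # Compare against Python's built-in UTF-8 encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"{cp:02X} => {encoded.hex(' ').upper()}")
```

Both code points come out with the lead byte C2, followed by a byte that happens to equal the original ISO-8859-1 value.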
$ awk ' BEGIN { printf "a%s=%sb\n", "\xC2\xB2", "\xC2\xBC" } '
a²=¼b
Note: If this shows up as aÂ²=Â¼b, then you are on ISO-8859-1.
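How the same UTF-8 bytes read under each charset can also be checked without changing any terminal setting. A minimal sketch in Python, using only the built-in codecs:

```python
# The UTF-8 encoded bytes produced by the awk command above (without the newline).
utf8_bytes = b"a\xc2\xb2=\xc2\xbcb"

# Interpreted as UTF-8, each C2 xx pair is a single character.
as_utf8 = utf8_bytes.decode("utf-8")

# Interpreted as ISO-8859-1, every byte is its own character,
# so each C2 lead byte becomes a visible extra character.
as_latin1 = utf8_bytes.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))  # 5 7
```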
iconv translates the bytes so that text written in the from-charset displays
the same way on a to-charset terminal.
Thus it adds the C2 bytes if I do the following.
$ awk ' BEGIN { printf "a%s=%sb\n", "\xB2", "\xBC" } ' | iconv -f "ISO-8859-1" -t "UTF-8"
a²=¼b
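The same translation can be sketched in Python as a decode followed by an encode. `transcode` is a hypothetical helper for this illustration; it is not how iconv is implemented, it only reproduces the byte-level result for this input.

```python
def transcode(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Reinterpret bytes written in from_charset as bytes in to_charset."""
    return data.decode(from_charset).encode(to_charset)


latin1_line = b"a\xb2=\xbcb\n"  # output of the awk command above
utf8_line = transcode(latin1_line, "iso-8859-1", "utf-8")
print(utf8_line.hex(" "))  # 61 c2 b2 3d c2 bc 62 0a
```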
Warning:
There are many charsets around, and not all of them support all characters.
Byte translation is therefore a lossy business. For example:
$ awk ' BEGIN { print "\xE0\xA5\x90" } ' | iconv -f "UTF-8" -t "ISO-8859-1"
iconv: illegal input sequence at position 0
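The same failure can be reproduced in Python, where the three bytes decode to a single character (U+0950) that ISO-8859-1 has no slot for. A minimal sketch; the `errors="replace"` branch is included only to show what a lossy fallback would look like.

```python
om = b"\xe0\xa5\x90".decode("utf-8")  # one character, U+0950

try:
    om.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    # ISO-8859-1 only covers code points 00-FF.
    strict_failed = True

# A lossy fallback substitutes '?' and the original character is gone.
lossy = om.encode("iso-8859-1", errors="replace")
print(strict_failed, lossy)
```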
Considering all of the above, I feel that
Flume should concentrate on transferring bytes unchanged from one system to
another, not on translating them. If the charsets of the two systems differ, then
source system: cat $file
sink system: cat $file | iconv -f source-charset -t sink-charset
should show the same visible output, as long as sink-charset defines all the
characters defined in source-charset.
** The only guarantee Flume should give is that the bytes delivered to the sink
are the same as the bytes received from the source **
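To illustrate why a byte-for-byte guarantee is the safer contract, the sketch below compares two hypothetical transports. Neither function is Flume code; `pass_through` and `decode_and_reencode` are made-up names, and the assumed charset is deliberately wrong to show the failure mode.

```python
def pass_through(payload: bytes) -> bytes:
    """Hypothetical transport that never interprets the payload."""
    return bytes(payload)


def decode_and_reencode(payload: bytes, assumed_charset: str) -> bytes:
    """Hypothetical transport that turns the bytes into text and back."""
    text = payload.decode(assumed_charset, errors="replace")
    return text.encode(assumed_charset)


source_bytes = b"a\xb2=\xbcb\n"  # ISO-8859-1 data emitted by the source

intact = pass_through(source_bytes)
# The transport wrongly assumes UTF-8, so B2 and BC are replaced by U+FFFD.
damaged = decode_and_reencode(source_bytes, "utf-8")

print(intact == source_bytes, damaged == source_bytes)  # True False
```

With the pass-through version, the sink can still run iconv with the correct source charset; with the second version, the original bytes can no longer be recovered.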
> ExecSource should provide a configurable charset
> ------------------------------------------------
>
> Key: FLUME-1676
> URL: https://issues.apache.org/jira/browse/FLUME-1676
> Project: Flume
> Issue Type: Bug
> Environment: :~/apache-flume-1.4.0-SNAPSHOT/conf# ../bin/flume-ng version
> Flume 1.4.0-SNAPSHOT
> Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
> Revision: 831a86fc5501a8624b184ea65e53749df31692b8
> Compiled by jenkins on Tue Oct 30 03:18:08 UTC 2012
> From source with checksum 98685e32b9e500a2305f538b4468faaa
> Reporter: Suresh Saggar
>
> The character set is currently not configurable in the exec source -
> http://flume.apache.org/FlumeUserGuide.html#exec-source
> File -
> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/source/ExecSource.java
> Can somebody please expose the ability to specify character set in the exec
> source?