[
https://issues.apache.org/jira/browse/SPARK-29160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jungtaek Lim updated SPARK-29160:
---------------------------------
Description:
This issue is from observation by [~vanzin] :
[https://github.com/apache/spark/pull/25670#discussion_r325383512]
Quoting his comment here:
{quote}
This is a long standing bug in the original code, but this should be explicitly
setting the charset to UTF-8 (using new PrintWriter(new
OutputStreamWriter(...)).
The reader side should too, although doing that now could potentially break old
logs... we should open a bug for this.
{quote}
EventLoggingListener encodes as UTF-8 correctly where it converts to byte[]
before writing, but logEvent() does not specify a charset.
This should be fixed, but as Marcelo said, we also need to be aware that the
change could break the reading of old logs.
was:
This issue is from observation by [~vanzin] :
[https://github.com/apache/spark/pull/25670#discussion_r325383512]
Quoting his comment here:
{noformat}
This is a long standing bug in the original code, but this should be explicitly
setting the charset to UTF-8 (using new PrintWriter(new
OutputStreamWriter(...)).
The reader side should too, although doing that now could potentially break old
logs... we should open a bug for this.{noformat}
EventLoggingListener encodes as UTF-8 correctly where it converts to byte[]
before writing, but logEvent() does not specify a charset.
This should be fixed, but as Marcelo said, we also need to be aware that the
change could break the reading of old logs.
> Event log file is written without specific charset which should be ideally
> UTF-8
> --------------------------------------------------------------------------------
>
> Key: SPARK-29160
> URL: https://issues.apache.org/jira/browse/SPARK-29160
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> This issue is from observation by [~vanzin] :
> [https://github.com/apache/spark/pull/25670#discussion_r325383512]
> Quoting his comment here:
> {quote}
> This is a long standing bug in the original code, but this should be
> explicitly setting the charset to UTF-8 (using new PrintWriter(new
> OutputStreamWriter(...)).
> The reader side should too, although doing that now could potentially break
> old logs... we should open a bug for this.
> {quote}
> EventLoggingListener encodes as UTF-8 correctly where it converts to byte[]
> before writing, but logEvent() does not specify a charset.
> This should be fixed, but as Marcelo said, we also need to be aware that the
> change could break the reading of old logs.
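
For illustration, the explicit-charset pattern named in the quoted comment
(new PrintWriter(new OutputStreamWriter(...))) could look roughly like the
sketch below. This is not Spark's code: the class Utf8EventLogSketch and the
helpers utf8Writer, utf8Reader, writeLine and readLine are hypothetical names,
and in-memory streams stand in for the real event log file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; class and method names are hypothetical, not Spark's.
public class Utf8EventLogSketch {

    // Writer side: name the charset instead of relying on the JVM default.
    static PrintWriter utf8Writer(OutputStream out) {
        return new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Reader side: decode with the same explicit charset.
    static BufferedReader utf8Reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Encodes one event line, terminated by '\n', into UTF-8 bytes.
    static byte[] writeLine(String line) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintWriter writer = utf8Writer(buffer)) {
            writer.print(line);
            writer.print('\n'); // fixed terminator; println() uses the platform separator
        }
        return buffer.toByteArray();
    }

    // Decodes the first line from UTF-8 bytes.
    static String readLine(byte[] bytes) {
        try (BufferedReader reader = utf8Reader(new ByteArrayInputStream(bytes))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A made-up event line containing one non-ASCII character.
        String event = "{\"App Name\":\"caf\u00e9\"}";
        byte[] bytes = writeLine(event);
        System.out.println(readLine(bytes).equals(event));
    }
}
```

With both sides pinned to UTF-8, the round trip no longer depends on the
JVM's default charset. As the issue notes, logs already written under a
different default charset may not decode the same way once the reader is
changed, so the reader side needs more care than the writer side.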