[ https://issues.apache.org/jira/browse/BEAM-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981254#comment-15981254 ]
ASF GitHub Bot commented on BEAM-2060:
--------------------------------------
GitHub user jbonofre opened a pull request:
https://github.com/apache/beam/pull/2660
[BEAM-2060] Allow to specify charset in XmlIO
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
- [X] Make sure the PR title is formatted like:
`[BEAM-<Jira issue #>] Description of pull request`
- [X] Make sure tests pass via `mvn clean verify`. (Even better, enable
Travis-CI on your fork and ensure the whole test matrix passes).
- [X] Replace `<Jira issue #>` in the title with the actual Jira issue
number, if there is one.
- [X] If this contribution is large, please file an Apache
[Individual Contributor License
Agreement](https://www.apache.org/licenses/icla.pdf).
---
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jbonofre/beam BEAM-2060-ENCODING
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/2660.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2660
----
commit 10a11bbd143aea9ce4fd5d0d714329ea5599e052
Author: Jean-Baptiste Onofré <[email protected]>
Date: 2017-04-24T14:37:40Z
[BEAM-2060] Allow to specify charset in XmlIO
----
> XmlSource uses hardcoded Charset
> --------------------------------
>
> Key: BEAM-2060
> URL: https://issues.apache.org/jira/browse/BEAM-2060
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: 0.6.0
> Reporter: Damien GOUYETTE
> Assignee: Jean-Baptiste Onofré
>
> When I use a file encoded in ISO-8859-1 that contains the character *é*, I get an
> exception like:
> {code}
> Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
> at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
> ... 19 more
> {code}
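For context on the trace above: in ISO-8859-1 the character é is the single byte 0xE9, which a UTF-8 decoder treats as the lead byte of a three-byte sequence. The byte that follows (0x64 is the letter d) is then not a valid continuation byte, which is presumably what the "Invalid UTF-8 middle byte 0x64" message reports. Below is a minimal standalone sketch of that mismatch using only java.nio.charset. The class and method names are hypothetical, and none of this is taken from XmlSource or Woodstox.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Beam.
final class CharsetMismatchDemo {

    // Decode strictly: malformed input raises an exception instead of
    // being silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // "é" followed by "d", encoded as ISO-8859-1: bytes 0xE9 0x64.
        byte[] latin1 = "\u00e9d".getBytes(StandardCharsets.ISO_8859_1);

        try {
            // Decoding with the charset the file was written in succeeds.
            System.out.println("ISO-8859-1: " + decodeStrict(latin1, StandardCharsets.ISO_8859_1));
            // Decoding the same bytes as UTF-8 fails: 0xE9 announces a
            // multi-byte sequence, but 0x64 is not a continuation byte.
            System.out.println("UTF-8: " + decodeStrict(latin1, StandardCharsets.UTF_8));
        } catch (CharacterCodingException e) {
            System.out.println("UTF-8 decoding failed: " + e);
        }
    }
}
```

The sketch only shows why the bytes are rejected. It says nothing about where in the reader the charset would have to be configured.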
> The encoding is hardcoded in:
> https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
> https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
> https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342
>
> It would be great if I could specify it like this:
> {code}
> XmlSource.from[MyClass](input)
> .withRootElement("ROOT_ELEMENT")
> .withRecordElement("MyClass")
> .withRecordClass(classOf[MyClass])
> .withCharset(StandardCharsets.ISO_8859_1)
> {code}
> I can provide a pull request if you want.
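
As a rough illustration of what a configurable charset means at the parser level, the sketch below reads an ISO-8859-1 document through the standard javax.xml.stream (StAX) API by decoding the byte stream with a caller-supplied Charset before handing it to the parser. This is an assumption-laden sketch, not the change made in the pull request: the class name, the readText helper, and the sample document are hypothetical, and it uses whichever StAX implementation the JDK provides rather than Beam's XmlSource.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Beam.
final class ExplicitCharsetXmlRead {

    // Concatenate all character data in the document, decoding the raw
    // bytes with the charset chosen by the caller instead of a fixed one.
    static String readText(byte[] xml, Charset charset) throws XMLStreamException {
        Reader decoded = new InputStreamReader(new ByteArrayInputStream(xml), charset);
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(decoded);
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                // Character data may arrive in several CHARACTERS events.
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document encoded as ISO-8859-1, containing "é" (byte 0xE9).
        byte[] xml = "<name>Andr\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(xml, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset as a parameter mirrors the withCharset idea from the request above: the caller, who knows how the file was written, decides how the bytes are decoded.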