[
https://issues.apache.org/jira/browse/FLINK-36627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894945#comment-17894945
]
Baiqing Lyu commented on FLINK-36627:
-------------------------------------
I took a brief look at this problem, and it seems that the current
_CsvReaderFormat_ class does not expose a way for users to specify a character
encoding.
One potential solution would be to add new _forPojo_ and _forSchema_
builders that accept a charset option; the object
_org.apache.commons.io.Charsets_ would probably work here.
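
To illustrate the mismatch itself, here is a minimal sketch that uses only the JDK and no Flink classes. The class name _CharsetSketch_, its helper methods, and the sample string are hypothetical and chosen for illustration only; the sample assumes the file contains a non-ASCII ISO-8859-1 byte followed by 'A' (0x41), which would be consistent with the reported "middle byte 0x41" message but is not confirmed by the report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Flink or Jackson.
final class CharsetSketch {

    /** Returns true only if the bytes are well-formed UTF-8 (strict check, no replacement). */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes the bytes with the given source charset and re-encodes them as UTF-8. */
    static byte[] transcodeToUtf8(byte[] bytes, Charset sourceCharset) {
        return new String(bytes, sourceCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample: in ISO-8859-1, N-with-tilde (U+00D1) is the single byte 0xD1.
        // A UTF-8 decoder treats 0xD1 as the lead byte of a two-byte sequence, so the
        // following 'A' (0x41) is rejected as an invalid continuation byte.
        byte[] latin1 = "ESPA\u00D1A".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(latin1)); // false

        // Decoding with the correct charset first yields valid UTF-8.
        byte[] utf8 = transcodeToUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8)); // true
    }
}
```

A charset option on the format would let the reader decode with the file's actual encoding instead of assuming UTF-8; until then, transcoding the file to UTF-8 before reading it is one possible workaround.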
Finally, a question for existing members: is this a necessary addition, or is
it something that is not expected to be supported?
I'm new to contributing. After reviewing the code contribution process, I
figured commenting here was appropriate; let me know if I should be using the
mailing list or another channel instead.
> Failure to process a CSV file in Flink due to a character encoding mismatch:
> the file is in ISO-8859 and the application expects UTF-8.
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-36627
> URL: https://issues.apache.org/jira/browse/FLINK-36627
> Project: Flink
> Issue Type: Bug
> Reporter: Hector Miuler Malpica Gallegos
> Priority: Major
>
> I get an error when reading a CSV file encoded in ISO-8859. The error is the following:
> Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x41 (at char #1247, byte #1246): check content encoding, does not look like UTF-8
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.impl.UTF8Reader.reportInvalidOther(UTF8Reader.java:520)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.impl.UTF8Reader.reportDeferredInvalid(UTF8Reader.java:531)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.impl.UTF8Reader.read(UTF8Reader.java:177)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.loadMore(CsvDecoder.java:458)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder._nextUnquotedString(CsvDecoder.java:782)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.impl.CsvDecoder.nextString(CsvDecoder.java:732)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.CsvParser._handleNextEntry(CsvParser.java:963)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.dataformat.csv.CsvParser.nextFieldName(CsvParser.java:763)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:321)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:177)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.MappingIterator.nextValue(MappingIterator.java:283)
>     at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:199)
>     ... 11 more
>
>
> My code is the following:
> val env = StreamExecutionEnvironment.createLocalEnvironment()
> val csvFormat = CsvReaderFormat.forPojo(Empresa::class.java)
> val csvSource = FileSource
>     .forRecordStreamFormat(csvFormat, Path("/miuler/PadronRUC_202410.csv"))
>     .build()
> val empresaStreamSource = env.fromSource(csvSource, WatermarkStrategy.noWatermarks(), "CSV Source")
> empresaStreamSource.print()
> env.execute("Load CSV")
>
>
> My dependencies:
> val kotlinVersion = "1.20.0"
> dependencies {
>     implementation("org.apache.flink:flink-shaded-jackson:2.15.3-19.0")
>     implementation("org.apache.flink:flink-core:$kotlinVersion")
>     implementation("org.apache.flink:flink-runtime:$kotlinVersion")
>     implementation("org.apache.flink:flink-runtime-web:$kotlinVersion")
>     implementation("org.apache.flink:flink-clients:$kotlinVersion")
>     implementation("org.apache.flink:flink-streaming-java:$kotlinVersion")
>     implementation("org.apache.flink:flink-csv:$kotlinVersion")
>     implementation("org.apache.flink:flink-connector-base:$kotlinVersion")
>     implementation("org.apache.flink:flink-connector-files:$kotlinVersion")
> }
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)