[ 
https://issues.apache.org/jira/browse/FLINK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260178#comment-17260178
 ] 

Jark Wu edited comment on FLINK-20795 at 1/7/21, 2:37 AM:
----------------------------------------------------------

If we want to refactor this configuration, I would suggest investigating how 
other projects handle this, e.g. Spark, Hive, Presto, Kafka. 

For example, Spark provides a ParseMode for dealing with corrupt records during 
parsing. It allows the following modes:

- PERMISSIVE : sets the other fields to null when it meets a corrupted record, 
and puts the malformed string into a new field configured by 
columnNameOfCorruptRecord. When a schema is set by the user, it sets null for 
extra fields.
- DROPMALFORMED : ignores corrupted records entirely.
- FAILFAST : throws an exception when it meets a corrupted record.
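
To make the three modes concrete, here is a minimal sketch in plain Python. It 
is not Spark's implementation and not a proposed Flink API: the function name 
parse_records, the fields argument, and the default corrupt column name 
"_corrupt_record" are assumptions made for illustration only. It only mimics 
the behaviour described above for line-delimited JSON.

```python
import json
from enum import Enum
from typing import Dict, Iterable, Iterator, List, Optional


class ParseMode(Enum):
    """Mode names taken from the list above; the enum itself is illustrative."""
    PERMISSIVE = "PERMISSIVE"
    DROPMALFORMED = "DROPMALFORMED"
    FAILFAST = "FAILFAST"


def parse_records(
    lines: Iterable[str],
    fields: List[str],
    mode: ParseMode,
    corrupt_column: str = "_corrupt_record",  # hypothetical default name
) -> Iterator[Dict[str, Optional[object]]]:
    """Parse JSON lines into rows, handling corrupt input according to mode."""
    for line in lines:
        try:
            parsed = json.loads(line)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            if mode is ParseMode.FAILFAST:
                # Fail the whole job on the first corrupt record.
                raise ValueError(f"malformed record: {line!r}") from exc
            if mode is ParseMode.DROPMALFORMED:
                # Silently skip the corrupt record.
                continue
            # PERMISSIVE: null out the data fields, keep the raw string.
            row = {name: None for name in fields}
            row[corrupt_column] = line
            yield row
            continue
        # Well-formed record: fields missing from the input become None.
        row = {name: parsed.get(name) for name in fields}
        row[corrupt_column] = None
        yield row
```

In this sketch PERMISSIVE is the only mode that both keeps the job running and 
preserves the dirty record for later inspection, which is the combination the 
issue asks for.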

See 
https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/DataFrameReader.html
 and 


> add a parameter to decide whether print dirty record when 
> `ignore-parse-errors` is true
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-20795
>                 URL: https://issues.apache.org/jira/browse/FLINK-20795
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table 
> SQL / Ecosystem
>    Affects Versions: 1.13.0
>            Reporter: zoucao
>            Priority: Major
>
> Add a parameter to decide whether to print dirty data when 
> `ignore-parse-errors`=true. Some users want their tasks to stay stable and 
> also want to know which records are dirty so they can fix the upstream.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
