[ https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581431#comment-16581431 ]

ASF GitHub Bot commented on FLINK-9964:
---------------------------------------

buptljy commented on issue #6541: [FLINK-9964] [table] Add a CSV table format factory
URL: https://github.com/apache/flink/pull/6541#issuecomment-413285602
 
 
   @twalthr 
   I've replied to a few comments above and optimized some code according to your comments.
   I've finished:
   1. Null value configuration (see the sketch after this list).
   2. Schema derivation.
   3. Some optimizations.
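   
   For reference, null-value handling in plain jackson-dataformat-csv looks roughly like this (a minimal sketch, not the exact code from this PR; the column name `a`, the null literal `"null"`, and the class name are placeholders):
   
   ```
   import com.fasterxml.jackson.databind.JsonNode;
   import com.fasterxml.jackson.dataformat.csv.CsvMapper;
   import com.fasterxml.jackson.dataformat.csv.CsvSchema;

   public class NullValueSketch {
       public static void main(String[] args) throws Exception {
           // Treat the literal "null" in the input as a null value.
           CsvSchema schema = CsvSchema.builder()
                   .addColumn("a", CsvSchema.ColumnType.STRING)
                   .setNullValue("null")
                   .build();
           CsvMapper mapper = new CsvMapper();
           JsonNode row = mapper.readerFor(JsonNode.class).with(schema).readValue("null\n");
           System.out.println(row.get("a").isNull()); // true: mapped to a null node
       }
   }
   ```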
   
   About the encoding: the encoding of CSV data can only be one of the constants of com.fasterxml.jackson.core.JsonEncoding, and the Jackson reader is able to detect the encoding automatically according to the rules of [rfc4627](http://www.ietf.org/rfc/rfc4627.txt). So we don't need to set the encoding manually, and we can't allow users to use encodings that JsonEncoding doesn't support, such as 'latin' (Latin-1).
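   
   For reference, the full set of supported encodings can be listed directly from the enum (a trivial check, nothing PR-specific; the class name is a placeholder):
   
   ```
   import com.fasterxml.jackson.core.JsonEncoding;

   public class SupportedEncodings {
       public static void main(String[] args) {
           // JsonEncoding enumerates every charset Jackson can auto-detect:
           // UTF-8, UTF-16 (BE/LE) and UTF-32 (BE/LE); Latin-1 is not among them.
           for (JsonEncoding enc : JsonEncoding.values()) {
               System.out.println(enc.getJavaName());
           }
       }
   }
   ```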
   
   About the byte array: the byte-array logic is awkward because of Jackson internals, which I explained in CsvRowSerializationSchema (line 159). We treat a byte array as a string to avoid unnecessary logic, because Jackson handles byte arrays with Base64 (CsvGenerator, line 691). This means users cannot pass in their original byte array; otherwise they cannot get the original content back after serializing or deserializing (see the code below). Additionally, a byte array is represented as a BinaryNode in Jackson, so we cannot convert a byte array the way we convert other arrays.
   
   ```
   import java.util.Arrays;
   import com.fasterxml.jackson.databind.JsonNode;
   import com.fasterxml.jackson.dataformat.csv.CsvMapper;
   import com.fasterxml.jackson.dataformat.csv.CsvSchema;

   byte[] origin = "123".getBytes();
   CsvSchema schema = CsvSchema.builder()
           .addColumn("a", CsvSchema.ColumnType.STRING)
           .build();
   CsvMapper cm = new CsvMapper();
   JsonNode result = cm.readerFor(JsonNode.class).with(schema).readValue(origin);
   // readValue() yields an ObjectNode of text values; binaryValue() returns
   // null for a non-binary node, so the original bytes are not recovered.
   byte[] transformed = result.binaryValue();
   System.out.println(Arrays.equals(transformed, origin)); // expect true, actual false
   ```
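   
   The serialization side shows the same problem (again a minimal sketch against plain jackson-dataformat-csv rather than the PR code; the class name is a placeholder):
   
   ```
   import java.util.Collections;
   import com.fasterxml.jackson.dataformat.csv.CsvMapper;
   import com.fasterxml.jackson.dataformat.csv.CsvSchema;

   public class ByteArraySketch {
       public static void main(String[] args) throws Exception {
           CsvMapper mapper = new CsvMapper();
           CsvSchema schema = CsvSchema.builder().addColumn("a").build();
           // Jackson serializes byte[] as Base64 text, not as raw bytes.
           String csv = mapper.writer(schema)
                   .writeValueAsString(Collections.singletonMap("a", "123".getBytes()));
           System.out.println(csv); // prints "MTIz", the Base64 form of "123"
       }
   }
   ```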


> Add a CSV table format factory
> ------------------------------
>
>                 Key: FLINK-9964
>                 URL: https://issues.apache.org/jira/browse/FLINK-9964
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Table API & SQL
>            Reporter: Timo Walther
>            Assignee: buptljy
>            Priority: Major
>              Labels: pull-request-available
>
> We should add an RFC 4180-compliant CSV table format factory to read and write
> data into Kafka and other connectors. This requires a
> {{SerializationSchemaFactory}} and a {{DeserializationSchemaFactory}}. How we
> want to represent all data types and nested types is still up for discussion.
> For example, we could flatten and unflatten nested types as it is done
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a
> look at how tools such as the Avro to CSV tool perform the conversion.
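
One possible flattening convention for nested types (a hypothetical sketch only; the issue leaves the representation open, and the dotted column names and the `flatten` helper are my own illustration):

```
import java.util.LinkedHashMap;
import java.util.Map;

public class FlattenSketch {
    // Flatten {"user": {"name": "bob", "age": 3}} into
    // {"user.name": "bob", "user.age": 3}, so each leaf maps to one CSV column.
    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(String prefix, Map<String, Object> nested,
                                       Map<String, Object> out) {
        for (Map.Entry<String, Object> e : nested.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                flatten(key, (Map<String, Object>) e.getValue(), out);
            } else {
                out.put(key, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", "bob");
        user.put("age", 3);
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("user", user);
        System.out.println(flatten("", row, new LinkedHashMap<>())); // {user.name=bob, user.age=3}
    }
}
```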


