[jira] [Commented] (FLINK-16296) Improve performance of BaseRowSerializer#serialize() for GenericRow

Kurt Young (Jira) Thu, 27 Feb 2020 01:07:17 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046366#comment-17046366
 ]


Kurt Young commented on FLINK-16296:
------------------------------------

The reason behind this was a non-obvious design choice:

 
{noformat}
BinaryRow is the standard binary format for all base rows.
{noformat}
In other words, `BinaryRow` controls the binary format for all base rows. So 
the safest way for a `BaseRowSerializer` to generate a correct binary format is 
to convert the base row to a `BinaryRow` first and then do the serialization 
via a byte copy. 
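
To make the two steps concrete, here is a minimal sketch under loud assumptions: `TwoCopySketch`, `toBinaryRow` and the toy layout (an int length prefix, one 8-byte null bitset, then one 8-byte slot per `Long` field, at most 64 fields) are hypothetical stand-ins, not Flink's actual classes or its real binary format. It only shows that the layout knowledge lives in one place and the serializer just copies bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for the "convert first, then copy bytes" path. */
public class TwoCopySketch {

    // The only place that knows the (assumed, toy) layout:
    // 8-byte null bitset followed by one 8-byte slot per field.
    static byte[] toBinaryRow(Long[] fields) {
        ByteBuffer buf = ByteBuffer.allocate(8 + 8 * fields.length);
        long nullBits = 0L;
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] == null) {
                nullBits |= 1L << i; // mark field i as null
            }
        }
        buf.putLong(nullBits);
        for (Long f : fields) {
            buf.putLong(f == null ? 0L : f); // null fields get a zeroed slot
        }
        return buf.array();
    }

    // The serializer does not know the layout; it converts and copies.
    static void serialize(Long[] genericRow, DataOutputStream out) throws IOException {
        byte[] binary = toBinaryRow(genericRow); // copy 1: row -> binary form
        out.writeInt(binary.length);             // assumed length prefix
        out.write(binary);                       // copy 2: binary form -> output
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        serialize(new Long[] {7L, null, 9L}, new DataOutputStream(bos));
        System.out.println(bos.size() + " bytes written");
    }
}
```

The second copy is pure overhead, but the serializer cannot get the format wrong, because it never touches the format.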

 

If we want to do this optimization, we have to break our old design choice and 
instead say:

 
{noformat}
We establish a standard binary format somewhere in our code base, and all base 
rows should comply with that standard, including BinaryRow and GenericRow.
{noformat}
It may not sound like a big deal, but IMO it is quite important for developers 
and for future modifications. 

 

 

> Improve performance of BaseRowSerializer#serialize() for GenericRow
> -------------------------------------------------------------------
>
>                 Key: FLINK-16296
>                 URL: https://issues.apache.org/jira/browse/FLINK-16296
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Runtime
>            Reporter: Jark Wu
>            Priority: Major
>
> Currently, when serializing a {{GenericRow}} using 
> {{BaseRowSerializer#serialize()}}, there are 2 memory copies. The first is 
> GenericRow -> BinaryRow, the second is BinaryRow -> DataOutputView. 
> However, in theory, we can serialize GenericRow into DataOutputView directly, 
> because we already have all the column values and types. We can serialize the 
> null bit part for all columns, then the fixed-length part for all columns, 
> and then the variable-length part. 
> For example, when the column is a BinaryString, we can serialize the pos and 
> length, calculate the new variable part length, and then serialize the next 
> column. If there is a generic type in the row, it will fall back to the 
> previous way. But generic types in SQL are rare. 
> This is a general improvement and can benefit every operator. 
> If this can be done, then {{GenericRow}} is always the best choice for 
> producers, and {{BinaryRow}} is always the best choice for consumers. For 
> example, consider constructing a GenericRow or BinaryRow from existing 
> {{(String, Integer, Long)}} fields and serializing it to the network. The 
> GenericRow can simply wrap the {{(String, Integer, Long)}} values and 
> serialize them to the network directly with only one memory copy. However, 
> BinaryRow will copy the {{(String, Integer, Long)}} fields into a byte[] and 
> then copy the byte[] to the network, which involves two memory copies. 
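
The three passes described in the issue (null bits, fixed-length part, variable-length part) could be sketched roughly as follows. Everything here is an assumption for illustration: `DirectWriteSketch` is a hypothetical class, the layout (int length prefix, one 8-byte null bitset, one 8-byte slot per field, strings stored as `offset << 32 | length` pointing into a trailing variable part) is a toy format rather than Flink's real one, and plain `String` values are UTF-8 encoded up front, whereas the issue assumes a BinaryString that already holds its bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: write a row of Long / String / null values directly. */
public class DirectWriteSketch {

    static void serialize(Object[] fields, DataOutputStream out) throws IOException {
        int fixedSize = 8 + 8 * fields.length; // null bitset + one slot per field

        // Pre-scan: collect null bits and the byte length of every string.
        byte[][] varParts = new byte[fields.length][];
        long nullBits = 0L;
        int varSize = 0;
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] == null) {
                nullBits |= 1L << i;
            } else if (fields[i] instanceof String) {
                varParts[i] = ((String) fields[i]).getBytes(StandardCharsets.UTF_8);
                varSize += varParts[i].length;
            }
        }
        out.writeInt(fixedSize + varSize); // assumed length prefix

        // Pass 1: null bit part for all columns.
        out.writeLong(nullBits);

        // Pass 2: fixed-length part for all columns.
        int offset = fixedSize; // where the next variable-length value will start
        for (int i = 0; i < fields.length; i++) {
            if (varParts[i] != null) {
                out.writeLong(((long) offset << 32) | varParts[i].length);
                offset += varParts[i].length;
            } else if (fields[i] instanceof Long) {
                out.writeLong((Long) fields[i]);
            } else if (fields[i] == null) {
                out.writeLong(0L);
            } else {
                // The issue would fall back to the convert-then-copy path here.
                throw new IllegalArgumentException("unsupported type at field " + i);
            }
        }

        // Pass 3: variable-length part, in field order.
        for (byte[] v : varParts) {
            if (v != null) {
                out.write(v);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        serialize(new Object[] {"ab", null, 5L}, new DataOutputStream(bos));
        System.out.println(bos.size() + " bytes written");
    }
}
```

Note that this serializer now has to know the layout itself, which is exactly the design trade-off raised in the comment above.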



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
