[
https://issues.apache.org/jira/browse/FLINK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jark Wu updated FLINK-16296:
----------------------------
Description:
Currently, serializing a {{GenericRow}} with
{{BaseRowSerializer#serialize()}} takes two memory copies. The first is
GenericRow -> BinaryRow, the second is BinaryRow -> DataOutputView.
However, in theory, we can serialize a GenericRow into the DataOutputView
directly, because we already have all the column values and types. We can
serialize the null-bit part for all columns, then the fixed-length part for
all columns, and then the variable-length part.
For example, when the column is a BinaryString, we can serialize its position
and length, calculate the new variable-length part size, and then serialize
the next column. If there is a generic type in the row, serialization falls
back to the previous approach. But generic types are rare in SQL.
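The direct path described above could look roughly like the following sketch.
It is only an illustration under stated assumptions: the class name
DirectRowWriter, the one-bit-per-column null mask, and the 8-byte slot that
packs (offset << 32 | length) are a simplified layout chosen for this example,
not the exact BinaryRow format, and java.io.DataOutputStream stands in for
DataOutputView. Encoding a String to UTF-8 here stands in for a BinaryString
that already holds its bytes.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Flink: writes a row straight to the output
// in three passes (null bits, fixed-length part, variable-length part).
class DirectRowWriter {

    static void serialize(Object[] fields, DataOutputStream out) throws IOException {
        int arity = fields.length;
        int nullBytes = (arity + 7) / 8;        // one null bit per column
        int fixedSize = nullBytes + 8 * arity;  // one 8-byte slot per column

        // Pass 1: build and write the null-bit part; collect variable-length bytes.
        byte[] nullBits = new byte[nullBytes];
        byte[][] varParts = new byte[arity][];
        for (int i = 0; i < arity; i++) {
            Object f = fields[i];
            if (f == null) {
                nullBits[i / 8] |= (byte) (1 << (i % 8));
            } else if (f instanceof String) {
                varParts[i] = ((String) f).getBytes(StandardCharsets.UTF_8);
            }
        }
        out.write(nullBits);

        // Pass 2: write the fixed-length part. Variable-length columns store
        // their offset (from the start of the row) and length in the slot.
        int varOffset = fixedSize;
        for (int i = 0; i < arity; i++) {
            Object f = fields[i];
            if (f == null) {
                out.writeLong(0L);
            } else if (f instanceof Long) {
                out.writeLong((Long) f);
            } else if (f instanceof Integer) {
                out.writeLong(((Integer) f).longValue());
            } else if (f instanceof String) {
                int len = varParts[i].length;
                out.writeLong(((long) varOffset << 32) | len);
                varOffset += len;
            } else {
                // Generic types are not handled here; a real implementation
                // would fall back to the existing two-step path instead.
                throw new IllegalArgumentException("unsupported field type: " + f.getClass());
            }
        }

        // Pass 3: write the variable-length part in column order.
        for (byte[] part : varParts) {
            if (part != null) {
                out.write(part);
            }
        }
    }
}
```

In this sketch the row is never materialized as an intermediate binary
buffer; each value is written to the output once.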
This is a general improvement and can benefit every operator.
If this can be done, then {{GenericRow}} is always the best choice for
producers, and {{BinaryRow}} is always the best choice for consumers. For
example, consider constructing a GenericRow or a BinaryRow from existing
{{String, Integer, Long}} fields and serializing it to the network. The
GenericRow can simply wrap the {{String, Integer, Long}} values and serialize
them to the network directly with only one memory copy. The BinaryRow,
however, copies the {{String, Integer, Long}} fields into a byte[] and then
copies that byte[] to the network, which takes two memory copies.
was:
Currently, serializing a {{GenericRow}} with
{{BaseRowSerializer#serialize()}} takes two memory copies. The first is
GenericRow -> BinaryRow, the second is BinaryRow -> DataOutputView.
However, in theory, we can serialize a GenericRow into the DataOutputView
directly, because we already have all the column values and types. We can
serialize the null-bit part for all columns, then the fixed-length part for
all columns, and then the variable-length part.
For example, when the column is a BinaryString, we can serialize its position
and length, calculate the new variable-length part size, and then serialize
the next column. If there is a generic type in the row, serialization falls
back to the previous approach. But generic types are rare in SQL.
This is a general improvement and can benefit every operator.
If this can be done, then {{GenericRow}} is always the best choice for
producers, and {{BinaryRow}} is always the best choice for consumers.
> Improve performance of BaseRowSerializer#serialize() for GenericRow
> -------------------------------------------------------------------
>
> Key: FLINK-16296
> URL: https://issues.apache.org/jira/browse/FLINK-16296
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Runtime
> Reporter: Jark Wu
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)