gortiz commented on code in PR #13303:
URL: https://github.com/apache/pinot/pull/13303#discussion_r1670135123
##########
pinot-common/src/main/java/org/apache/pinot/common/datablock/BaseDataBlock.java:
##########
@@ -426,168 +401,202 @@ public Map<Integer, String> getExceptions() {
return _errCodeToExceptionMap;
}
- /**
- * Serialize this data block to a byte array.
- * <p>
- * In order to deserialize it, {@link
DataBlockUtils#getDataBlock(ByteBuffer)} should be used.
- */
@Override
- public byte[] toBytes()
+ public List<ByteBuffer> serialize()
throws IOException {
- ThreadResourceUsageProvider threadResourceUsageProvider = new
ThreadResourceUsageProvider();
-
- UnsynchronizedByteArrayOutputStream byteArrayOutputStream = new
UnsynchronizedByteArrayOutputStream(8192);
- DataOutputStream dataOutputStream = new
DataOutputStream(byteArrayOutputStream);
- writeLeadingSections(dataOutputStream);
-
- // Write metadata: length followed by actual metadata bytes.
- // NOTE: We ignore metadata serialization time in
"responseSerializationCpuTimeNs" as it's negligible while
- // considering it will bring a lot code complexity.
- serializeMetadata(dataOutputStream);
-
- return byteArrayOutputStream.toByteArray();
- }
-
- private void writeLeadingSections(DataOutputStream dataOutputStream)
- throws IOException {
- dataOutputStream.writeInt(getDataBlockVersionType());
- dataOutputStream.writeInt(_numRows);
- dataOutputStream.writeInt(_numColumns);
- int dataOffset = HEADER_SIZE;
-
- // Write exceptions section offset(START|SIZE).
- dataOutputStream.writeInt(dataOffset);
- byte[] exceptionsBytes;
- exceptionsBytes = serializeExceptions();
- dataOutputStream.writeInt(exceptionsBytes.length);
- dataOffset += exceptionsBytes.length;
-
- // Write dictionary map section offset(START|SIZE).
- dataOutputStream.writeInt(dataOffset);
- byte[] dictionaryBytes = null;
- if (_stringDictionary != null) {
- dictionaryBytes = serializeStringDictionary();
- dataOutputStream.writeInt(dictionaryBytes.length);
- dataOffset += dictionaryBytes.length;
- } else {
- dataOutputStream.writeInt(0);
- }
-
- // Write data schema section offset(START|SIZE).
- dataOutputStream.writeInt(dataOffset);
- byte[] dataSchemaBytes = null;
- if (_dataSchema != null) {
- dataSchemaBytes = _dataSchema.toBytes();
- dataOutputStream.writeInt(dataSchemaBytes.length);
- dataOffset += dataSchemaBytes.length;
- } else {
- dataOutputStream.writeInt(0);
+ if (_serialized == null) {
+ _serialized = DataBlockUtils.serialize(this);
Review Comment:
I think the opposite. Having serdes as separate classes:
1. Simplifies the class. BaseDataBlock is very complex, by removing the
serialization/deserialization code we simplify this class.
2. Separates responsibilities.
* DataBlock defines how to use the block in the OpChain. On a cluster
running different versions of Pinot, the memory layout of a data block may be
different.
* DataBlockSerde defines the layout used to send the blocks through the
network. Different Pinot instances may support different block versions, but in
order to be able to talk to each other they have to have at least one version
in common.
3. Opens the possibility of having different implementations.
* Imagine BaseDataBlock includes the serde code. If we want to try a new
implementation, we need to modify the code, destroying the older code. That
means it is more difficult to test or benchmark the new version.
* In the first commits of this PR I included a class that implemented the
V1_V2 protocol version but using almost the same code we use right now in
master. I also kept the old `toBytes` code. It was easy to benchmark each
implementation by having these options. I decided to remove these classes to
make this PR simpler, but during development it was useful.
* Imagine we create a new serde for the current V1_V2 version that is
more efficient. The code does not support it right now but it is very easy to
be able to configure which serde we want to use for each version with
PinotConfiguration. This means we can try to use one serde by default but
change to the other in case we find a bug.
> I think keeping it within the class is easier for version control.
I don't understand this. Obviously this PR destroys git history. But once we
start using this mechanism there will be no difference in terms of git.
Imagine that we add a V3 that serializes blocks using Apache Arrow or
Parquet. V1_V2 and this theoretical V3 would be super different.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]