[
https://issues.apache.org/jira/browse/HUDI-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764444#comment-17764444
]
Ethan Guo commented on HUDI-6824:
---------------------------------
Scope of Concerns
- Endianness - the order in which the bytes of a multi-byte value are stored
- Serialized bytes that may change across platforms, languages, architectures,
machines, etc.
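The endianness concern can be illustrated with a minimal JDK-only sketch (hypothetical class name, not Hudi code): the same 32-bit value lays out as different byte sequences depending on the byte order chosen, so any serialization that does not fix the order is architecture dependent.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: the same int produces different byte layouts under
// big-endian vs little-endian ordering.
public class ByteOrderDemo {
    public static byte[] toBytes(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        int v = 0x0A0B0C0D;
        byte[] big = toBytes(v, ByteOrder.BIG_ENDIAN);       // 0x0A first
        byte[] little = toBytes(v, ByteOrder.LITTLE_ENDIAN); // 0x0D first
        System.out.printf("big: %02x.., little: %02x..%n", big[0], little[0]);
    }
}
```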
Log Format
- Serialization is done by the HoodieLogFormatWriter , which uses Hadoop's
FSDataOutputStream extending Java's DataOutputStream , which writes in
big-endian order only
- The following calls are used
-- outputStream.writeLong : Writes a long to the underlying output stream as
eight bytes, high byte first.
-- outputStream.writeInt : Writes an int to the underlying output stream as
four bytes, high byte first.
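A small JDK-only sketch (hypothetical class name) showing that DataOutputStream writes multi-byte values high byte first on every platform, which is why these calls are already byte-order independent:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Sketch: DataOutputStream always serializes big-endian,
// regardless of the machine's native byte order.
public class EndianDemo {
    public static byte[] writeIntBigEndian(int value) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(baos);
            out.writeInt(value); // four bytes, high byte first
            return baos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeIntBigEndian(0x0A0B0C0D);
        // High byte 0x0A comes first on every platform.
        System.out.println(Arrays.toString(bytes)); // [10, 11, 12, 13]
    }
}
```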
- Log block header and footer are turned into byte arrays from the String value
of each type
-- HoodieLogBlock.getLogMetadataBytes
-- String#getBytes() is used for encoding, which uses the platform's default
charset. The default character encoding scheme on Windows is ANSI, while the
default character encoding scheme on Linux is UTF-8. This should be fixed to
use a predetermined charset (e.g., UTF-8).
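A minimal sketch of the proposed fix (hypothetical helper name, not the actual Hudi code): pin the charset to UTF-8 instead of relying on the platform default, so the encoded header/footer bytes are identical across OSes.

```java
import java.nio.charset.StandardCharsets;

// Sketch: encode metadata values with a fixed charset so the byte
// output does not depend on the platform's default encoding.
public class CharsetDemo {
    // Hypothetical helper mirroring what getLogMetadataBytes should do.
    public static byte[] metadataValueToBytes(String value) {
        // Platform-dependent (today): value.getBytes()
        // Deterministic (the fix): always UTF-8
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = metadataValueToBytes("20230914120000");
        System.out.println(bytes.length); // 14 ASCII digits -> 14 bytes
    }
}
```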
- Content bytes of each log block type
-- HoodieDataBlock
--- Avro/CDC: serialized using Avro's BinaryEncoder and
GenericDatumWriter<IndexedRecord> . Version, size info are written with
outputStream.writeInt
--- Parquet: serialized using parquet writer HoodieParquetStreamWriter or
HoodieSparkParquetStreamWriter (using HoodieRowParquetWriteSupport )
--- HFile: serialized using HFile writer HFile.Writer . Each record value is
serialized by Avro (`HoodieAvroUtils.indexedRecordToBytes`). Schema is
serialized by String.getBytes()
-- HoodieDeleteBlock
--- delete records are serialized using Avro's
DatumWriter<HoodieDeleteRecordList> and BinaryEncoder
-- HoodieCommandBlock
--- no content
- Log Block metadata
-- INSTANT_TIME : String in instant time format
-- TARGET_INSTANT_TIME : String in instant time format
-- SCHEMA : Avro schema in JSON String
-- COMMAND_BLOCK_TYPE : Integer/id in String, representing the type in
HoodieCommandBlockTypeEnum
-- COMPACTED_BLOCK_TIMES : Comma-separated list of instant times
-- RECORD_POSITIONS : String of Base64-encoded bytes after using portable
serialization on Roaring64NavigableMap
-- BLOCK_SEQUENCE_NUMBER : writeAttemptNumber,blockSequenceNumber
Metadata in Base File
- Bloom Filter in parquet footer
-- Record key in Java String is encoded to bytes using UTF_8
-- Simple bloom filter: Base64 encoded String from byte array; the byte array
is generated by using Hadoop BloomFilter's serialization
(`SimpleBloomFilter#serializeToString`)
-- Dynamic bounded bloom filter: a few meta info encoded in integers followed
by multiple bloom filters, each of which is a Base64 encoded String from byte
array; the byte array is generated by using Hadoop BloomFilter's serialization
(`HoodieDynamicBoundedBloomFilter#serializeToString`)
-- The Hadoop BloomFilter uses only plain Java data structures, so we can copy
the code out if we want to remove the "dependency" on Hadoop's BloomFilter
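A rough sketch of the serializeToString pattern above (hypothetical class, not the actual Hudi/Hadoop code): the filter's internal state is serialized to a byte array and Base64-encoded so it can be stored as a String in the parquet footer. Base64 itself operates on bytes and is byte-order independent; portability then hinges only on how the byte array is produced.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: Base64 round trip of an already-serialized byte array,
// as used when storing a bloom filter as a String footer value.
public class BloomSerDemo {
    public static String toBase64(byte[] serialized) {
        return Base64.getEncoder().encodeToString(serialized);
    }

    public static byte[] fromBase64(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Stand-in for the filter's serialized internal state.
        byte[] state = "example-filter-bytes".getBytes(StandardCharsets.UTF_8);
        String footerValue = toBase64(state);
        byte[] roundTrip = fromBase64(footerValue);
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```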
MDT
- Bloom Filter (`HoodieMetadataPayload.createBloomFilterMetadataRecord`)
-- Bloom filters are read out into a ByteBuffer and used to create the metadata
payload, which stores the byte array serialized by Avro
- Column Stats (`HoodieMetadataPayload.createColumnStatsRecords`)
-- Min and max values in Java types are wrapped by Avro wrappers and serialized
with Avro
> Make sure serialization of log blocks is language and architecture independent
> ------------------------------------------------------------------------------
>
> Key: HUDI-6824
> URL: https://issues.apache.org/jira/browse/HUDI-6824
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 1.0.0
>
>
> Serialization of log blocks and other information to bytes should be language
> and architecture independent, e.g., there should be no issue around
> big-endian and little-endian.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)