[ https://issues.apache.org/jira/browse/HUDI-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764444#comment-17764444 ]

Ethan Guo commented on HUDI-6824:
---------------------------------

Scope of Concerns
 - Endianness - the order in which the bytes of a multi-byte value are stored
 - Serialized bytes that may change across platforms, languages, architectures, 
machines, etc.

Log Format
 - Serialization is done by the HoodieLogFormatWriter , which uses Hadoop's 
FSDataOutputStream (extending Java's DataOutputStream ), which is big-endian only
 - The following calls are used
 -- outputStream.writeLong : Writes a long to the underlying output stream as 
eight bytes, high byte first.
 -- outputStream.writeInt : Writes an int to the underlying output stream as 
four bytes, high byte first.
- Log block header and footer are turned into byte arrays from the String value 
of each type
-- HoodieLogBlock.getLogMetadataBytes 
-- String#getBytes() is used for encoding, which uses the platform's default 
charset. The default character encoding on Windows is ANSI, while on Linux it 
is UTF-8. This should be fixed to use a predetermined charset.
- Content bytes of each log block type
-- HoodieDataBlock
--- Avro/CDC: serialized using Avro's BinaryEncoder and 
GenericDatumWriter<IndexedRecord> . Version, size info are written with 
outputStream.writeInt
--- Parquet: serialized using a parquet writer, HoodieParquetStreamWriter or 
HoodieSparkParquetStreamWriter (using HoodieRowParquetWriteSupport )
--- HFile: serialized using the HFile writer HFile.Writer . Each record value is 
serialized by Avro (`HoodieAvroUtils.indexedRecordToBytes`). The schema is 
serialized by String.getBytes() 
-- HoodieDeleteBlock
--- delete records are serialized using Avro's 
DatumWriter<HoodieDeleteRecordList> and BinaryEncoder 
-- HoodieCommandBlock
--- no content
- Log Block metadata
-- INSTANT_TIME : String in instant time format
-- TARGET_INSTANT_TIME : String in instant time format
-- SCHEMA : Avro schema in JSON String
-- COMMAND_BLOCK_TYPE : Integer/id in String, representing the type in 
HoodieCommandBlockTypeEnum 
-- COMPACTED_BLOCK_TIMES : Comma-separated list of instant times
-- RECORD_POSITIONS : String of Base64-encoded bytes after using portable 
serialization on Roaring64NavigableMap 
-- BLOCK_SEQUENCE_NUMBER : writeAttemptNumber,blockSequenceNumber
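
The big-endian guarantee above comes from DataOutputStream itself, and the 
charset issue is fixed by pinning UTF-8. A minimal stand-alone sketch (class 
and method names below are mine for illustration, not Hudi's):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EndiannessDemo {
    // DataOutputStream always writes big-endian (high byte first),
    // regardless of the host CPU's native byte order, so writeInt/writeLong
    // in the log writer are already portable across architectures.
    static byte[] writeIntBigEndian(int value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(value);
        return buf.toByteArray();
    }

    // String#getBytes() with no argument uses the platform default charset
    // (ANSI on Windows, UTF-8 on Linux); pinning the charset makes the
    // header/footer bytes identical on every OS.
    static byte[] portableBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(writeIntBigEndian(0x01020304))); // [1, 2, 3, 4] everywhere
        System.out.println(Arrays.toString(portableBytes("hudi")));
    }
}
```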

Metadata in Base File
- Bloom Filter in parquet footer
 -- Record keys (Java Strings) are encoded to bytes using UTF_8
 -- Simple bloom filter: a Base64-encoded String from a byte array; the byte 
array is generated using Hadoop BloomFilter's serialization 
(`SimpleBloomFilter#serializeToString`)
 -- Dynamic bounded bloom filter: a few pieces of meta info encoded as integers, 
followed by multiple bloom filters, each of which is a Base64-encoded String 
from a byte array; the byte array is generated using Hadoop BloomFilter's 
serialization (`HoodieDynamicBoundedBloomFilter#serializeToString`)
 -- The Hadoop BloomFilter uses only plain Java data structures, so we can copy 
the code out if we want to remove the "dependency" on Hadoop's BloomFilter
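
As a rough illustration of the serializeToString pattern described above (the 
helper below is a hypothetical stand-in, not the actual Hadoop/Hudi code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BloomSerializationSketch {
    // Hypothetical stand-in for the serializeToString pattern: the real code
    // serializes the Hadoop BloomFilter to a byte array (via big-endian
    // DataOutputStream) and then Base64-encodes that array into a String,
    // which is portable because Base64 output is plain ASCII.
    static String serializeToString(byte[] filterBytes) {
        return Base64.getEncoder().encodeToString(filterBytes);
    }

    static byte[] deserialize(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Record keys are encoded with an explicit UTF-8 charset before being
        // hashed into the filter, so the input bytes match on every platform.
        byte[] keyBytes = "record-key-001".getBytes(StandardCharsets.UTF_8);
        String encoded = serializeToString(keyBytes);
        System.out.println(Arrays.equals(keyBytes, deserialize(encoded))); // true
    }
}
```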

MDT
- Bloom Filter (`HoodieMetadataPayload.createBloomFilterMetadataRecord`)
-- Bloom filters are read into a ByteBuffer and used to create the metadata 
payload, which stores the byte array serialized by Avro
- Column Stats (`HoodieMetadataPayload.createColumnStatsRecords`)
-- Min and max values in Java types are wrapped in Avro wrapper types and 
serialized with Avro
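
The Avro-serialized content above carries no endianness concern because Avro's 
binary encoding emits integers as zig-zag varints, 7 bits at a time, so there 
is no multi-byte word order at all. A minimal sketch of the idea (my own 
illustration, not Avro's or Hudi's code):

```java
import java.io.ByteArrayOutputStream;

public class ZigZagVarIntSketch {
    // Zig-zag maps small magnitudes (positive or negative) to small unsigned
    // values, which the varint loop then emits one byte at a time with a
    // continuation bit, making the encoding identical on every architecture.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63); // zig-zag: 0->0, -1->1, 1->2, -2->3, ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits + continuation bit
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 0 -> [0], -1 -> [1], 1 -> [2], matching the Avro spec's examples.
        System.out.println(encodeLong(1)[0]); // 2
        System.out.println(encodeLong(64).length); // 2 bytes: 0x80, 0x01
    }
}
```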

> Make sure serialization of log blocks is language and architecture independent
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-6824
>                 URL: https://issues.apache.org/jira/browse/HUDI-6824
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> Serialization of log blocks and other information to bytes should be language 
> and architecture independent, e.g., there should be no issue around 
> big-endian and little-endian.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
