This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git


The following commit(s) were added to refs/heads/master by this push:
     new b118e63b7 [doc] Document Spec: table index and file index
b118e63b7 is described below

commit b118e63b78ac2b01922814aa016c4b24f69079fa
Author: Jingsong <[email protected]>
AuthorDate: Wed Aug 7 14:22:48 2024 +0800

    [doc] Document Spec: table index and file index
---
 docs/content/concepts/spec/fileindex.md            | 138 +++++++++++++++++++++
 docs/content/concepts/spec/indexfile.md            |  42 -------
 docs/content/concepts/spec/snapshot.md             |   2 +-
 docs/content/concepts/spec/tableindex.md           |  57 +++++++++
 docs/static/img/deletion-file.png                  | Bin 0 -> 1160387 bytes
 .../bloomfilter/BloomFilterFileIndex.java          |   2 +-
 6 files changed, 197 insertions(+), 44 deletions(-)

diff --git a/docs/content/concepts/spec/fileindex.md b/docs/content/concepts/spec/fileindex.md
new file mode 100644
index 000000000..6a8169aef
--- /dev/null
+++ b/docs/content/concepts/spec/fileindex.md
@@ -0,0 +1,138 @@
+---
+title: "File Index"
+weight: 7
+type: docs
+aliases:
+- /concepts/spec/fileindex.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# File index
+
+When `file-index.${index_type}.columns` is defined, Paimon creates a corresponding index file for each data file. If the index
+file is small enough, it is stored directly in the manifest; otherwise it is stored in the directory of the data file. Each data file
+corresponds to one index file, which has its own file format and can contain different types of indexes on
+multiple columns.
+
+## Index File
+
+File index file format. All columns and their index offsets are stored in the header.
+
+<pre>
+  _____________________________________    _____________________
+|     magic    |version|head length |
+|-------------------------------------|
+|            column number            |
+|-------------------------------------|
+|   column 1        | index number   |
+|-------------------------------------|
+|  index name 1 |start pos |length  |
+|-------------------------------------|
+|  index name 2 |start pos |length  |
+|-------------------------------------|
+|  index name 3 |start pos |length  |
+|-------------------------------------|            HEAD
+|   column 2        | index number   |
+|-------------------------------------|
+|  index name 1 |start pos |length  |
+|-------------------------------------|
+|  index name 2 |start pos |length  |
+|-------------------------------------|
+|  index name 3 |start pos |length  |
+|-------------------------------------|
+|                 ...                 |
+|-------------------------------------|
+|                 ...                 |
+|-------------------------------------|
+|  redundant length |redundant bytes |
+|-------------------------------------|    ---------------------
+|                BODY                 |
+|                BODY                 |
+|                BODY                 |             BODY
+|                BODY                 |
+|_____________________________________|    _____________________
+*
+magic:                            8 bytes long, value is 1493475289347502L, BIG_ENDIAN
+version:                          4 bytes int, BIG_ENDIAN
+head length:                      4 bytes int, BIG_ENDIAN
+column number:                    4 bytes int, BIG_ENDIAN
+column x name:                    2 bytes short BIG_ENDIAN and Java modified-utf-8
+index number:                     4 bytes int (how many column items below), BIG_ENDIAN
+index name x:                     2 bytes short BIG_ENDIAN and Java modified-utf-8
+start pos:                        4 bytes int, BIG_ENDIAN
+length:                           4 bytes int, BIG_ENDIAN
+redundant length:                 4 bytes int (for compatibility with later versions; in this version, the value is zero)
+redundant bytes:                  var bytes (for compatibility with later versions; in this version, it is empty)
+BODY:                             column index bytes + column index bytes + column index bytes + .......
+</pre>
+
+## Column Index Bytes: BloomFilter
+
+Define `'file-index.bloom-filter.columns'`.
+
+The content of the bloom filter index is simple:
+- numHashFunctions: 4 bytes int, BIG_ENDIAN
+- bloom filter bytes
+
+This index uses a 64-bit long hash and stores only the number of hash functions (one integer) and the bit set bytes. Bytes types
+(such as varchar and binary) are hashed with xxHash; numeric types are hashed with a [specified number hash](http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm).
+
+## Column Index Bytes: Bitmap
+
+Define `'file-index.bitmap.columns'`.
+
+Bitmap file index format (V1):
+
+<pre>
+Bitmap file index format (V1)
++-------------------------------------------------+-----------------
+| version (1 byte)                               |
++-------------------------------------------------+
+| row count (4 bytes int)                        |
++-------------------------------------------------+
+| non-null value bitmap number (4 bytes int)     |
++-------------------------------------------------+
+| has null value (1 byte)                        |
++-------------------------------------------------+
+| null value offset (4 bytes if has null value)  |       HEAD
++-------------------------------------------------+
+| value 1 | offset 1                             |
++-------------------------------------------------+
+| value 2 | offset 2                             |
++-------------------------------------------------+
+| value 3 | offset 3                             |
++-------------------------------------------------+
+| ...                                            |
++-------------------------------------------------+-----------------
+| serialized bitmap 1                            |
++-------------------------------------------------+
+| serialized bitmap 2                            |
++-------------------------------------------------+       BODY
+| serialized bitmap 3                            |
++-------------------------------------------------+
+| ...                                            |
++-------------------------------------------------+-----------------
+*
+value x:                       var bytes for any data type (as bitmap identifier)
+offset:                        4 bytes int (when it is negative, it represents that there is only one value
+                                 and its position is the inverse of the negative value)
+</pre>
+
+All integers are BIG_ENDIAN.
diff --git a/docs/content/concepts/spec/indexfile.md b/docs/content/concepts/spec/indexfile.md
deleted file mode 100644
index cfcbcade9..000000000
--- a/docs/content/concepts/spec/indexfile.md
+++ /dev/null
@@ -1,42 +0,0 @@
----
-title: "IndexFile"
-weight: 6
-type: docs
-aliases:
-- /concepts/spec/indexfile.html
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# IndexFile
-
-## Global Index
-
-Global Index is in the index directory, currently, only two places will use global index:
-
-1. bucket = -1 + primary key table: in dynamic bucket mode, the index records the correspondence between the hash value
-   of the primary-key and the bucket, each bucket has an index file.
-2. Deletion Vectors: index stores the deletion file, and each bucket has a deletion file.
-
-## Data File Index
-
-Define `file-index.bloom-filter.columns`, Paimon will create its corresponding index file for each file. If the index
-file is too small, it will be stored directly in the manifest, or in the directory of the data file. Each data file
-corresponds to an index file, which has a separate file definition and can contain different types of indexes with
-multiple columns.
diff --git a/docs/content/concepts/spec/snapshot.md b/docs/content/concepts/spec/snapshot.md
index 5c0f58ac4..d10598272 100644
--- a/docs/content/concepts/spec/snapshot.md
+++ b/docs/content/concepts/spec/snapshot.md
@@ -53,7 +53,7 @@ Snapshot File is JSON, it includes:
 4. baseManifestList: a manifest list recording all changes from the previous snapshots.
 5. deltaManifestList: a manifest list recording all new changes occurred in this snapshot.
 6. changelogManifestList: a manifest list recording all changelog produced in this snapshot, null if no changelog is produced.
-7. indexManifest: a manifest recording all index files of this table, null if no index file.
+7. indexManifest: a manifest recording all index files of this table, null if no table index file.
 8. commitUser: usually generated by UUID, it is used for recovery of streaming writes, one stream write job with one user.
 9. commitIdentifier: transaction id corresponding to streaming write, each transaction may result in multiple commits for different commitKinds.
 10. commitKind: type of changes in this snapshot, including append, compact, overwrite and analyze.
diff --git a/docs/content/concepts/spec/tableindex.md b/docs/content/concepts/spec/tableindex.md
new file mode 100644
index 000000000..e88f9e6d3
--- /dev/null
+++ b/docs/content/concepts/spec/tableindex.md
@@ -0,0 +1,57 @@
+---
+title: "Table Index"
+weight: 6
+type: docs
+aliases:
+- /concepts/spec/tableindex.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Table index
+
+Table index files are in the `index` directory.
+
+## Dynamic Bucket Index
+
+The dynamic bucket index stores the correspondence between the hash value of the primary key and the bucket.
+
+Its structure is very simple, only storing hash values in the file:
+
+HASH_VALUE | HASH_VALUE | HASH_VALUE | HASH_VALUE | ...
+
+HASH_VALUE is the hash value of the primary key. 4 bytes, BIG_ENDIAN.
+
+## Deletion Vectors
+
+A deletion file stores the positions of deleted records for each data file. For a primary key table, each bucket has one
+deletion file.
+
+{{< img src="/img/deletion-file.png">}}
+
+The deletion file is a binary file, and the format is as follows:
+
+- First, the version is recorded in one byte. The current version is 1.
+- Then, <size of serialized bin, serialized bin, checksum of serialized bin> entries are recorded in sequence.
+- Size and checksum are BIG_ENDIAN integers.
+
+For each serialized bin:
+
+- First, a constant magic number is recorded as an int (BIG_ENDIAN). The current magic number is 1581511376.
+- Then, the serialized bitmap is recorded, which is a [RoaringBitmap](https://github.com/RoaringBitmap/RoaringBitmap) (org.roaringbitmap.RoaringBitmap).
diff --git a/docs/static/img/deletion-file.png b/docs/static/img/deletion-file.png
new file mode 100644
index 000000000..e66aa4361
Binary files /dev/null and b/docs/static/img/deletion-file.png differ
diff --git a/paimon-common/src/main/java/org/apache/paimon/fileindex/bloomfilter/BloomFilterFileIndex.java b/paimon-common/src/main/java/org/apache/paimon/fileindex/bloomfilter/BloomFilterFileIndex.java
index ce7827a98..3c9dcadba 100644
--- a/paimon-common/src/main/java/org/apache/paimon/fileindex/bloomfilter/BloomFilterFileIndex.java
+++ b/paimon-common/src/main/java/org/apache/paimon/fileindex/bloomfilter/BloomFilterFileIndex.java
@@ -101,7 +101,7 @@ public class BloomFilterFileIndex implements FileIndexer {
         public byte[] serializedBytes() {
             int numHashFunctions = filter.getNumHashFunctions();
             byte[] serialized = new byte[filter.getBitSet().bitSize() / Byte.SIZE + Integer.BYTES];
-            // little endian
+            // big endian
             serialized[0] = (byte) ((numHashFunctions >>> 24) & 0xFF);
             serialized[1] = (byte) ((numHashFunctions >>> 16) & 0xFF);
             serialized[2] = (byte) ((numHashFunctions >>> 8) & 0xFF);

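Note on the comment fix in the last hunk: the shifts in `serializedBytes()` write the most significant byte of `numHashFunctions` first, which is big-endian order, so the corrected comment matches the code. The following is a minimal standalone sketch, not Paimon code. The class `BigEndianIntSketch` and its helper methods are hypothetical, and the fourth byte assignment is an assumption because the hunk above shows only the first three. It compares the manual shifts with `java.nio.ByteBuffer` in explicit big-endian order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration class; not part of the Paimon code base.
class BigEndianIntSketch {

    // Same shift pattern as in the hunk above: most significant byte first.
    // The assignment to out[3] is assumed; the hunk ends before that line.
    static byte[] encodeBigEndian(int value) {
        byte[] out = new byte[Integer.BYTES];
        out[0] = (byte) ((value >>> 24) & 0xFF);
        out[1] = (byte) ((value >>> 16) & 0xFF);
        out[2] = (byte) ((value >>> 8) & 0xFF);
        out[3] = (byte) (value & 0xFF);
        return out;
    }

    // Inverse operation: rebuild the int from four big-endian bytes.
    static int decodeBigEndian(byte[] bytes) {
        return ((bytes[0] & 0xFF) << 24)
                | ((bytes[1] & 0xFF) << 16)
                | ((bytes[2] & 0xFF) << 8)
                | (bytes[3] & 0xFF);
    }

    public static void main(String[] args) {
        int numHashFunctions = 7;
        byte[] manual = encodeBigEndian(numHashFunctions);
        byte[] viaBuffer =
                ByteBuffer.allocate(Integer.BYTES)
                        .order(ByteOrder.BIG_ENDIAN)
                        .putInt(numHashFunctions)
                        .array();
        // Both encodings produce the same four bytes.
        System.out.println(Arrays.equals(manual, viaBuffer)); // prints true
        System.out.println(decodeBigEndian(manual)); // prints 7
    }
}
```

A reader of the bloom filter index bytes can therefore decode the leading 4 bytes with any big-endian integer reader, consistent with the format described in fileindex.md.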