leaves12138 commented on code in PR #9:
URL: https://github.com/apache/paimon-mosaic/pull/9#discussion_r3265356538


##########
docs/cpp-api.html:
##########
@@ -0,0 +1,400 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>C++ API - Mosaic</title>
+    <link rel="stylesheet" href="css/style.css">
+    <script src="js/main.js"></script>
+</head>
+<body>
+    <button class="menu-toggle" aria-label="Menu">&#9776;</button>
+    <div class="overlay"></div>
+
+    <aside class="sidebar">
+        <div class="sidebar-header">
+            <h2>Mosaic</h2>
+            <p>Columnar-bucket hybrid format</p>
+        </div>
+        <nav>
+            <ul>
+                <li><a href="index.html">Home</a></li>
+                <li><a href="design.html">Design</a></li>
+                <li><a href="rust-api.html">Rust API</a></li>
+                <li><a href="java-api.html">Java API</a></li>
+                <li><a href="python-api.html">Python API</a></li>
+                <li><a href="cpp-api.html">C++ API</a></li>
+            </ul>
+        </nav>
+        <div class="sidebar-footer">
+            <button class="theme-toggle">Dark Mode</button>
+        </div>
+    </aside>
+
+    <main class="main">
+        <div class="content">
+            <h1>C++ API</h1>
+            <p class="subtitle">Use Mosaic from C or C++ via the FFI bindings.</p>
+
+            <h2>Overview</h2>
+            <p>
+                The <code>ffi/</code> crate generates a shared library (<code>libmosaic_ffi</code>) and a
+                C header (<code>mosaic.h</code>) via <a href="https://github.com/mozilla/cbindgen">cbindgen</a>.
+                The C++ header (<code>mosaic.hpp</code>) is a hand-written RAII wrapper on top of the C API.
+            </p>
+
+            <h2>Building</h2>
+<pre><code><span class="cmt"># Build the FFI shared library</span>
+cargo build --release -p mosaic-ffi
+
+<span class="cmt"># C header generated at include/mosaic.h</span>
+<span class="cmt"># C++ RAII wrapper:     include/mosaic.hpp (checked in, not generated)</span></code></pre>
+
+            <h2>Linking</h2>
+            <p>Link against the shared library and include the appropriate header:</p>
+<pre><code><span class="cmt"># macOS</span>
+g++ -std=c++17 -I include/ example.cpp \
+    -L target/release -lmosaic_ffi -o example
+
+<span class="cmt"># Linux</span>
+g++ -std=c++17 -I include/ example.cpp \
+    -L target/release -lmosaic_ffi -Wl,-rpath,target/release -o example</code></pre>
+
+            <h2>Writing (C++)</h2>
+            <p>
+                Data is written as Arrow RecordBatches via the
+                <a href="https://arrow.apache.org/docs/format/CDataInterface.html">Arrow C Data Interface</a>.
+                Build your data as Arrow arrays, export via <code>ArrowArray</code> / <code>ArrowSchema</code>,
+                then pass to the writer:
+            </p>
+<pre><code><span class="kw">#include</span> <span class="str">"mosaic.hpp"</span>
+<span class="kw">#include</span> &lt;arrow/api.h&gt;
+<span class="kw">#include</span> &lt;arrow/c/bridge.h&gt;
+
+<span class="kw">int</span> <span class="fn">main</span>() {
+    <span class="kw">try</span> {
+        <span class="cmt">// 1. Set up output stream callbacks</span>
+        <span class="kw">auto</span>* fp = std::fopen(<span class="str">"output.mosaic"</span>, <span class="str">"wb"</span>);
+        <span class="kw">auto</span> file = std::shared_ptr&lt;FILE&gt;(fp, [](<span class="ty">FILE</span>* f) { std::fclose(f); });
+
+        <span class="ty">mosaic</span>::<span class="ty">OutputFile</span> cbs;
+        cbs.write_fn = [file](<span class="kw">const uint8_t</span>* data, <span class="kw">size_t</span> len) -&gt; <span class="kw">int</span> {
+            <span class="kw">size_t</span> written = std::fwrite(data, <span class="num">1</span>, len, file.get());
+            <span class="kw">return</span> (written == len) ? <span class="num">0</span> : <span class="num">-1</span>;
+        };
+        cbs.flush_fn = [file]() -&gt; <span class="kw">int</span> { <span class="kw">return</span> std::fflush(file.get()); };
+        cbs.get_pos_fn = [file]() -&gt; <span class="kw">int64_t</span> { <span class="kw">return</span> std::ftell(file.get()); };
+
+        <span class="cmt">// 2. Build an Arrow RecordBatch and write it</span>
+        arrow::Int32Builder age_builder;
+        arrow::StringBuilder name_builder;
+        arrow::DoubleBuilder score_builder;
+        <span class="kw">for</span> (<span class="kw">int</span> i = <span class="num">0</span>; i &lt; <span class="num">10000</span>; i++) {
+            name_builder.Append(<span class="str">"user_"</span> + std::to_string(i));
+            age_builder.Append(<span class="num">20</span> + (i % <span class="num">50</span>));
+            score_builder.Append(i * <span class="num">1.5</span>);
+        }
+        <span class="kw">auto</span> batch = arrow::RecordBatch::Make(
+            arrow::schema({
+                arrow::field(<span class="str">"name"</span>, arrow::utf8()),
+                arrow::field(<span class="str">"age"</span>, arrow::int32()),
+                arrow::field(<span class="str">"score"</span>, arrow::float64()),
+            }),
+            <span class="num">10000</span>,
+            {name_builder.Finish().ValueOrDie(),
+             age_builder.Finish().ValueOrDie(),
+             score_builder.Finish().ValueOrDie()});
+
+        <span class="cmt">// 3. Export via Arrow C Data Interface, create writer, and write</span>
+        ArrowArray ffi_array;
+        ArrowSchema ffi_schema;
+        arrow::ExportRecordBatch(*batch, &amp;ffi_array, &amp;ffi_schema);
+
+        <span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), &amp;ffi_schema, {
+            .num_buckets = <span class="num">2</span>,

Review Comment:
   The C++ docs still say to compile with `-std=c++17`, but the examples use 
C++ designated initializers, which are a C++20 feature. This initializer is 
also out of declaration order for C++20 (`WriterOptions` declares `compression` 
before `num_buckets`), so strict compilation fails. Please either switch the 
docs/build commands to C++20 and order designators by the struct declaration, 
or avoid designated initializers and use `WriterOptions opts; opts.num_buckets 
= 2; opts.compression = 1; ...` so the examples match the documented C++17 
build command.
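
A minimal sketch of the second option, assuming a stand-in `WriterOptions` struct. The real definition lives in `mosaic.hpp`; only the member order (`compression` before `num_buckets`) is taken from the comment above, while the field types, defaults, and the `make_writer_options` helper are hypothetical. Member-by-member assignment compiles under `-std=c++17` and does not depend on declaration order:

```cpp
#include <cstdint>

// Hypothetical stand-in for mosaic::WriterOptions. Only the member order
// (compression before num_buckets) comes from the review comment; the
// field types and default values are assumptions for illustration.
struct WriterOptions {
    int32_t compression = 0;
    uint32_t num_buckets = 1;
};

// C++17-compatible setup: default-construct, then assign fields by name.
// The assignment order is free, unlike C++20 designated initializers,
// which must follow the declaration order of the members.
inline WriterOptions make_writer_options() {
    WriterOptions opts;
    opts.num_buckets = 2;
    opts.compression = 1;
    return opts;
}
```

If the docs move to C++20 instead, the equivalent initializer for this stand-in would be `WriterOptions{.compression = 1, .num_buckets = 2}`, with designators in declaration order.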



##########
docs/index.html:
##########
@@ -0,0 +1,194 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Mosaic</title>
+    <link rel="stylesheet" href="css/style.css">
+    <script src="js/main.js"></script>
+</head>
+<body>
+    <button class="menu-toggle" aria-label="Menu">&#9776;</button>
+    <div class="overlay"></div>
+
+    <aside class="sidebar">
+        <div class="sidebar-header">
+            <h2>Mosaic</h2>
+            <p>Columnar-bucket hybrid format</p>
+        </div>
+        <nav>
+            <ul>
+                <li><a href="index.html" class="active">Home</a></li>
+                <li><a href="design.html">Design</a></li>
+                <li><a href="rust-api.html">Rust API</a></li>
+                <li><a href="java-api.html">Java API</a></li>
+                <li><a href="python-api.html">Python API</a></li>
+                <li><a href="cpp-api.html">C++ API</a></li>
+            </ul>
+        </nav>
+        <div class="sidebar-footer">
+            <button class="theme-toggle">Dark Mode</button>
+        </div>
+    </aside>
+
+    <main class="main">
+        <div class="content">
+            <h1>Mosaic</h1>
+            <p class="subtitle">A columnar-bucket hybrid format optimized for wide tables.</p>
+
+            <div class="badges">
+                <span class="badge rust">Rust</span>
+                <span class="badge java">Java</span>
+                <span class="badge python">Python</span>
+                <span class="badge cpp">C/C++</span>
+            </div>
+
+            <h2>Overview</h2>
+            <p>
+                Mosaic is a columnar-bucket hybrid format optimized for wide tables (10,000+ columns).
+                Columns are sorted by name and evenly distributed into buckets using range-based assignment,
+                stored column-oriented within each bucket, and independently compressed.
+                This enables efficient projection pushdown at bucket granularity &mdash;
+                reading 10 columns out of 10,000 only decompresses the buckets that contain those 10 columns.
+                Range-based assignment ensures that columns with similar name prefixes land in the same bucket,
+                improving both compression ratio and projection locality.
+            </p>
+            <p>
+                Mosaic is implemented as a Rust core library with bindings for Java (via JNI),
+                Python (via ctypes FFI), and C/C++ (via FFI),
+                enabling high-performance read and write access across multiple language ecosystems.
+            </p>
+
+            <h2>Key Features</h2>
+            <div class="features">
+                <div class="feature">
+                    <h3>Columnar-Bucket Hybrid</h3>
+                    <p>Columns sorted by name are distributed into buckets via range-based assignment, enabling projection pushdown at bucket granularity. Similar name prefixes land in the same bucket.</p>
+                </div>
+                <div class="feature">
+                    <h3>Adaptive Encoding</h3>
+                    <p>Each column is automatically encoded as ALL_NULL, CONST, DICT, or PLAIN based on its data distribution.</p>
+                </div>
+                <div class="feature">
+                    <h3>Zstd Compression</h3>
+                    <p>Optional Zstandard compression per bucket and schema block, with configurable compression level. Each bucket is independently compressed.</p>
+                </div>
+                <div class="feature">
+                    <h3>BPE Name Compression</h3>
+                    <p>Byte Pair Encoding compresses column names in the schema block, reducing metadata overhead for wide tables.</p>
+                </div>
+                <div class="feature">
+                    <h3>Rich Type System</h3>
+                    <p>18 data types from Boolean to TimestampLtz, with support for fixed-width and variable-length encodings.</p>
+                </div>
+                <div class="feature">
+                    <h3>Multi-Language</h3>
+                    <p>Rust core with Java JNI bindings, Python ctypes bindings, and C/C++ FFI headers. Write once in Rust, use everywhere.</p>
+                </div>
+            </div>
+
+            <h2>Supported Types</h2>
+            <table>
+                <thead>
+                    <tr><th>Type</th><th>Width</th><th>Description</th></tr>
+                </thead>
+                <tbody>
+                    <tr><td><code>Boolean</code></td><td>1</td><td>true / false</td></tr>
+                    <tr><td><code>TinyInt</code></td><td>1</td><td>Signed 8-bit integer</td></tr>
+                    <tr><td><code>SmallInt</code></td><td>2</td><td>Signed 16-bit integer</td></tr>
+                    <tr><td><code>Integer</code></td><td>4</td><td>Signed 32-bit integer</td></tr>
+                    <tr><td><code>BigInt</code></td><td>8</td><td>Signed 64-bit integer</td></tr>
+                    <tr><td><code>Float</code></td><td>4</td><td>32-bit IEEE 754</td></tr>
+                    <tr><td><code>Double</code></td><td>8</td><td>64-bit IEEE 754</td></tr>
+                    <tr><td><code>Date</code></td><td>4</td><td>Days since epoch</td></tr>
+                    <tr><td><code>Time</code></td><td>4</td><td>Milliseconds since midnight</td></tr>
+                    <tr><td><code>Char(n)</code></td><td>variable</td><td>Fixed-length string</td></tr>
+                    <tr><td><code>VarChar(n)</code></td><td>variable</td><td>Variable-length string with max length</td></tr>
+                    <tr><td><code>String</code></td><td>variable</td><td>Unbounded UTF-8 string</td></tr>
+                    <tr><td><code>Binary(n)</code></td><td>variable</td><td>Fixed-length byte array</td></tr>
+                    <tr><td><code>VarBinary(n)</code></td><td>variable</td><td>Variable-length byte array with max length</td></tr>
+                    <tr><td><code>Bytes</code></td><td>variable</td><td>Unbounded byte array</td></tr>
+                    <tr><td><code>Decimal(p, s)</code></td><td>8 or variable</td><td>Exact numeric; compact (p&le;18) or large</td></tr>
+                    <tr><td><code>Timestamp(p)</code></td><td>8 or 12</td><td>Millis (p&le;3) or millis + nanos (p&gt;3)</td></tr>

Review Comment:
   This Timestamp summary does not match the implementation: 
`Timestamp(Millisecond)` and `Timestamp(Microsecond)` are both stored as 8-byte 
values, while only precision > 6 uses the 12-byte `{millis, nanos_of_milli}` 
representation. The detailed design page says this correctly; the home page 
should say something like `millis (p <= 3), micros (p <= 6), or millis + nanos 
(p > 6)`.
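
The width rule described above can be sketched as a small helper. `timestamp_width_bytes` is a hypothetical name for illustration and not part of the Mosaic API; it only encodes the thresholds stated in this comment (8 bytes up to precision 6, 12 bytes above):

```cpp
#include <cstddef>

// Hypothetical helper, not part of the Mosaic API: maps a timestamp
// precision p to the storage width described in the review comment.
//   p <= 3 : 8-byte millis
//   p <= 6 : 8-byte micros
//   p >  6 : 12-byte {millis, nanos_of_milli}
constexpr std::size_t timestamp_width_bytes(int precision) {
    return precision <= 6 ? 8 : 12;
}

// Compile-time checks of the three cases named in the comment.
static_assert(timestamp_width_bytes(3) == 8, "millisecond precision is 8 bytes");
static_assert(timestamp_width_bytes(6) == 8, "microsecond precision is 8 bytes");
static_assert(timestamp_width_bytes(9) == 12, "nanosecond precision is 12 bytes");
```

Under the home page's current wording, `Timestamp(6)` would be read as 12 bytes, which is the mismatch this comment flags.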



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
