leaves12138 commented on code in PR #9: URL: https://github.com/apache/paimon-mosaic/pull/9#discussion_r3265356538
########## docs/cpp-api.html: ########## @@ -0,0 +1,400 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +<!DOCTYPE html> +<html lang="en"> +<head> + <meta charset="UTF-8"> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <title>C++ API - Mosaic</title> + <link rel="stylesheet" href="css/style.css"> + <script src="js/main.js"></script> +</head> +<body> + <button class="menu-toggle" aria-label="Menu">☰</button> + <div class="overlay"></div> + + <aside class="sidebar"> + <div class="sidebar-header"> + <h2>Mosaic</h2> + <p>Columnar-bucket hybrid format</p> + </div> + <nav> + <ul> + <li><a href="index.html">Home</a></li> + <li><a href="design.html">Design</a></li> + <li><a href="rust-api.html">Rust API</a></li> + <li><a href="java-api.html">Java API</a></li> + <li><a href="python-api.html">Python API</a></li> + <li><a href="cpp-api.html">C++ API</a></li> + </ul> + </nav> + <div class="sidebar-footer"> + <button class="theme-toggle">Dark Mode</button> + </div> + </aside> + + <main class="main"> + <div class="content"> + <h1>C++ API</h1> + <p class="subtitle">Use Mosaic from C or C++ via the FFI bindings.</p> + + <h2>Overview</h2> + <p> + The <code>ffi/</code> crate generates a shared library 
(<code>libmosaic_ffi</code>) and a + C header (<code>mosaic.h</code>) via <a href="https://github.com/mozilla/cbindgen">cbindgen</a>. + The C++ header (<code>mosaic.hpp</code>) is a hand-written RAII wrapper on top of the C API. + </p> + + <h2>Building</h2> +<pre><code><span class="cmt"># Build the FFI shared library</span> +cargo build --release -p mosaic-ffi + +<span class="cmt"># C header generated at include/mosaic.h</span> +<span class="cmt"># C++ RAII wrapper: include/mosaic.hpp (checked in, not generated)</span></code></pre> + + <h2>Linking</h2> + <p>Link against the shared library and include the appropriate header:</p> +<pre><code><span class="cmt"># macOS</span> +g++ -std=c++17 -I include/ example.cpp \ + -L target/release -lmosaic_ffi -o example + +<span class="cmt"># Linux</span> +g++ -std=c++17 -I include/ example.cpp \ + -L target/release -lmosaic_ffi -Wl,-rpath,target/release -o example</code></pre> + + <h2>Writing (C++)</h2> + <p> + Data is written as Arrow RecordBatches via the + <a href="https://arrow.apache.org/docs/format/CDataInterface.html">Arrow C Data Interface</a>. + Build your data as Arrow arrays, export via <code>ArrowArray</code> / <code>ArrowSchema</code>, + then pass to the writer: + </p> +<pre><code><span class="kw">#include</span> <span class="str">"mosaic.hpp"</span> +<span class="kw">#include</span> <arrow/api.h> +<span class="kw">#include</span> <arrow/c/bridge.h> + +<span class="kw">int</span> <span class="fn">main</span>() { + <span class="kw">try</span> { + <span class="cmt">// 1. 
Set up output stream callbacks</span> + <span class="kw">auto</span>* fp = std::fopen(<span class="str">"output.mosaic"</span>, <span class="str">"wb"</span>); + <span class="kw">auto</span> file = std::shared_ptr<FILE>(fp, [](<span class="ty">FILE</span>* f) { std::fclose(f); }); + + <span class="ty">mosaic</span>::<span class="ty">OutputFile</span> cbs; + cbs.write_fn = [file](<span class="kw">const uint8_t</span>* data, <span class="kw">size_t</span> len) -> <span class="kw">int</span> { + <span class="kw">size_t</span> written = std::fwrite(data, <span class="num">1</span>, len, file.get()); + <span class="kw">return</span> (written == len) ? <span class="num">0</span> : <span class="num">-1</span>; + }; + cbs.flush_fn = [file]() -> <span class="kw">int</span> { <span class="kw">return</span> std::fflush(file.get()); }; + cbs.get_pos_fn = [file]() -> <span class="kw">int64_t</span> { <span class="kw">return</span> std::ftell(file.get()); }; + + <span class="cmt">// 2. Build an Arrow RecordBatch and write it</span> + arrow::Int32Builder age_builder; + arrow::StringBuilder name_builder; + arrow::DoubleBuilder score_builder; + <span class="kw">for</span> (<span class="kw">int</span> i = <span class="num">0</span>; i < <span class="num">10000</span>; i++) { + name_builder.Append(<span class="str">"user_"</span> + std::to_string(i)); + age_builder.Append(<span class="num">20</span> + (i % <span class="num">50</span>)); + score_builder.Append(i * <span class="num">1.5</span>); + } + <span class="kw">auto</span> batch = arrow::RecordBatch::Make( + arrow::schema({ + arrow::field(<span class="str">"name"</span>, arrow::utf8()), + arrow::field(<span class="str">"age"</span>, arrow::int32()), + arrow::field(<span class="str">"score"</span>, arrow::float64()), + }), + <span class="num">10000</span>, + {name_builder.Finish().ValueOrDie(), + age_builder.Finish().ValueOrDie(), + score_builder.Finish().ValueOrDie()}); + + <span class="cmt">// 3. 
Export via Arrow C Data Interface, create writer, and write</span> + ArrowArray ffi_array; + ArrowSchema ffi_schema; + arrow::ExportRecordBatch(*batch, &ffi_array, &ffi_schema); + + <span class="ty">mosaic</span>::<span class="ty">Writer</span> writer(std::move(cbs), &ffi_schema, { + .num_buckets = <span class="num">2</span>, Review Comment: The C++ docs still say to compile with `-std=c++17`, but the examples use C++ designated initializers, which are a C++20 feature. This initializer is also out of declaration order for C++20 (`WriterOptions` declares `compression` before `num_buckets`), so strict compilation fails. Please either switch the docs/build commands to C++20 and order designators by the struct declaration, or avoid designated initializers and use `WriterOptions opts; opts.num_buckets = 2; opts.compression = 1; ...` so the examples match the documented C++17 build command. ########## docs/index.html: ########## @@ -0,0 +1,194 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. 
+--> + +<!DOCTYPE html> +<html lang="en"> +<head> + <meta charset="UTF-8"> + <meta name="viewport" content="width=device-width, initial-scale=1.0"> + <title>Mosaic</title> + <link rel="stylesheet" href="css/style.css"> + <script src="js/main.js"></script> +</head> +<body> + <button class="menu-toggle" aria-label="Menu">☰</button> + <div class="overlay"></div> + + <aside class="sidebar"> + <div class="sidebar-header"> + <h2>Mosaic</h2> + <p>Columnar-bucket hybrid format</p> + </div> + <nav> + <ul> + <li><a href="index.html" class="active">Home</a></li> + <li><a href="design.html">Design</a></li> + <li><a href="rust-api.html">Rust API</a></li> + <li><a href="java-api.html">Java API</a></li> + <li><a href="python-api.html">Python API</a></li> + <li><a href="cpp-api.html">C++ API</a></li> + </ul> + </nav> + <div class="sidebar-footer"> + <button class="theme-toggle">Dark Mode</button> + </div> + </aside> + + <main class="main"> + <div class="content"> + <h1>Mosaic</h1> + <p class="subtitle">A columnar-bucket hybrid format optimized for wide tables.</p> + + <div class="badges"> + <span class="badge rust">Rust</span> + <span class="badge java">Java</span> + <span class="badge python">Python</span> + <span class="badge cpp">C/C++</span> + </div> + + <h2>Overview</h2> + <p> + Mosaic is a columnar-bucket hybrid format optimized for wide tables (10,000+ columns). + Columns are sorted by name and evenly distributed into buckets using range-based assignment, + stored column-oriented within each bucket, and independently compressed. + This enables efficient projection pushdown at bucket granularity — + reading 10 columns out of 10,000 only decompresses the buckets that contain those 10 columns. + Range-based assignment ensures that columns with similar name prefixes land in the same bucket, + improving both compression ratio and projection locality. 
+ </p> + <p> + Mosaic is implemented as a Rust core library with bindings for Java (via JNI), + Python (via ctypes FFI), and C/C++ (via FFI), + enabling high-performance read and write access across multiple language ecosystems. + </p> + + <h2>Key Features</h2> + <div class="features"> + <div class="feature"> + <h3>Columnar-Bucket Hybrid</h3> + <p>Columns sorted by name are distributed into buckets via range-based assignment, enabling projection pushdown at bucket granularity. Similar name prefixes land in the same bucket.</p> + </div> + <div class="feature"> + <h3>Adaptive Encoding</h3> + <p>Each column is automatically encoded as ALL_NULL, CONST, DICT, or PLAIN based on its data distribution.</p> + </div> + <div class="feature"> + <h3>Zstd Compression</h3> + <p>Optional Zstandard compression per bucket and schema block, with configurable compression level. Each bucket is independently compressed.</p> + </div> + <div class="feature"> + <h3>BPE Name Compression</h3> + <p>Byte Pair Encoding compresses column names in the schema block, reducing metadata overhead for wide tables.</p> + </div> + <div class="feature"> + <h3>Rich Type System</h3> + <p>18 data types from Boolean to TimestampLtz, with support for fixed-width and variable-length encodings.</p> + </div> + <div class="feature"> + <h3>Multi-Language</h3> + <p>Rust core with Java JNI bindings, Python ctypes bindings, and C/C++ FFI headers. 
Write once in Rust, use everywhere.</p> + </div> + </div> + + <h2>Supported Types</h2> + <table> + <thead> + <tr><th>Type</th><th>Width</th><th>Description</th></tr> + </thead> + <tbody> + <tr><td><code>Boolean</code></td><td>1</td><td>true / false</td></tr> + <tr><td><code>TinyInt</code></td><td>1</td><td>Signed 8-bit integer</td></tr> + <tr><td><code>SmallInt</code></td><td>2</td><td>Signed 16-bit integer</td></tr> + <tr><td><code>Integer</code></td><td>4</td><td>Signed 32-bit integer</td></tr> + <tr><td><code>BigInt</code></td><td>8</td><td>Signed 64-bit integer</td></tr> + <tr><td><code>Float</code></td><td>4</td><td>32-bit IEEE 754</td></tr> + <tr><td><code>Double</code></td><td>8</td><td>64-bit IEEE 754</td></tr> + <tr><td><code>Date</code></td><td>4</td><td>Days since epoch</td></tr> + <tr><td><code>Time</code></td><td>4</td><td>Milliseconds since midnight</td></tr> + <tr><td><code>Char(n)</code></td><td>variable</td><td>Fixed-length string</td></tr> + <tr><td><code>VarChar(n)</code></td><td>variable</td><td>Variable-length string with max length</td></tr> + <tr><td><code>String</code></td><td>variable</td><td>Unbounded UTF-8 string</td></tr> + <tr><td><code>Binary(n)</code></td><td>variable</td><td>Fixed-length byte array</td></tr> + <tr><td><code>VarBinary(n)</code></td><td>variable</td><td>Variable-length byte array with max length</td></tr> + <tr><td><code>Bytes</code></td><td>variable</td><td>Unbounded byte array</td></tr> + <tr><td><code>Decimal(p, s)</code></td><td>8 or variable</td><td>Exact numeric; compact (p≤18) or large</td></tr> + <tr><td><code>Timestamp(p)</code></td><td>8 or 12</td><td>Millis (p≤3) or millis + nanos (p>3)</td></tr> Review Comment: This Timestamp summary does not match the implementation: `Timestamp(Millisecond)` and `Timestamp(Microsecond)` are both stored as 8-byte values, while only precision > 6 uses the 12-byte `{millis, nanos_of_milli}` representation. 
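For concreteness, the rule described above can be sketched as below. This is only an illustration of the mapping stated in this comment, not code from the repository: `TimestampLayout`, `TimestampStorage`, and `timestamp_storage` are hypothetical names, and the 8 + 4 byte split of the 12-byte form is an assumption.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; not taken from the Mosaic sources.
enum class TimestampLayout { Millis, Micros, MillisPlusNanos };

struct TimestampStorage {
    TimestampLayout layout;   // which representation is stored
    std::size_t width_bytes;  // bytes per stored value
};

// Maps a timestamp precision p to its storage form as described in the
// review comment: p <= 3 and p <= 6 both use 8 bytes, p > 6 uses 12 bytes.
constexpr TimestampStorage timestamp_storage(int precision) {
    if (precision <= 3) {
        return {TimestampLayout::Millis, 8};  // single 8-byte millisecond value
    }
    if (precision <= 6) {
        return {TimestampLayout::Micros, 8};  // single 8-byte microsecond value
    }
    // {millis, nanos_of_milli}; assumed here to be 8 bytes + 4 bytes.
    return {TimestampLayout::MillisPlusNanos, 12};
}

// Compile-time checks of the boundaries called out in the comment.
static_assert(timestamp_storage(3).width_bytes == 8, "millis fits in 8 bytes");
static_assert(timestamp_storage(6).width_bytes == 8, "micros fits in 8 bytes");
static_assert(timestamp_storage(7).width_bytes == 12, "p > 6 needs 12 bytes");
```

The point of the sketch is that the width changes at p = 6, not p = 3, which is what the home page table row currently gets wrong.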
The detailed design page says this correctly; the home page should say something like `millis (p <= 3), micros (p <= 6), or millis + nanos (p > 6)`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
