This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git


The following commit(s) were added to refs/heads/master by this push:
     new 473124d636 [doc] Document file format for orc avro csv
473124d636 is described below

commit 473124d6369f19a629b1c97fca1fd1d626c39dfb
Author: JingsongLi <jingsongl...@gmail.com>
AuthorDate: Wed Aug 20 14:03:10 2025 +0800

    [doc] Document file format for orc avro csv
---
 docs/content/concepts/spec/fileformat.md | 350 ++++++++++++++++++++++++++++++-
 1 file changed, 341 insertions(+), 9 deletions(-)

diff --git a/docs/content/concepts/spec/fileformat.md b/docs/content/concepts/spec/fileformat.md
index 9ad6164740..98798f7da0 100644
--- a/docs/content/concepts/spec/fileformat.md
+++ b/docs/content/concepts/spec/fileformat.md
@@ -26,10 +26,10 @@ under the License.
 
 # File Format
 
-Currently, supports Parquet, Avro, ORC, JSON, CSV file formats.
+Currently, Paimon supports the Parquet, Avro, ORC, CSV, and JSON file formats.
 - Recommended column format is Parquet, which has a high compression rate and fast column projection queries.
 - Recommended row based format is Avro, which has good performance in reading and writing full row (all columns).
-- Recommended testing format is JSON, which has better readability but the worst storage and read-write performance.
+- Recommended testing format is CSV, which has better readability but the worst read-write performance.
 
 ## PARQUET
 
@@ -142,25 +142,357 @@ The following table lists the type mapping from Paimon type to Parquet type.
 Limitations:
 1. [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps).
 
+## AVRO
+
+The following table lists the type mapping from Paimon type to Avro type.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Paimon type</th>
+        <th class="text-left">Avro type</th>
+        <th class="text-left">Avro logical type</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td>CHAR / VARCHAR / STRING</td>
+      <td>string</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>BOOLEAN</code></td>
+      <td><code>boolean</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>BINARY / VARBINARY</code></td>
+      <td><code>bytes</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>DECIMAL</code></td>
+      <td><code>bytes</code></td>
+      <td><code>decimal</code></td>
+    </tr>
+    <tr>
+      <td><code>TINYINT</code></td>
+      <td><code>int</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>SMALLINT</code></td>
+      <td><code>int</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>INT</code></td>
+      <td><code>int</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>BIGINT</code></td>
+      <td><code>long</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>FLOAT</code></td>
+      <td><code>float</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>DOUBLE</code></td>
+      <td><code>double</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>DATE</code></td>
+      <td><code>int</code></td>
+      <td><code>date</code></td>
+    </tr>
+    <tr>
+      <td><code>TIME</code></td>
+      <td><code>int</code></td>
+      <td><code>time-millis</code></td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP</code></td>
+      <td>P <= 3: long, P <= 6: long, P > 6: unsupported</td>
+      <td>P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported</td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP_LOCAL_ZONE</code></td>
+      <td>P <= 3: long, P <= 6: long, P > 6: unsupported</td>
+      <td>P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported</td>
+    </tr>
+    <tr>
+      <td><code>ARRAY</code></td>
+      <td><code>array</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>MAP</code><br>
+      (key must be string/char/varchar type)</td>
+      <td><code>map</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>MULTISET</code><br>
+      (element must be string/char/varchar type)</td>
+      <td><code>map</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>ROW</code></td>
+      <td><code>record</code></td>
+      <td></td>
+    </tr>
+    </tbody>
+</table>
+
+For nullable types, Paimon maps the type to Avro `union(something, null)`,
+where `something` is the Avro type converted from the Paimon type.
+
+You can refer to [Avro Specification](https://avro.apache.org/docs/1.12.0/specification/) for more information about Avro types.
+
 ## ORC
 
-TODO
+The following table lists the type mapping from Paimon type to ORC type.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Paimon type</th>
+        <th class="text-center">ORC physical type</th>
+        <th class="text-center">ORC logical type</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td>CHAR</td>
+      <td>bytes</td>
+      <td>CHAR</td>
+    </tr>
+    <tr>
+      <td>VARCHAR</td>
+      <td>bytes</td>
+      <td>VARCHAR</td>
+    </tr>
+    <tr>
+      <td>STRING</td>
+      <td>bytes</td>
+      <td>STRING</td>
+    </tr>
+    <tr>
+      <td>BOOLEAN</td>
+      <td>long</td>
+      <td>BOOLEAN</td>
+    </tr>
+    <tr>
+      <td>BYTES</td>
+      <td>bytes</td>
+      <td>BINARY</td>
+    </tr>
+    <tr>
+      <td>DECIMAL</td>
+      <td>decimal</td>
+      <td>DECIMAL</td>
+    </tr>
+    <tr>
+      <td>TINYINT</td>
+      <td>long</td>
+      <td>BYTE</td>
+    </tr>
+    <tr>
+      <td>SMALLINT</td>
+      <td>long</td>
+      <td>SHORT</td>
+    </tr>
+    <tr>
+      <td>INT</td>
+      <td>long</td>
+      <td>INT</td>
+    </tr>
+    <tr>
+      <td>BIGINT</td>
+      <td>long</td>
+      <td>LONG</td>
+    </tr>
+    <tr>
+      <td>FLOAT</td>
+      <td>double</td>
+      <td>FLOAT</td>
+    </tr>
+    <tr>
+      <td>DOUBLE</td>
+      <td>double</td>
+      <td>DOUBLE</td>
+    </tr>
+    <tr>
+      <td>DATE</td>
+      <td>long</td>
+      <td>DATE</td>
+    </tr>
+    <tr>
+      <td>TIMESTAMP</td>
+      <td>timestamp</td>
+      <td>TIMESTAMP</td>
+    </tr>
+    <tr>
+      <td>TIMESTAMP_LOCAL_ZONE</td>
+      <td>timestamp</td>
+      <td>TIMESTAMP_INSTANT</td>
+    </tr>
+    <tr>
+      <td>ARRAY</td>
+      <td>-</td>
+      <td>LIST</td>
+    </tr>
+    <tr>
+      <td>MAP</td>
+      <td>-</td>
+      <td>MAP</td>
+    </tr>
+    <tr>
+      <td>ROW</td>
+      <td>-</td>
+      <td>STRUCT</td>
+    </tr>
+    </tbody>
+</table>
 
 Limitations:
 1. ORC has a time zone bias when mapping `TIMESTAMP_LOCAL_ZONE` type, saving the millis value corresponding to the UTC
    literal time. Due to compatibility issues, this behavior cannot be modified.
 
-## AVRO
+## CSV
 
-TODO
+Experimental feature, not recommended for production.
 
-## JSON
+Format Options:
 
-Experimental feature, not recommended for production.
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left" style="width: 25%">Option</th>
+        <th class="text-center" style="width: 7%">Default</th>
+        <th class="text-center" style="width: 10%">Type</th>
+        <th class="text-center" style="width: 42%">Description</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><h5>csv.field-delimiter</h5></td>
+      <td style="word-wrap: break-word;"><code>,</code></td>
+      <td>String</td>
+      <td>Field delimiter character (<code>','</code> by default), must be a single character. You can use backslash to specify special characters, e.g. <code>'\t'</code> represents the tab character.
+      </td>
+    </tr>
+    <tr>
+      <td><h5>csv.line-delimiter</h5></td>
+      <td style="word-wrap: break-word;"><code>\n</code></td>
+      <td>String</td>
+      <td>The line delimiter for CSV format</td>
+    </tr>
+    <tr>
+      <td><h5>csv.quote-character</h5></td>
+      <td style="word-wrap: break-word;"><code>"</code></td>
+      <td>String</td>
+      <td>Quote character for enclosing field values (<code>"</code> by default).</td>
+    </tr>
+    <tr>
+      <td><h5>csv.escape-character</h5></td>
+      <td style="word-wrap: break-word;">\</td>
+      <td>String</td>
+      <td>The escape character for CSV format.</td>
+    </tr>
+   <tr>
+      <td><h5>csv.include-header</h5></td>
+      <td style="word-wrap: break-word;">false</td>
+      <td>Boolean</td>
+      <td>Whether to include header in CSV files.</td>
+    </tr>
+    <tr>
+      <td><h5>csv.null-literal</h5></td>
+      <td style="word-wrap: break-word;">(none)</td>
+      <td>String</td>
+      <td>Null literal string that is interpreted as a null value (disabled by default).</td>
+    </tr>
+    </tbody>
+</table>
 
-TODO
+Paimon CSV format uses the [Jackson databind API](https://github.com/FasterXML/jackson-databind) to parse and generate CSV strings.
 
-## CSV
+The following table lists the type mapping from Paimon type to CSV type.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Paimon type</th>
+        <th class="text-left">CSV type</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><code>CHAR / VARCHAR / STRING</code></td>
+      <td><code>string</code></td>
+    </tr>
+    <tr>
+      <td><code>BOOLEAN</code></td>
+      <td><code>boolean</code></td>
+    </tr>
+    <tr>
+      <td><code>BINARY / VARBINARY</code></td>
+      <td><code>string with encoding: base64</code></td>
+    </tr>
+    <tr>
+      <td><code>DECIMAL</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>TINYINT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>SMALLINT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>INT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>BIGINT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>FLOAT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>DOUBLE</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>DATE</code></td>
+      <td><code>string with format: date</code></td>
+    </tr>
+    <tr>
+      <td><code>TIME</code></td>
+      <td><code>string with format: time</code></td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP</code></td>
+      <td><code>string with format: date-time</code></td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP_LOCAL_ZONE</code></td>
+      <td><code>string with format: date-time</code></td>
+    </tr>
+    </tbody>
+</table>
+
+## JSON
 
 Experimental feature, not recommended for production.
 
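The AVRO type mapping and the nullable rule documented in the diff above can be illustrated with a small sketch. This is not Paimon's converter: the names `SIMPLE_TYPES`, `timestamp_schema`, `nullable`, and `to_avro` are hypothetical, only a subset of the table rows is covered, and the union branch order simply follows the `union(something, null)` notation used in the doc. The logical type names use the Avro specification's spelling (`timestamp-millis`, `timestamp-micros`) where the table writes `timestampMillis` and `timestampMicros`.

```python
# Hypothetical sketch of the documented Paimon -> Avro mapping; not Paimon's actual converter.

# (Avro type, Avro logical type) for a subset of the rows in the AVRO table.
SIMPLE_TYPES = {
    "STRING": ("string", None),
    "BOOLEAN": ("boolean", None),
    "TINYINT": ("int", None),
    "SMALLINT": ("int", None),
    "INT": ("int", None),
    "BIGINT": ("long", None),
    "FLOAT": ("float", None),
    "DOUBLE": ("double", None),
    "DATE": ("int", "date"),
    "TIME": ("int", "time-millis"),
}


def timestamp_schema(precision):
    """TIMESTAMP(p): millis for p <= 3, micros for p <= 6, otherwise unsupported."""
    if precision <= 3:
        return {"type": "long", "logicalType": "timestamp-millis"}
    if precision <= 6:
        return {"type": "long", "logicalType": "timestamp-micros"}
    raise ValueError(f"TIMESTAMP({precision}) has no Avro mapping")


def nullable(schema):
    """Wrap a schema as union(something, null); the branch order is an assumption."""
    return [schema, "null"]


def to_avro(paimon_type, precision=None, is_nullable=True):
    """Convert one Paimon type name to an Avro schema fragment (JSON-style)."""
    if paimon_type in ("TIMESTAMP", "TIMESTAMP_LOCAL_ZONE"):
        schema = timestamp_schema(precision)
    else:
        avro_type, logical = SIMPLE_TYPES[paimon_type]
        if logical is None:
            schema = avro_type
        else:
            schema = {"type": avro_type, "logicalType": logical}
    return nullable(schema) if is_nullable else schema
```

For example, `to_avro("TIMESTAMP", precision=6)` yields a union of a `long` carrying the `timestamp-micros` logical type and `null`, while `to_avro("TIMESTAMP", precision=9)` raises an error because precisions above 6 are listed as unsupported.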

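The CSV format options and type mapping added in this commit can be sketched with Python's standard `csv` module. This is only an approximation under stated assumptions: Paimon uses Jackson, whose quoting and escaping rules may differ from Python's; the helper names `encode_field` and `write_csv` are hypothetical; and writing an empty field for null when `csv.null-literal` is unset, as well as the `true`/`false` spelling of booleans, are assumptions rather than documented behavior.

```python
import base64
import csv
import io

# Defaults copied from the Format Options table in the diff.
DEFAULTS = {
    "csv.field-delimiter": ",",
    "csv.line-delimiter": "\n",
    "csv.quote-character": '"',
    "csv.escape-character": "\\",
    "csv.include-header": False,
    "csv.null-literal": None,  # "(none)" in the table
}


def encode_field(value, null_literal):
    """Encode one value following the CSV type mapping table (assumed details noted)."""
    if value is None:
        # Assumption: an unset null literal produces an empty field.
        return "" if null_literal is None else null_literal
    if isinstance(value, bytes):
        # BINARY / VARBINARY: string with base64 encoding.
        return base64.b64encode(value).decode("ascii")
    if isinstance(value, bool):
        # Assumption: booleans are spelled in lowercase.
        return "true" if value else "false"
    # Numbers and date/time values fall back to their string form.
    return str(value)


def write_csv(header, rows, overrides=None):
    """Render rows as CSV text using the documented option names."""
    opts = {**DEFAULTS, **(overrides or {})}
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter=opts["csv.field-delimiter"],
        quotechar=opts["csv.quote-character"],
        escapechar=opts["csv.escape-character"],
        lineterminator=opts["csv.line-delimiter"],
        quoting=csv.QUOTE_MINIMAL,
    )
    if opts["csv.include-header"]:
        writer.writerow(header)
    for row in rows:
        writer.writerow([encode_field(v, opts["csv.null-literal"]) for v in row])
    return buf.getvalue()
```

Passing `{"csv.field-delimiter": "\t"}` as `overrides` switches the output to tab-separated fields, matching the `'\t'` example in the option description.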