This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
     new 473124d636 [doc] Document file format for orc avro csv
473124d636 is described below

commit 473124d6369f19a629b1c97fca1fd1d626c39dfb
Author: JingsongLi <jingsongl...@gmail.com>
AuthorDate: Wed Aug 20 14:03:10 2025 +0800

    [doc] Document file format for orc avro csv
---
 docs/content/concepts/spec/fileformat.md | 350 ++++++++++++++++++++++++++++++-
 1 file changed, 341 insertions(+), 9 deletions(-)

diff --git a/docs/content/concepts/spec/fileformat.md b/docs/content/concepts/spec/fileformat.md
index 9ad6164740..98798f7da0 100644
--- a/docs/content/concepts/spec/fileformat.md
+++ b/docs/content/concepts/spec/fileformat.md
@@ -26,10 +26,10 @@ under the License.

 # File Format

-Currently, supports Parquet, Avro, ORC, JSON, CSV file formats.
+Currently, supports Parquet, Avro, ORC, CSV, JSON file formats.
 - Recommended column format is Parquet, which has a high compression rate and fast column projection queries.
 - Recommended row based format is Avro, which has good performance n reading and writing full row (all columns).
-- Recommended testing format is JSON, which has better readability but the worst storage and read-write performance.
+- Recommended testing format is CSV, which has better readability but the worst read-write performance.

 ## PARQUET

@@ -142,25 +142,357 @@ The following table lists the type mapping from Paimon type to Parquet type.
 Limitations:
 1. [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps).

+## AVRO
+
+The following table lists the type mapping from Paimon type to Avro type.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Paimon type</th>
+        <th class="text-left">Avro type</th>
+        <th class="text-left">Avro logical type</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td>CHAR / VARCHAR / STRING</td>
+      <td>string</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>BOOLEAN</code></td>
+      <td><code>boolean</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>BINARY / VARBINARY</code></td>
+      <td><code>bytes</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>DECIMAL</code></td>
+      <td><code>bytes</code></td>
+      <td><code>decimal</code></td>
+    </tr>
+    <tr>
+      <td><code>TINYINT</code></td>
+      <td><code>int</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>SMALLINT</code></td>
+      <td><code>int</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>INT</code></td>
+      <td><code>int</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>BIGINT</code></td>
+      <td><code>long</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>FLOAT</code></td>
+      <td><code>float</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>DOUBLE</code></td>
+      <td><code>double</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>DATE</code></td>
+      <td><code>int</code></td>
+      <td><code>date</code></td>
+    </tr>
+    <tr>
+      <td><code>TIME</code></td>
+      <td><code>int</code></td>
+      <td><code>time-millis</code></td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP</code></td>
+      <td>P <= 3: long, P <= 6: long, P > 6: unsupported</td>
+      <td>P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported</td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP_LOCAL_ZONE</code></td>
+      <td>P <= 3: long, P <= 6: long, P > 6: unsupported</td>
+      <td>P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported</td>
+    </tr>
+    <tr>
+      <td><code>ARRAY</code></td>
+      <td><code>array</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>MAP</code><br>
+      (key must be string/char/varchar type)</td>
+      <td><code>map</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>MULTISET</code><br>
+      (element must be string/char/varchar type)</td>
+      <td><code>map</code></td>
+      <td></td>
+    </tr>
+    <tr>
+      <td><code>ROW</code></td>
+      <td><code>record</code></td>
+      <td></td>
+    </tr>
+    </tbody>
+</table>
+
+In addition to the types listed above, for nullable types. Paimon maps nullable types to Avro `union(something, null)`,
+where `something` is the Avro type converted from Paimon type.
+
+You can refer to [Avro Specification](https://avro.apache.org/docs/1.12.0/specification/) for more information about Avro types.
+
 ## ORC

-TODO
+The following table lists the type mapping from Paimon type to Orc type.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Paimon Type</th>
+        <th class="text-center">Orc physical type</th>
+        <th class="text-center">Orc logical type</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td>CHAR</td>
+      <td>bytes</td>
+      <td>CHAR</td>
+    </tr>
+    <tr>
+      <td>VARCHAR</td>
+      <td>bytes</td>
+      <td>VARCHAR</td>
+    </tr>
+    <tr>
+      <td>STRING</td>
+      <td>bytes</td>
+      <td>STRING</td>
+    </tr>
+    <tr>
+      <td>BOOLEAN</td>
+      <td>long</td>
+      <td>BOOLEAN</td>
+    </tr>
+    <tr>
+      <td>BYTES</td>
+      <td>bytes</td>
+      <td>BINARY</td>
+    </tr>
+    <tr>
+      <td>DECIMAL</td>
+      <td>decimal</td>
+      <td>DECIMAL</td>
+    </tr>
+    <tr>
+      <td>TINYINT</td>
+      <td>long</td>
+      <td>BYTE</td>
+    </tr>
+    <tr>
+      <td>SMALLINT</td>
+      <td>long</td>
+      <td>SHORT</td>
+    </tr>
+    <tr>
+      <td>INT</td>
+      <td>long</td>
+      <td>INT</td>
+    </tr>
+    <tr>
+      <td>BIGINT</td>
+      <td>long</td>
+      <td>LONG</td>
+    </tr>
+    <tr>
+      <td>FLOAT</td>
+      <td>double</td>
+      <td>FLOAT</td>
+    </tr>
+    <tr>
+      <td>DOUBLE</td>
+      <td>double</td>
+      <td>DOUBLE</td>
+    </tr>
+    <tr>
+      <td>DATE</td>
+      <td>long</td>
+      <td>DATE</td>
+    </tr>
+    <tr>
+      <td>TIMESTAMP</td>
+      <td>timestamp</td>
+      <td>TIMESTAMP</td>
+    </tr>
+    <tr>
+      <td>TIMESTAMP_LOCAL_ZONE</td>
+      <td>timestamp</td>
+      <td>TIMESTAMP_INSTANT</td>
+    </tr>
+    <tr>
+      <td>ARRAY</td>
+      <td>-</td>
+      <td>LIST</td>
+    </tr>
+    <tr>
+      <td>MAP</td>
+      <td>-</td>
+      <td>MAP</td>
+    </tr>
+    <tr>
+      <td>ROW</td>
+      <td>-</td>
+      <td>STRUCT</td>
+    </tr>
+    </tbody>
+</table>

 Limitations:
 1. ORC has a time zone bias when mapping `TIMESTAMP_LOCAL_ZONE` type, saving the millis value corresponding to the UTC literal time. Due to compatibility issues, this behavior cannot be modified.

-## AVRO
+## CSV

-TODO
+Experimental feature, not recommended for production.

-## JSON
+Format Options:

-Experimental feature, not recommended for production.
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left" style="width: 25%">Option</th>
+        <th class="text-center" style="width: 7%">Default</th>
+        <th class="text-center" style="width: 10%">Type</th>
+        <th class="text-center" style="width: 42%">Description</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><h5>csv.field-delimiter</h5></td>
+      <td style="word-wrap: break-word;"><code>,</code></td>
+      <td>String</td>
+      <td>Field delimiter character (<code>','</code> by default), must be single character. You can use backslash to specify special characters, e.g. <code>'\t'</code> represents the tab character.
+      </td>
+    </tr>
+    <tr>
+      <td><h5>csv.line-delimiter</h5></td>
+      <td style="word-wrap: break-word;"><code>\n</code></td>
+      <td>String</td>
+      <td>The line delimiter for CSV format</td>
+    </tr>
+    <tr>
+      <td><h5>csv.quote-character</h5></td>
+      <td style="word-wrap: break-word;"><code>"</code></td>
+      <td>String</td>
+      <td>Quote character for enclosing field values (<code>"</code> by default).</td>
+    </tr>
+    <tr>
+      <td><h5>csv.escape-character</h5></td>
+      <td style="word-wrap: break-word;">\</td>
+      <td>String</td>
+      <td>The escape character for CSV format.</td>
+    </tr>
+    <tr>
+      <td><h5>csv.include-header</h5></td>
+      <td style="word-wrap: break-word;">false</td>
+      <td>Boolean</td>
+      <td>Whether to include header in CSV files.</td>
+    </tr>
+    <tr>
+      <td><h5>csv.null-literal</h5></td>
+      <td style="word-wrap: break-word;">(none)</td>
+      <td>String</td>
+      <td>Null literal string that is interpreted as a null value (disabled by default).</td>
+    </tr>
+    </tbody>
+</table>

-TODO
+Paimon CSV format uses [jackson databind API](https://github.com/FasterXML/jackson-databind) to parse and generate CSV string.

-## CSV
+The following table lists the type mapping from Paimon type to CSV type.
+
+<table class="table table-bordered">
+    <thead>
+      <tr>
+        <th class="text-left">Paimon type</th>
+        <th class="text-left">CSV type</th>
+      </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><code>CHAR / VARCHAR / STRING</code></td>
+      <td><code>string</code></td>
+    </tr>
+    <tr>
+      <td><code>BOOLEAN</code></td>
+      <td><code>boolean</code></td>
+    </tr>
+    <tr>
+      <td><code>BINARY / VARBINARY</code></td>
+      <td><code>string with encoding: base64</code></td>
+    </tr>
+    <tr>
+      <td><code>DECIMAL</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>TINYINT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>SMALLINT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>INT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>BIGINT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>FLOAT</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>DOUBLE</code></td>
+      <td><code>number</code></td>
+    </tr>
+    <tr>
+      <td><code>DATE</code></td>
+      <td><code>string with format: date</code></td>
+    </tr>
+    <tr>
+      <td><code>TIME</code></td>
+      <td><code>string with format: time</code></td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP</code></td>
+      <td><code>string with format: date-time</code></td>
+    </tr>
+    <tr>
+      <td><code>TIMESTAMP_LOCAL_ZONE</code></td>
+      <td><code>string with format: date-time</code></td>
+    </tr>
+    </tbody>
+</table>
+
+## JSON

 Experimental feature, not recommended for production.
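
For readers skimming the diff, the Avro mapping it documents (primitive types, the precision rule for timestamps, and the `union(something, null)` wrapping for nullable types) can be restated as a small lookup. The following is only an illustrative sketch in Python, not Paimon code: the names `AVRO_PRIMITIVES`, `timestamp_to_avro` and `describe` are hypothetical, the default precision of 6 is an assumption of the sketch, and the logical-type labels are copied as written in the table rather than checked against the Avro specification. Complex types (`ARRAY`, `MAP`, `MULTISET`, `ROW`) are left out.

```python
# Illustrative sketch only; none of these names exist in Paimon.
# Entries are copied from the Avro table in the diff:
# Paimon type -> (Avro type, Avro logical type label or None).
AVRO_PRIMITIVES = {
    "CHAR": ("string", None),
    "VARCHAR": ("string", None),
    "STRING": ("string", None),
    "BOOLEAN": ("boolean", None),
    "BINARY": ("bytes", None),
    "VARBINARY": ("bytes", None),
    "DECIMAL": ("bytes", "decimal"),
    "TINYINT": ("int", None),
    "SMALLINT": ("int", None),
    "INT": ("int", None),
    "BIGINT": ("long", None),
    "FLOAT": ("float", None),
    "DOUBLE": ("double", None),
    "DATE": ("int", "date"),
    "TIME": ("int", "time-millis"),
}


def timestamp_to_avro(precision):
    """Apply the precision rule from the TIMESTAMP rows of the table."""
    if precision <= 3:
        return ("long", "timestampMillis")
    if precision <= 6:
        return ("long", "timestampMicros")
    raise ValueError(f"TIMESTAMP({precision}) is listed as unsupported")


def describe(paimon_type, nullable=False, precision=6):
    """Describe the Avro mapping of a Paimon type as plain Python data.

    The default precision of 6 is an assumption made for this sketch.
    """
    if paimon_type in ("TIMESTAMP", "TIMESTAMP_LOCAL_ZONE"):
        avro_type, logical = timestamp_to_avro(precision)
    else:
        # Raises KeyError for complex types, which this sketch omits.
        avro_type, logical = AVRO_PRIMITIVES[paimon_type]
    mapped = {"type": avro_type, "logical": logical}
    # Nullable types are wrapped in a union with null. The member order
    # mirrors the wording "union(something, null)" and is an assumption.
    return {"union": [mapped, "null"]} if nullable else mapped
```

For example, `describe("DATE", nullable=True)` yields a union whose first member is the `int`/`date` pair, while `describe("TIMESTAMP", precision=9)` raises `ValueError`, matching the "P > 6: unsupported" entry.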