This is an automated email from the ASF dual-hosted git repository.
djwang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry-pxf.git
The following commit(s) were added to refs/heads/main by this push:
new 0376ea75 docs - pxf works with ORC using Foreign Data Wrapper (#52)
0376ea75 is described below
commit 0376ea75226a1dbd73f68af8df5278424a3c47af
Author: Nikolay Antonov <[email protected]>
AuthorDate: Thu Feb 5 12:50:26 2026 +0500
docs - pxf works with ORC using Foreign Data Wrapper (#52)
Add examples on how to use PXF foreign data wrappers to access ORC files on
S3/HDFS.
---
.../master_middleman/source/subnavs/pxf-subnav.erb | 4 +-
docs/content/hdfs_orc.html.md.erb | 57 ++++++++++++++++++++--
docs/content/objstore_avro.html.md.erb | 2 +-
docs/content/objstore_orc.html.md.erb | 47 +++++++++++++++++-
4 files changed, 103 insertions(+), 7 deletions(-)
diff --git a/docs/book/master_middleman/source/subnavs/pxf-subnav.erb
b/docs/book/master_middleman/source/subnavs/pxf-subnav.erb
index f2d178c3..fbf2c28f 100644
--- a/docs/book/master_middleman/source/subnavs/pxf-subnav.erb
+++ b/docs/book/master_middleman/source/subnavs/pxf-subnav.erb
@@ -64,7 +64,7 @@
<li><a href="/pxf/WIP/hdfs_text.html" format="markdown">Reading and
Writing Text Data</a></li>
<li><a href="/pxf/WIP/hdfs_avro.html" format="markdown">Reading and
Writing Avro Data</a></li>
<li><a href="/pxf/WIP/hdfs_json.html" format="markdown">Reading JSON
Data</a></li>
- <li><a href="/pxf/WIP/hdfs_orc.html" format="markdown">Reading ORC
Data</a></li>
+ <li><a href="/pxf/WIP/hdfs_orc.html" format="markdown">Reading and
Writing ORC Data</a></li>
<li><a href="/pxf/WIP/hdfs_parquet.html" format="markdown">Reading
and Writing Parquet Data</a></li>
<li><a href="/pxf/WIP/hdfs_seqfile.html" format="markdown">Reading
and Writing SequenceFile Data</a></li>
<li><a href="/pxf/WIP/hdfs_fileasrow.html" format="markdown">Reading
a Multi-Line Text File into a Single Table Row</a></li>
@@ -79,7 +79,7 @@
<li><a href="/pxf/WIP/objstore_text.html" format="markdown">Reading
and Writing Text Data</a></li>
<li><a href="/pxf/WIP/objstore_avro.html" format="markdown">Reading
and Writing Avro Data</a></li>
<li><a href="/pxf/WIP/objstore_json.html" format="markdown">Reading
JSON Data</a></li>
- <li><a href="/pxf/WIP/objstore_orc.html" format="markdown">Reading
ORC Data</a></li>
+ <li><a href="/pxf/WIP/objstore_orc.html" format="markdown">Reading
and Writing ORC Data</a></li>
<li><a href="/pxf/WIP/objstore_parquet.html"
format="markdown">Reading and Writing Parquet Data</a></li>
<li><a href="/pxf/WIP/objstore_seqfile.html"
format="markdown">Reading and Writing SequenceFile Data</a></li>
<li><a href="/pxf/WIP/objstore_fileasrow.html"
format="markdown">Reading a Multi-Line Text File into a Single Table
Row</a></li>
diff --git a/docs/content/hdfs_orc.html.md.erb
b/docs/content/hdfs_orc.html.md.erb
index f429ee2b..69b248cc 100644
--- a/docs/content/hdfs_orc.html.md.erb
+++ b/docs/content/hdfs_orc.html.md.erb
@@ -34,7 +34,7 @@ ORC is type-aware and specifically designed for Hadoop
workloads. ORC files stor
ORC also supports predicate pushdown with built-in indexes at the file,
stripe, and row levels, moving the filter operation to the data loading phase.
-Refer to the [Apache orc](https://orc.apache.org/docs/) documentation for
detailed information about the ORC file format.
+Refer to the [Apache ORC](https://orc.apache.org/docs/) documentation for
detailed information about the ORC file format.
## <a id="datatype_map"></a>Data Type Mapping
@@ -137,7 +137,7 @@ FORMAT 'CUSTOM'
(FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
-The specific keywords and values used in the Greenplum Database [CREATE
EXTERNAL
TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html)
command are described below.
+The specific keywords and values used in the Apache Cloudberry [CREATE
EXTERNAL
TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-external-table)
command are described below.
| Keyword | Value |
|-------|-------------------------------------|
@@ -148,6 +148,31 @@ The specific keywords and values used in the Greenplum
Database [CREATE EXTERNAL
| FORMAT 'CUSTOM' | Use `FORMAT 'CUSTOM'` with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum
Database table into the writable external table, consider specifying the same
distribution policy or `<column_name>` on both tables. Doing so will avoid
extra motion of data between segments on the load operation. |
+## <a id="profile_cfdw"></a>Creating the Foreign Table
+
+The PXF `hdfs_pxf_fdw` foreign data wrapper supports reading and writing ORC-formatted HDFS files. When you insert records into a foreign table, the block(s) of data that you insert are written to one file per segment in the directory that you specify in the `resource` option.
+
+Use the following syntax to create an Apache Cloudberry foreign table that
references an HDFS file or directory:
+
+``` sql
+CREATE SERVER <foreign_server> FOREIGN DATA WRAPPER hdfs_pxf_fdw;
+CREATE USER MAPPING FOR <user_name> SERVER <foreign_server>;
+
+CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
+ ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+ SERVER <foreign_server>
+ OPTIONS ( resource '<path-to-hdfs-file>', format 'orc' [, <custom-option> '<value>' [, ...] ]);
+```
+
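+A single foreign table supports both read and write operations; unlike the external table syntax, you do not create separate readable and writable tables.
+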
+The specific keywords and values used in the Apache Cloudberry [CREATE FOREIGN
TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-foreign-table)
command are described below.
+
+| Keyword | Value |
+|-------|-------------------------------------|
+| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
+| resource \<path‑to‑hdfs‑file\> | The path to the file or directory in the HDFS data store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path‑to‑hdfs‑file\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path‑to‑hdfs‑file\> must not specify a relative path nor include the dollar sign (`$`) character. |
+| format 'orc' | The file format; specify `'orc'` for ORC-formatted data. |
+| \<custom-option\> | \<custom-option\>s are described below. |
+
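+For example, if the server configuration sets `pxf.fs.basePath` to `/data` (a hypothetical value for illustration), PXF resolves `resource 'pxf_examples/orc_example'` to the absolute path `/data/pxf_examples/orc_example`.
+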
<a id="customopts"></a>
The PXF `hdfs:orc` profile supports the following *read* options. You specify these options in the `LOCATION` clause:
@@ -164,7 +189,7 @@ The PXF `hdfs:orc` profile supports a single
compression-related write option; y
## <a id="write"></a>About Writing ORC data
-When you insert records into a writable external table, the block(s) of data
that you insert are written to one or more files in the directory that you
specify in the `LOCATION` clause.
+When you insert records into a writable external table or foreign table, the block(s) of data that you insert are written to one or more files in the directory that you specify in the `LOCATION` clause or `resource` option.
When you insert ORC data records, the `pxf.orc.write.timezone.utc` property in
the `pxf-site.xml` file governs how PXF writes timestamp values to the external
data store. By default, PXF writes a timestamp type using the UTC time zone. If
you require PXF to write a timestamp type using the local time zone of the PXF
JVM, set the `pxf.orc.write.timezone.utc` property to `false` for the server
and synchronize the PXF configuration.
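
For example, to have PXF write timestamps using the local time zone of the PXF JVM, the server's `pxf-site.xml` would contain an entry like the following (a sketch, assuming the standard Hadoop-style property format of PXF server configuration files):

``` xml
<property>
    <name>pxf.orc.write.timezone.utc</name>
    <value>false</value>
</property>
```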
@@ -231,6 +256,19 @@ Procedure:
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
+ Or, create an equivalent foreign table:
+
+ ``` sql
+ testdb=# CREATE SERVER example_orc FOREIGN DATA WRAPPER hdfs_pxf_fdw;
+ testdb=# CREATE USER MAPPING FOR CURRENT_USER SERVER example_orc;
+ testdb=# CREATE FOREIGN TABLE sample_orc(location text, month text,
num_orders int, total_sales numeric(10,2), items_sold text[])
+ SERVER example_orc
+ OPTIONS (
+ resource 'data/pxf_examples/orc_example',
+ format 'orc'
+ );
+ ```
+
1. Read the data in the file by querying the `sample_orc` table:
``` sql
@@ -288,6 +326,19 @@ In this example, you create a writable external table to
write some data to the
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```
+ Or, if you did not already create the `example_orc` server and `sample_orc` foreign table in the previous example, create them now; the same foreign table also accepts inserts:
+
+ ``` sql
+ testdb=# CREATE SERVER example_orc FOREIGN DATA WRAPPER hdfs_pxf_fdw;
+ testdb=# CREATE USER MAPPING FOR CURRENT_USER SERVER example_orc;
+ testdb=# CREATE FOREIGN TABLE sample_orc(location text, month text,
num_orders int, total_sales numeric(10,2), items_sold text[])
+ SERVER example_orc
+ OPTIONS (
+ resource 'data/pxf_examples/orc_example',
+ format 'orc'
+ );
+ ```
+
1. Write a few records to segment files in the `orc_example` directory by inserting into the `write_to_sample_orc` table (or, if you are using the foreign table, into `sample_orc`):
``` sql
diff --git a/docs/content/objstore_avro.html.md.erb
b/docs/content/objstore_avro.html.md.erb
index 708c1df8..2b742e22 100644
--- a/docs/content/objstore_avro.html.md.erb
+++ b/docs/content/objstore_avro.html.md.erb
@@ -102,7 +102,7 @@ CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
| Keyword | Value |
|-------|-------------------------------------|
-| \<foreign_server\> | The named server configuration that PXF uses to
access the data. You can override credentials in `CREATE SERVER` statement as
described in [Overriding the S3 Server Configuration for External
Tables](access_s3.html#s3_override_fdw) |
+| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
| resource \<path‑to‑file\> | The path to the directory or file
in the object store. When the `<server_name>` configuration includes a
[`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF
considers \<path‑to‑file\> to be relative to the base path
specified. Otherwise, PXF considers it to be an absolute path.
\<path‑to‑file\> must not specify a relative path nor include the
dollar sign (`$`) character. |
| format 'avro' | File format specification. |
| \<custom‑option\>=\<value\> | Avro-specific custom options are
described in the [PXF HDFS Avro documentation](hdfs_avro.html#customopts). |
diff --git a/docs/content/objstore_orc.html.md.erb
b/docs/content/objstore_orc.html.md.erb
index d24f2f0b..069fb578 100644
--- a/docs/content/objstore_orc.html.md.erb
+++ b/docs/content/objstore_orc.html.md.erb
@@ -39,7 +39,7 @@ FORMAT 'CUSTOM'
(FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
-The specific keywords and values used in the Greenplum Database [CREATE
EXTERNAL
TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html)
command are described in the table below.
+The specific keywords and values used in the Apache Cloudberry [CREATE
EXTERNAL
TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-external-table)
command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
@@ -52,6 +52,38 @@ The specific keywords and values used in the Greenplum
Database [CREATE EXTERNAL
If you are accessing an S3 object store, you can provide S3 credentials via
custom options in the `CREATE EXTERNAL TABLE` command as described in
[Overriding the S3 Server Configuration for External
Tables](access_s3.html#s3_override_ext).
+## <a id="orc_cfdw"></a>Creating the Foreign Table
+
+Use one of the following foreign data wrappers with `format 'orc'`:
+
+| Object Store | Foreign Data Wrapper |
+|-------|-------------------------------------|
+| Azure Blob Storage | `wasbs_pxf_fdw` |
+| Azure Data Lake Storage Gen2 | `abfss_pxf_fdw` |
+| Google Cloud Storage | `gs_pxf_fdw` |
+| MinIO | `s3_pxf_fdw` |
+| S3 | `s3_pxf_fdw` |
+
+The following syntax creates an Apache Cloudberry foreign table that references an ORC-format file:
+
+``` sql
+CREATE SERVER <foreign_server> FOREIGN DATA WRAPPER <store>_pxf_fdw;
+CREATE USER MAPPING FOR <user_name> SERVER <foreign_server>;
+
+CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
+ ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+ SERVER <foreign_server>
+ OPTIONS ( resource '<path-to-file>', format 'orc' [, <custom-option>
'<value>' [, ...] ]);
+```
+
+| Keyword | Value |
+|-------|-------------------------------------|
+| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
+| resource \<path‑to‑file\> | The path to the directory or file
in the object store. When the `<server_name>` configuration includes a
[`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF
considers \<path‑to‑file\> to be relative to the base path
specified. Otherwise, PXF considers it to be an absolute path.
\<path‑to‑file\> must not specify a relative path nor include the
dollar sign (`$`) character. |
+| format 'orc' | The file format; specify `'orc'` for ORC-formatted data. |
+| \<custom‑option\>=\<value\> | ORC-specific custom options are
described in the [PXF HDFS ORC documentation](hdfs_orc.html#customopts). |
+
## <a id="example"></a>Example
Refer to [Example: Reading an ORC File on HDFS](hdfs_orc.html#read_example) in
the PXF Hadoop ORC documentation for an example. Modifications that you must
make to run the example with an object store include:
@@ -78,3 +110,16 @@ Refer to [Example: Reading an ORC File on
HDFS](hdfs_orc.html#read_example) in t
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```
+- Or, using the `CREATE FOREIGN TABLE` syntax, create a single foreign table for both read and write operations. For example, if your server name is `s3srvcfg`:
+
+ ``` sql
+ CREATE SERVER s3srvcfg FOREIGN DATA WRAPPER s3_pxf_fdw;
+ CREATE USER MAPPING FOR CURRENT_USER SERVER s3srvcfg;
+
+ CREATE FOREIGN TABLE sample_orc (location TEXT, month TEXT, num_orders
INT, total_sales NUMERIC(10,2), items_sold TEXT[])
+ SERVER s3srvcfg
+ OPTIONS (
+ resource 'BUCKET/pxf_examples/orc_example',
+ format 'orc'
+ );
+ ```
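+
+ Both read and write operations then go through this single foreign table. For example (illustrative values):
+
+ ``` sql
+ -- illustrative data; substitute your own rows
+ INSERT INTO sample_orc VALUES ('Frankfurt', 'Mar', 777, 3956.98, '{"apple","orange"}');
+ SELECT * FROM sample_orc;
+ ```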