This is an automated email from the ASF dual-hosted git repository. djwang pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/cloudberry-pxf.git
commit 5e4b57fe63356c9c616b8318391073c6855b129b Author: Lisa Owen <[email protected]> AuthorDate: Wed Sep 14 16:37:35 2022 -0600 docs - misc updates/additions to the orc write docs (#864) * docs - misc updates/additions to the orc write docs * reword per bradford * more rewording per bradford --- docs/content/hdfs_orc.html.md.erb | 30 ++++++++++++++++++++++-------- docs/content/hdfs_parquet.html.md.erb | 2 +- docs/content/objstore_orc.html.md.erb | 8 ++++---- 3 files changed, 27 insertions(+), 13 deletions(-) diff --git a/docs/content/hdfs_orc.html.md.erb b/docs/content/hdfs_orc.html.md.erb index b424063c..b4e2098c 100644 --- a/docs/content/hdfs_orc.html.md.erb +++ b/docs/content/hdfs_orc.html.md.erb @@ -124,7 +124,7 @@ PXF supports writing the list ORC compound type for one-dimensional arrays, for ## <a id="createexttbl"></a>Creating the External Table -The PXF HDFS connector `hdfs:orc` profile supports reading and writing ORC-formatted HDFS files. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified. +The PXF HDFS connector `hdfs:orc` profile supports reading and writing ORC-formatted HDFS files. When you insert records into a writable external table, the block(s) of data that you insert are written to one file per segment in the directory that you specified. Use the following syntax to create a Greenplum Database external table that references an HDFS file or directory: @@ -156,6 +156,17 @@ The PXF `hdfs:orc` profile supports the following *read* options. You specify th | IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. 
| | MAP_BY_POSITION | A Boolean value that, when set to `true`, specifies that PXF should map an ORC column to a Greenplum Database column by position. The default value is `false`; PXF maps an ORC column to a Greenplum column by name. | +The PXF `hdfs:orc` profile supports a single compression-related write option; you specify this option in the `CREATE WRITABLE EXTERNAL TABLE` `LOCATION` clause: + +| Write Option | Value Description | +|-------|-------------------------------------| +| COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing ORC data include: `lz4`, `lzo`, `zstd`, `snappy`, `zlib`, and `none`. If this option is not specified, PXF compresses the data using `zlib` compression. | + +## <a id="write"></a>About Writing ORC Data + +When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specify in the `LOCATION` clause. + +When you insert ORC data records, the `pxf.orc.write.timezone.utc` property in the `pxf-site.xml` file governs how PXF writes timestamp values to the external data store. By default, PXF writes a timestamp type using the UTC time zone. If you require PXF to write a timestamp type using the local time zone of the PXF JVM, set the `pxf.orc.write.timezone.utc` property to `false` for the server and synchronize the PXF configuration.
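As a sketch of how the new write option composes with the `hdfs:orc` profile syntax shown elsewhere in this topic: the table name `orders_orc_zstd` below is hypothetical, and the HDFS path reuses the `/data/pxf_examples/orc_example` directory from the examples that follow:

``` sql
-- write ORC files compressed with zstd instead of the default zlib
CREATE WRITABLE EXTERNAL TABLE orders_orc_zstd (location text, month text, num_orders int, total_sales numeric(10,2))
  LOCATION ('pxf://data/pxf_examples/orc_example?PROFILE=hdfs:orc&COMPRESSION_CODEC=zstd')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```

Omitting `&COMPRESSION_CODEC=zstd` from the `LOCATION` URL writes `zlib`-compressed files, the default noted in the table above.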
## <a id="read_example"></a>Example: Reading an ORC File on HDFS @@ -184,10 +195,10 @@ Procedure: ``` shell $ echo '{"location": "Prague", "month": "Jan","num_orders": 101, "total_sales": 4875.33, "items_sold": ["boots", "hats"]} - {"location": "Rome", "month": "Mar","num_orders": 87, "total_sales": 1557.39, "items_sold": ["coats"]} - {"location": "Bangalore", "month": "May","num_orders": 317, "total_sales": 8936.99, "items_sold": ["winter socks", "long-sleeved shirts", "boots"]} - {"location": "Beijing", "month": "Jul","num_orders": 411, "total_sales": 11600.67, "items_sold": ["hoodies/sweaters", "pants"]} - {"location": "Los Angeles", "month": "Dec","num_orders": 0, "total_sales": 0.00, "items_sold": null} + {"location": "Rome", "month": "Mar","num_orders": 87, "total_sales": 1557.39, "items_sold": ["coats"]} + {"location": "Bangalore", "month": "May","num_orders": 317, "total_sales": 8936.99, "items_sold": ["winter socks", "long-sleeved shirts", "boots"]} + {"location": "Beijing", "month": "Jul","num_orders": 411, "total_sales": 11600.67, "items_sold": ["hoodies/sweaters", "pants"]} + {"location": "Los Angeles", "month": "Dec","num_orders": 0, "total_sales": 0.00, "items_sold": null} ' > /tmp/sampledata.json ``` @@ -268,7 +279,9 @@ Procedure: ## <a id="write_example"></a>Example: Writing to an ORC File on HDFS -1. Use the `hdfs:orc` profile to create a writable external table that references `sample_orc`: +In this example, you create a writable external table to write some data to the directory referenced by the `sample_orc` table. + +1. Create an external table that specifies the `hdfs:orc` profile and the HDFS directory `/data/pxf_examples/orc_example` in the `LOCATION` URL: ``` sql postgres=# CREATE WRITABLE EXTERNAL TABLE write_to_sample_orc (location text, month text, num_orders int, total_sales numeric(10,2), items_sold text[] ) @@ -276,15 +289,16 @@ Procedure: FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export'); ``` -1. 
Write a few records to segment files in the `orc_example` directory by inserting to the `write_to_sample_orc` table: +1. Write a few records to segment files in the `orc_example` directory by inserting into the `write_to_sample_orc` table: ``` sql postgres=# INSERT INTO write_to_sample_orc VALUES ( 'Frankfurt', 'Mar', 777, 3956.98, '{"winter socks","pants",boots}' ); postgres=# INSERT INTO write_to_sample_orc VALUES ( 'Cleveland', 'Oct', 3218, 96645.37, '{"long-sleeved shirts",hats}' ); ``` -1. Recall that Greenplum Database does not support directly querying a writable external table. Query the `sample_orc` table to read the data you added to files in the `orc_example` directory: +1. Recall that Greenplum Database does not support directly querying a writable external table. Query the `sample_orc` table that you created in the previous example to read the new data that you added: ``` sql postgres=# SELECT * FROM sample_orc ORDER BY num_orders; ``` + diff --git a/docs/content/hdfs_parquet.html.md.erb b/docs/content/hdfs_parquet.html.md.erb index d879abbd..df400d58 100644 --- a/docs/content/hdfs_parquet.html.md.erb +++ b/docs/content/hdfs_parquet.html.md.erb @@ -88,7 +88,7 @@ PXF uses the following data type mapping when writing Parquet data: | Text | utf8 | binary | | OTHERS | -- | UNSUPPORTED | -</br><sup>1</sup> PXF localizes a <code>Timestamp</code> to the current system timezone and converts it to universal time (UTC) before finally converting to <code>int96</code>. +</br><sup>1</sup> PXF localizes a <code>Timestamp</code> to the current system time zone and converts it to universal time (UTC) before finally converting to <code>int96</code>. </br><sup>2</sup> PXF converts a <code>Timestamptz</code> to a UTC <code>timestamp</code> and then converts to <code>int96</code>. PXF loses the time zone information during this conversion. 
diff --git a/docs/content/objstore_orc.html.md.erb b/docs/content/objstore_orc.html.md.erb index 7a0c0c90..03d64fa2 100644 --- a/docs/content/objstore_orc.html.md.erb +++ b/docs/content/objstore_orc.html.md.erb @@ -56,17 +56,17 @@ If you are accessing an S3 object store, you can provide S3 credentials via cust Refer to [Example: Reading an ORC File on HDFS](hdfs_orc.html#read_example) in the PXF Hadoop ORC documentation for an example. Modifications that you must make to run the example with an object store include: -- Copying the segment files to the object store instead of HDFS. For example, to copy the files to S3: +- Copying the ORC file to the object store instead of HDFS. For example, to copy the file to S3: ``` shell - $ aws s3 cp /tmp/sample-orc-data s3://BUCKET/pxf_examples/ + $ aws s3 cp /tmp/sampledata.orc s3://BUCKET/pxf_examples/orc_example/ ``` - Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above. For example, if your server name is `s3srvcfg`: ``` sql CREATE EXTERNAL TABLE sample_orc( location TEXT, month TEXT, num_orders INTEGER, total_sales NUMERIC(10,2), items_sold TEXT[] ) - LOCATION('pxf://BUCKET/pxf_examples/sample-orc-data?PROFILE=s3:orc&SERVER=s3srvcfg') + LOCATION('pxf://BUCKET/pxf_examples/orc_example?PROFILE=s3:orc&SERVER=s3srvcfg') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); ``` @@ -74,7 +74,7 @@ Refer to [Example: Reading an ORC File on HDFS](hdfs_orc.html#read_example) in t ``` sql CREATE WRITABLE EXTERNAL TABLE write_to_sample_orc (location TEXT, month TEXT, num_orders INT, total_sales NUMERIC(10,2), items_sold TEXT[]) - LOCATION ('pxf://BUCKET/pxf_examples/sample-orc-data?PROFILE=s3:orc&SERVER=s3srvcfg') + LOCATION ('pxf://BUCKET/pxf_examples/orc_example?PROFILE=s3:orc&SERVER=s3srvcfg') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export'); ``` --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: 
[email protected]
