This is an automated email from the ASF dual-hosted git repository. djwang pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/cloudberry-pxf.git
commit 5e4b57fe63356c9c616b8318391073c6855b129b Author: Lisa Owen <[email protected]> AuthorDate: Wed Sep 14 16:37:35 2022 -0600 docs - misc updates/additions to the orc write docs (#864) * docs - misc updates/additions to the orc write docs * reword per bradford * more rewording per bradford --- docs/content/hdfs_orc.html.md.erb | 30 ++++++++++++++++++++++-------- docs/content/hdfs_parquet.html.md.erb | 2 +- docs/content/objstore_orc.html.md.erb | 8 ++++---- 3 files changed, 27 insertions(+), 13 deletions(-) diff --git a/docs/content/hdfs_orc.html.md.erb b/docs/content/hdfs_orc.html.md.erb index b424063c..b4e2098c 100644 --- a/docs/content/hdfs_orc.html.md.erb +++ b/docs/content/hdfs_orc.html.md.erb @@ -124,7 +124,7 @@ PXF supports writing the list ORC compound type for one-dimensional arrays, for ## <a id="createexttbl"></a>Creating the External Table -The PXF HDFS connector `hdfs:orc` profile supports reading and writing ORC-formatted HDFS files. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified. +The PXF HDFS connector `hdfs:orc` profile supports reading and writing ORC-formatted HDFS files. When you insert records into a writable external table, the block(s) of data that you insert are written to one file per segment in the directory that you specified. Use the following syntax to create a Greenplum Database external table that references an HDFS file or directory: @@ -156,6 +156,17 @@ The PXF `hdfs:orc` profile supports the following *read* options. You specify th | IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. 
| | MAP_BY_POSITION | A Boolean value that, when set to `true`, specifies that PXF should map an ORC column to a Greenplum Database column by position. The default value is `false`; PXF maps an ORC column to a Greenplum column by name. | +The PXF `hdfs:orc` profile supports a single compression-related write option; you specify this option in the `CREATE WRITABLE EXTERNAL TABLE` `LOCATION` clause: + +| Write Option | Value Description | +|-------|-------------------------------------| +| COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing ORC data include: `lz4`, `lzo`, `zstd`, `snappy`, `zlib`, and `none`. If this option is not specified, PXF compresses the data using `zlib` compression. | + +## <a id="write"></a>About Writing ORC Data + +When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specify in the `LOCATION` clause. + +When you insert ORC data records, the `pxf.orc.write.timezone.utc` property in the `pxf-site.xml` file governs how PXF writes timestamp values to the external data store. By default, PXF writes a timestamp type using the UTC time zone. If you require PXF to write a timestamp type using the local time zone of the PXF JVM, set the `pxf.orc.write.timezone.utc` property to `false` for the server and synchronize the PXF configuration.
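As a sketch of how the new write option composes with the `hdfs:orc` profile syntax shown elsewhere in this topic: the table name `orders_orc_zstd` below is hypothetical, and the HDFS path reuses the `/data/pxf_examples/orc_example` directory from the examples that follow:

``` sql
-- write ORC files compressed with zstd instead of the default zlib
CREATE WRITABLE EXTERNAL TABLE orders_orc_zstd (location text, month text, num_orders int, total_sales numeric(10,2))
  LOCATION ('pxf://data/pxf_examples/orc_example?PROFILE=hdfs:orc&COMPRESSION_CODEC=zstd')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```

Omitting `&COMPRESSION_CODEC=zstd` from the `LOCATION` URL writes `zlib`-compressed files, the default noted in the table above.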
## <a id="read_example"></a>Example: Reading an ORC File on HDFS @@ -184,10 +195,10 @@ Procedure: ``` shell $ echo '{"location": "Prague", "month": "Jan","num_orders": 101, "total_sales": 4875.33, "items_sold": ["boots", "hats"]} - {"location": "Rome", "month": "Mar","num_orders": 87, "total_sales": 1557.39, "items_sold": ["coats"]} - {"location": "Bangalore", "month": "May","num_orders": 317, "total_sales": 8936.99, "items_sold": ["winter socks", "long-sleeved shirts", "boots"]} - {"location": "Beijing", "month": "Jul","num_orders": 411, "total_sales": 11600.67, "items_sold": ["hoodies/sweaters", "pants"]} - {"location": "Los Angeles", "month": "Dec","num_orders": 0, "total_sales": 0.00, "items_sold": null} + {"location": "Rome", "month": "Mar","num_orders": 87, "total_sales": 1557.39, "items_sold": ["coats"]} + {"location": "Bangalore", "month": "May","num_orders": 317, "total_sales": 8936.99, "items_sold": ["winter socks", "long-sleeved shirts", "boots"]} + {"location": "Beijing", "month": "Jul","num_orders": 411, "total_sales": 11600.67, "items_sold": ["hoodies/sweaters", "pants"]} + {"location": "Los Angeles", "month": "Dec","num_orders": 0, "total_sales": 0.00, "items_sold": null} ' > /tmp/sampledata.json ``` @@ -268,7 +279,9 @@ Procedure: ## <a id="write_example"></a>Example: Writing to an ORC File on HDFS -1. Use the `hdfs:orc` profile to create a writable external table that references `sample_orc`: +In this example, you create a writable external table to write some data to the directory referenced by the `sample_orc` table. + +1. Create an external table that specifies the `hdfs:orc` profile and the HDFS directory `/data/pxf_examples/orc_example` in the `LOCATION` URL: ``` sql postgres=# CREATE WRITABLE EXTERNAL TABLE write_to_sample_orc (location text, month text, num_orders int, total_sales numeric(10,2), items_sold text[] ) @@ -276,15 +289,16 @@ Procedure: FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export'); ``` -1. 
Write a few records to segment files in the `orc_example` directory by inserting to the `write_to_sample_orc` table: +1. Write a few records to segment files in the `orc_example` directory by inserting into the `write_to_sample_orc` table: ``` sql postgres=# INSERT INTO write_to_sample_orc VALUES ( 'Frankfurt', 'Mar', 777, 3956.98, '{"winter socks","pants",boots}' ); postgres=# INSERT INTO write_to_sample_orc VALUES ( 'Cleveland', 'Oct', 3218, 96645.37, '{"long-sleeved shirts",hats}' ); ``` -1. Recall that Greenplum Database does not support directly querying a writable external table. Query the `sample_orc` table to read the data you added to files in the `orc_example` directory: +1. Recall that Greenplum Database does not support directly querying a writable external table. Query the `sample_orc` table that you created in the previous example to read the new data that you added: ``` sql postgres=# SELECT * FROM sample_orc ORDER BY num_orders; ``` + diff --git a/docs/content/hdfs_parquet.html.md.erb b/docs/content/hdfs_parquet.html.md.erb index d879abbd..df400d58 100644 --- a/docs/content/hdfs_parquet.html.md.erb +++ b/docs/content/hdfs_parquet.html.md.erb @@ -88,7 +88,7 @@ PXF uses the following data type mapping when writing Parquet data: | Text | utf8 | binary | | OTHERS | -- | UNSUPPORTED | -</br><sup>1</sup> PXF localizes a <code>Timestamp</code> to the current system timezone and converts it to universal time (UTC) before finally converting to <code>int96</code>. +</br><sup>1</sup> PXF localizes a <code>Timestamp</code> to the current system time zone and converts it to universal time (UTC) before finally converting to <code>int96</code>. </br><sup>2</sup> PXF converts a <code>Timestamptz</code> to a UTC <code>timestamp</code> and then converts to <code>int96</code>. PXF loses the time zone information during this conversion. 
diff --git a/docs/content/objstore_orc.html.md.erb b/docs/content/objstore_orc.html.md.erb index 7a0c0c90..03d64fa2 100644 --- a/docs/content/objstore_orc.html.md.erb +++ b/docs/content/objstore_orc.html.md.erb @@ -56,17 +56,17 @@ If you are accessing an S3 object store, you can provide S3 credentials via cust Refer to [Example: Reading an ORC File on HDFS](hdfs_orc.html#read_example) in the PXF Hadoop ORC documentation for an example. Modifications that you must make to run the example with an object store include: -- Copying the segment files to the object store instead of HDFS. For example, to copy the files to S3: +- Copying the ORC file to the object store instead of HDFS. For example, to copy the file to S3: ``` shell - $ aws s3 cp /tmp/sample-orc-data s3://BUCKET/pxf_examples/ + $ aws s3 cp /tmp/sampledata.orc s3://BUCKET/pxf_examples/orc_example/ ``` - Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above. For example, if your server name is `s3srvcfg`: ``` sql CREATE EXTERNAL TABLE sample_orc( location TEXT, month TEXT, num_orders INTEGER, total_sales NUMERIC(10,2), items_sold TEXT[] ) - LOCATION('pxf://BUCKET/pxf_examples/sample-orc-data?PROFILE=s3:orc&SERVER=s3srvcfg') + LOCATION('pxf://BUCKET/pxf_examples/orc_example?PROFILE=s3:orc&SERVER=s3srvcfg') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); ``` @@ -74,7 +74,7 @@ Refer to [Example: Reading an ORC File on HDFS](hdfs_orc.html#read_example) in t ``` sql CREATE WRITABLE EXTERNAL TABLE write_to_sample_orc (location TEXT, month TEXT, num_orders INT, total_sales NUMERIC(10,2), items_sold TEXT[]) - LOCATION ('pxf://BUCKET/pxf_examples/sample-orc-data?PROFILE=s3:orc&SERVER=s3srvcfg') + LOCATION ('pxf://BUCKET/pxf_examples/orc_example?PROFILE=s3:orc&SERVER=s3srvcfg') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export'); ``` --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: 
[email protected]
