Ranjith-AR commented on pull request #4401: URL: https://github.com/apache/iceberg/pull/4401#issuecomment-1081379394
> Add a Spark procedure to generate symlink manifests, so that systems without Iceberg support can read Iceberg table data using an external table:
>
> ```sql
> CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
> [PARTITIONED BY (col_name2 col_datatype2, ...)]
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '<symlink-table-root-path>'
> ```
>
> I did not add an action for this because this is just to give a gateway for users of any existing query engine that does not natively support Iceberg (in my case, Redshift Spectrum) to start reading Iceberg, since most engines support Hive with the symlink input format to some extent. If we think it deserves an action in the core API, I can add that as well.
>
> The procedure looks like:
>
> ```sql
> CALL catalog.system.generate_symlink_format_manifest(
>   table => 'table_name',
>   symlink_root_location => 's3://some/path'
> );
> ```
>
> The `symlink_root_location` is optional. The default is `<table_root>/_symlink_format_manifest/<snapshot_id>`. A snapshot ID suffix is added because if this procedure is executed twice against the same table, we don't want to mix the results if the table is updated. If users want a consistent root path for the symlink table, it can be passed in as an override.
>
> I thought about adding another option for `snapshot_id` in the input, so we could generate a symlink table for any historical snapshot, but decided against it to avoid making the procedure too complicated. We can add it as a follow-up if needed.
>
> The procedure currently returns the `snapshot_id` that the procedure is executed against, and `data_file_count` for the number of data files in the symlink manifests.
> Regarding partitioning, the generated symlink table exposes all the hidden partitions and uses the union of all historical table partition specs. For example, if the table is partitioned by spec 1 `category` and spec 2 `bucket(16, id)`, users are expected to create a symlink table with `PARTITIONED BY (id_bucket int, category string)`.
>
> Regarding merge-on-read, the generated symlink table does not consider delete files. This is basically a "snapshot view" of the table. A compaction is needed to generate the most up-to-date view of the table.

**In the default location `<table_root>/_symlink_format_manifest/<snapshot_id>` mentioned above, what is the name of the manifest file? Is it `<snapshot_id>_manifest`? Does this mean that when I regenerate the symlink format manifest, I have to alter the external table location? How do I execute the procedure so that it overrides/excludes the snapshot ID prefix and generates the manifest file at a location like `<table_root>/_symlink_format_manifest/manifest`?**
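To illustrate the last question: per the PR description, `symlink_root_location` can be passed to override the snapshot-suffixed default root. If I understand correctly, pointing both the external table and the procedure at the same fixed root would avoid altering the table location on each refresh. A rough sketch, where the table name, columns, and S3 path are all made up for illustration (and noting the description's caveat that reusing one root can mix results if the table is updated between runs):

```sql
-- Hypothetical example: table/column names and the S3 path are illustrative.
-- Create the external table once, pointing at a fixed symlink root. Partition
-- columns are the union of historical specs, as described above.
CREATE EXTERNAL TABLE mytable_symlink (id bigint, data string)
PARTITIONED BY (id_bucket int, category string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/warehouse/mytable/_symlink_format_manifest';

-- On each refresh, regenerate the manifests into the same fixed root
-- (overriding the snapshot-suffixed default) so the table location is stable:
CALL catalog.system.generate_symlink_format_manifest(
  table => 'db.mytable',
  symlink_root_location => 's3://bucket/warehouse/mytable/_symlink_format_manifest'
);
```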
