Ranjith-AR commented on pull request #4401: URL: https://github.com/apache/iceberg/pull/4401#issuecomment-1081379394
> Add a Spark procedure to generate symlink manifests, so that systems without Iceberg support can read Iceberg table data using an external table:
>
> ```sql
> CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
> [PARTITIONED BY (col_name2 col_datatype2, ...)]
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '<symlink-table-root-path>'
> ```
>
> I did not add an action for this because this is just to give a gateway for users of any existing query engine that does not natively support Iceberg (in my case, Redshift Spectrum) to start reading Iceberg, since most engines support Hive with the symlink input format to some extent. If we think it deserves an action in the core API, I can add that as well.
>
> The procedure looks like:
>
> ```sql
> CALL catalog.system.generate_symlink_format_manifest(
>   table => 'table_name',
>   symlink_root_location => 's3://some/path'
> );
> ```
>
> The `symlink_root_location` is optional. The default is `<table_root>/_symlink_format_manifest/<snapshot_id>`. A snapshot ID suffix is added because if this procedure is executed twice against the same table, we don't want to mix the results if the table is updated. If users want a consistent root path for the symlink table, it can be passed in as an override.
>
> I thought about adding another option for `snapshot_id` in the input, so we could generate a symlink table for any historical snapshot, but decided against it to avoid making the procedure too complicated. We can add it as a follow-up if needed.
>
> The procedure currently returns the `snapshot_id` that the procedure is executed against, and `data_file_count` for the number of data files in the symlink manifests.
> Regarding partitioning, the generated symlink table exposes all the hidden partitions and uses the union of all historical table partition specs. For example, if the table is partitioned by spec 1 `category` and spec 2 `bucket(16, id)`, users are expected to create a symlink table with `PARTITIONED BY (id_bucket int, category string)`.
>
> Regarding merge-on-read, the generated symlink table does not consider delete files. This is basically a "snapshot view" of the table. A compaction is needed to generate the most up-to-date view of the table.

**In the default location `<table_root>/_symlink_format_manifest/<snapshot_id>` mentioned above, what is the name of the manifest file? Is it `<snapshot_id>_manifest`? Does this mean that when I regenerate the symlink format manifest, I have to alter the external table location? How do I execute the procedure so that it overrides/excludes the snapshot ID prefix and generates the manifest file at a location like `<table_root>/_symlink_format_manifest/manifest`?**
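To illustrate the last question: per the PR description, `symlink_root_location` can be passed to override the snapshot-suffixed default root. If I understand correctly, pointing both the external table and the procedure at the same fixed root would avoid altering the table location on each refresh. A rough sketch, where the table name, columns, and S3 path are all made up for illustration (and noting the description's caveat that reusing one root can mix results if the table is updated between runs):

```sql
-- Hypothetical example: table/column names and the S3 path are illustrative.
-- Create the external table once, pointing at a fixed symlink root. Partition
-- columns are the union of historical specs, as described above.
CREATE EXTERNAL TABLE mytable_symlink (id bigint, data string)
PARTITIONED BY (id_bucket int, category string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/warehouse/mytable/_symlink_format_manifest';

-- On each refresh, regenerate the manifests into the same fixed root
-- (overriding the snapshot-suffixed default) so the table location is stable:
CALL catalog.system.generate_symlink_format_manifest(
  table => 'db.mytable',
  symlink_root_location => 's3://bucket/warehouse/mytable/_symlink_format_manifest'
);
```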
