hive> describe formatted mytable3;
OK
# col_name              data_type               comment
value                   int
# Detailed Table Information
Database:               mydb
OwnerType:              USER
Owner:                  root
CreateTime:             Wed Aug 11 20:02:14 UTC 2021
LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
Retention:              2147483647
Location:               hdfs://namenode:8020/tmp/test3.ice
Table Type:             EXTERNAL_TABLE
Table Parameters:
    EXTERNAL                TRUE
    metadata_location       hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
    numFiles                8
    storage_handler         org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
    table_type              ICEBERG
    totalSize               12577
    transient_lastDdlTime   1628712134

# Storage Information
SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat
OutputFormat:           org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.319 seconds, Fetched: 29 row(s)

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

Thanks!

On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <b...@tabular.io> wrote:

> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is a storage handler set up for it.
>
> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Thanks guys. tableProperty("location", ...) works.
>>
>> I am having trouble making Hive query an Iceberg table by following https://iceberg.apache.org/hive/.
>>
>> I have done the following:
>> * in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
>> * in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
>> The same hive-site.xml is used by both the Hive server and Spark.
>>
>> This is my code:
>>
>> val table = "hive_test.mydb.mytable3"
>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>> df.writeTo(table)
>>   .tableProperty("write.format.default", "parquet")
>>   .tableProperty("location", filePath)
>>   .createOrReplace()
>>
>> The Iceberg files are created at the specified location, and the table can be queried with Spark SQL:
>>
>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>> Found 2 items
>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata
>>
>> The Hive table is created but cannot be queried:
>>
>> hive> select * from mytable3;
>> FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice
>>
>> I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very close to having a happy path for migrating Parquet to Iceberg. Thanks.
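For anyone following along, here is the Hive-side setup from this thread collected in one place. A minimal session sketch, assuming the table was created by Spark as above; the jar path is a placeholder:

    -- in the Hive shell, before querying the Iceberg table
    hive> ADD JAR /path/to/iceberg-hive-runtime.jar;
    hive> SET hive.vectorized.execution.enabled=false;  -- the Hive integration docs call for disabling vectorized execution
    hive> SELECT * FROM mydb.mytable3;

iceberg.engine.hive.enabled=true is the property that matters on the Spark side: it makes Spark register Hive-readable metadata (the storage handler and SerDe visible in the DESCRIBE output above) when it creates the table, so it belongs in the Hadoop/Hive configuration that Spark uses, as was done here.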
>>
>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> The problem for #3 is how Spark handles the options. The option method sets write options, not table properties, and write options aren't passed when creating the table. Instead, you should use tableProperty("location", ...).
>>>
>>> Ryan
>>>
>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> 2) Hive cannot read Iceberg tables without configuring Iceberg's MR Hive integration, so you shouldn't see the data in Hive unless you have configured that; see https://iceberg.apache.org/hive/.
>>>>
>>>> 3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>> I would check what properties are set on the table to see why the location wasn't set, but "location" would be the correct way of setting it. That is, unless the property is being ignored by Spark; I'm assuming you are using the latest available build of Spark. There is a bug in Spark 3.0 that sometimes ignores options passed to the V2 API, https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.
>>>>
>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>> Any help is highly appreciated!
>>>>
>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Russell.
>>>>>
>>>>> I tried:
>>>>>
>>>>> /spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_test.type=hive
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>> val values = List(1,2,3,4,5)
>>>>>
>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>> import spark.implicits._
>>>>> val df = values.toDF()
>>>>>
>>>>> val table = "hive_test.mydb.mytable3"
>>>>> df.writeTo(table)
>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>>   .createOrReplace()
>>>>>
>>>>> spark.table(table).show()
>>>>>
>>>>> Observations:
>>>>>
>>>>> 1. spark.table(table).show() does show the table correctly:
>>>>> +-----+
>>>>> |value|
>>>>> +-----+
>>>>> |    1|
>>>>> |    2|
>>>>> |    3|
>>>>> |    4|
>>>>> |    5|
>>>>> +-----+
>>>>>
>>>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>>>> hive> select * from mytable3;
>>>>> OK
>>>>> Time taken: 0.158 seconds
>>>>>
>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>
>>>>> Any ideas about 2 and 3? Thanks very much.
>>>>>
>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Specify a property of "location" when creating the table. Just add .option("location", "path").
>>>>>>
>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks Russell. This helps a lot.
>>>>>>
>>>>>> I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API, for example, how to specify an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API; it will be much easier if I can keep using the Spark API when migrating Parquet to Iceberg. Hope that makes sense.
>>>>>>
>>>>>> data.writeTo("prod.db.table")
>>>>>>   .tableProperty("write.format.default", "orc")
>>>>>>   .partitionedBy($"level", days($"ts"))
>>>>>>   .createOrReplace()
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "USE hive_prod" or refer to the table with the catalog identifier, e.g. "CREATE TABLE hive_prod.default.mytable".
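To make that concrete, a minimal Spark SQL sketch of Russell's two options (the table name is from the message below):

    -- either make hive_prod the current catalog first
    USE hive_prod;
    CREATE TABLE default.mytable (uuid string) USING iceberg;

    -- or qualify the table name with the catalog
    CREATE TABLE hive_prod.default.mytable (uuid string) USING iceberg;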
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Ryan.
>>>>>>>>
>>>>>>>> I am using this command (the uri is omitted because it is in hive-site.xml):
>>>>>>>>
>>>>>>>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>>>
>>>>>>>> This statement:
>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>>
>>>>>>>> causes this warning:
>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>>>>>>>>
>>>>>>>> I tried:
>>>>>>>> * the solution (putting iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
>>>>>>>> * using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>
>>>>>>>> but they did not help. This warning blocks inserting any data into the table. Any ideas are appreciated!
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Lian,
>>>>>>>>>
>>>>>>>>> I think we should improve the docs for catalogs since this isn't clear. We have a few configuration pages that are helpful, but they seem to assume you already know what your options are. Take a look at the Spark docs for catalogs, which are the closest thing we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>
>>>>>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>>>>>
>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>>
>>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark rather than Iceberg, since it is just a different way to build the same underlying Spark plan.
>>>>>>>>>
>>>>>>>>> To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>>>>>>>>>
>>>>>>>>> Ryan
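Since MERGE INTO comes up a couple of times here, a minimal Spark SQL sketch of the SQL-only form (table and column names are hypothetical):

    MERGE INTO hive_prod.db.target t
    USING hive_prod.db.updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.value = u.value
    WHEN NOT MATCHED THEN INSERT *;

Note that MERGE INTO needs the Iceberg SQL extensions enabled (spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions).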
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>
>>>>>>>>>> I use Spark on a K8s cluster to write Parquet to S3 and then add an external table in the Hive metastore for that Parquet data. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration at https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>>>
>>>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg; for example, there is no Spark API equivalent of SQL's MERGE INTO. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind that? I personally find the Spark API more developer-friendly and scalable. Thanks very much!
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian,
>>>>>>>>>>>
>>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so that tables are automatically created there.
>>>>>>>>>>>
>>>>>>>>>>> The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie, and the upcoming release adds support for DynamoDB without Glue and for JDBC.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an S3FileIO class, which is a FileIO implementation backed by S3.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering whether it is possible to create an Iceberg table on S3. This guide seems to say you can only write to a Hive table (backed by HDFS, if I understand correctly). Hudi and Delta can write to S3 with a specified S3 path. How can I do the same using Iceberg? Thanks for any clarification.
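Putting Ryan's and Eduard's advice together, a hedged end-to-end sketch of creating an Iceberg table on S3 through a Hive-metastore catalog. The catalog name, bucket, and paths are illustrative; S3FileIO requires the iceberg-aws module and the AWS SDK on the classpath; and for a Hive-type catalog the warehouse default may instead come from the metastore configuration:

    # catalog configuration (spark-defaults.conf or --conf flags)
    spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.hive_prod.type = hive
    spark.sql.catalog.hive_prod.warehouse = s3://my-bucket/warehouse
    spark.sql.catalog.hive_prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO

    -- then create the table with an explicit S3 location, or omit LOCATION to use the warehouse default
    CREATE TABLE hive_prod.db.sample (id bigint, data string)
    USING iceberg
    LOCATION 's3://my-bucket/path/sample';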