`SET iceberg.mr.catalog=hive` works!!! Thanks Ryan, you rock!!! You may consider adding the notes below to the Iceberg documentation to help other newcomers:
* Add `SET iceberg.mr.catalog=hive` to https://iceberg.apache.org/hive/.
* Add `.tableProperty("location", filePath)` to https://iceberg.apache.org/spark-writes/.

On Wed, Aug 11, 2021 at 3:56 PM Ryan Blue <b...@tabular.io> wrote:

Looks like the table is set up correctly. I think the problem might be how Hive is configured. I think by default it will try to load tables by location in 0.11.1. You need to tell it to load tables as metastore tables, not HDFS tables, by running `SET iceberg.mr.catalog=hive`.

On Wed, Aug 11, 2021 at 3:51 PM Lian Jiang <jiangok2...@gmail.com> wrote:

hive> describe formatted mytable3;
OK
# col_name              data_type               comment
value                   int

# Detailed Table Information
Database:               mydb
OwnerType:              USER
Owner:                  root
CreateTime:             Wed Aug 11 20:02:14 UTC 2021
LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
Retention:              2147483647
Location:               hdfs://namenode:8020/tmp/test3.ice
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        metadata_location       hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
        numFiles                8
        storage_handler         org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
        table_type              ICEBERG
        totalSize               12577
        transient_lastDdlTime   1628712134

# Storage Information
SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat
OutputFormat:           org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.319 seconds, Fetched: 29 row(s)

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

Thanks!

On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <b...@tabular.io> wrote:

Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is a storage handler set up for it.

On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks guys. tableProperty("location", ...) works.

I have trouble making Hive query an Iceberg table by following https://iceberg.apache.org/hive/.

I have done:
* in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
* in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
The same hive-site.xml is used by both the Hive server and Spark.

This is my code:

    val table = "hive_test.mydb.mytable3"
    val filePath = "hdfs://namenode:8020/tmp/test3.ice"
    df.writeTo(table)
      .tableProperty("write.format.default", "parquet")
      .tableProperty("location", filePath)
      .createOrReplace()

The Iceberg table is created at the specified location and can be queried in Spark SQL:

    root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
    Found 2 items
    drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
    drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata

The Hive table is created but cannot be queried:

    hive> select * from mytable3;
    FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very close to having a happy path for migrating parquet to Iceberg. Thanks.
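To summarize the working Hive-side recipe from this thread: a minimal sketch of the session that succeeds once `iceberg.mr.catalog` is set (the jar path is a placeholder, and hive-site.xml is assumed to already carry the two properties listed above):

    hive> add jar /path/to/iceberg-hive-runtime.jar;
    hive> SET iceberg.mr.catalog=hive;
    hive> select * from mydb.mytable3;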
On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:

The problem for #3 is how Spark handles the options. The option method sets write options, not table properties, and the write options aren't passed when creating the table. Instead, you should use tableProperty("location", ...).

Ryan

On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

2) Hive cannot read Iceberg tables without configuring the MR Hive integration from Iceberg, so you shouldn't see the data in Hive unless you have configured that; see https://iceberg.apache.org/hive/.

3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
I would check what properties are set in the table to see why that wasn't set, but "location" would be the correct way of setting the table location, unless the property is being ignored by Spark. I'm assuming you are using the latest possible build of Spark; there is a bug in Spark 3.0 which sometimes ignores options passed to the V2 API, https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.

On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:

Any help is highly appreciated!

On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Russell.

I tried:

    /spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
      --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.hive_test.type=hive

    import org.apache.spark.sql.SparkSession
    val values = List(1,2,3,4,5)

    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._
    val df = values.toDF()

    val table = "hive_test.mydb.mytable3"
    df.writeTo(table)
      .tableProperty("write.format.default", "parquet")
      *.option("location", "hdfs://namenode:8020/tmp/test.ice")*
      .createOrReplace()

    spark.table(table).show()

Observations:

1. spark.table(table).show() does show the table correctly:

    +-----+
    |value|
    +-----+
    |    1|
    |    2|
    |    3|
    |    4|
    |    5|
    +-----+

2. mydb.mytable3 is created in Hive but it is empty:

    hive> select * from mytable3;
    OK
    Time taken: 0.158 seconds

3. test.ice is not generated in the HDFS folder /tmp.

Any idea about 2 and 3? Thanks very much.

On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

Specify a property of "location" when creating the table. Just add a ".option("location", "path")".

On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Russell. This helps a lot.

I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API? For example, how do I support an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API; it will be much easier if I can keep using the Spark API when migrating parquet to Iceberg. Hope it makes sense.

    data.writeTo("prod.db.table")
        .tableProperty("write.format.default", "orc")
        .partitionedBy($"level", days($"ts"))
        .createOrReplace()
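For completeness, here is what the DataFrame-API answer to this question ends up looking like once the resolution from the top of the thread (tableProperty, not option) is applied; a sketch reusing the docs example above, with the imports it needs (assumes a `spark` session and a `data` DataFrame are in scope):

    import org.apache.spark.sql.functions.days
    import spark.implicits._ // for the $"..." column syntax

    data.writeTo("prod.db.table")
      .tableProperty("write.format.default", "orc")
      // a table property, not a write option, so it is applied at table creation
      .tableProperty("location", "hdfs://namenode:8020/tmp/test.ice")
      .partitionedBy($"level", days($"ts"))
      .createOrReplace()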
On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".

On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Ryan.

Using this command (uri is omitted because the URI is in hive-site.xml):

    spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.hive_prod.type=hive

this statement:

    spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")

caused the warning:

    WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.

I tried:
* the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
* using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1

but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!

On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:

Lian,

I think we should improve the docs for catalogs since this isn't clear. We have a few configuration pages that are helpful, but it looks like they assume you already know what your options are. Take a look at the Spark docs for catalogs, which is the closest we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration

What you'll want to do is to configure a catalog like the first example:

    spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.hive_prod.type = hive
    spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
    # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml

For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark, not Iceberg, since it is just a different way to build the same underlying Spark plan.

To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.

Ryan
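If it helps to set those catalog properties in code rather than on the spark-shell command line, a minimal sketch using the same keys as Ryan's example (the thrift URI is a placeholder; as noted above, it can be omitted to fall back to hive.metastore.uris from hive-site.xml):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local")
      .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hive_prod.type", "hive")
      // placeholder URI; drop this line to use hive.metastore.uris from hive-site.xml
      .config("spark.sql.catalog.hive_prod.uri", "thrift://metastore-host:port")
      .getOrCreate()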
On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Eduard and Ryan.

I use Spark on a K8S cluster to write parquet on S3 and then add an external table in the Hive metastore for this parquet. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration on https://iceberg.apache.org/aws/. Is there another link for that?

Most of the examples use Spark SQL to write/read Iceberg. For example, there is no "SQL MERGE INTO"-like support in the Spark API. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind that? I personally feel the Spark API is more dev friendly and scalable. Thanks very much!

On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:

Lian,

Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.

The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue, and JDBC.

Ryan

On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:

Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an S3FileIO class, which is an implementation that is backed by S3.

On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:

I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering if it is possible to create an Iceberg table on S3. This guide seems to only write to a Hive table (backed by HDFS, if I understand correctly). Hudi and Delta can write to S3 with a specified S3 path. How can I do it using Iceberg? Thanks for any clarification.
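Closing the loop on this original question with the answer from above: a minimal sketch of creating an Iceberg table at an explicit S3 path (the bucket name is a placeholder, and a Hive-backed catalog such as hive_prod is assumed to be configured as shown earlier, since S3 still needs a metastore for the metadata commit):

    // Explicit S3 location via the LOCATION clause (Ryan's first option).
    spark.sql("""
      CREATE TABLE hive_prod.default.mytable (uuid string)
      USING iceberg
      LOCATION 's3://my-bucket/warehouse/mytable'
    """)

    // Alternatively (the second option), point the catalog's warehouse at S3
    // so new tables land there automatically:
    //   --conf spark.sql.catalog.hive_prod.warehouse=s3://my-bucket/warehouse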