It looks like the table is set up correctly, so I think the problem is in how Hive is configured: by default, 0.11.1 tries to load tables by location. You need to tell it to load tables as metastore tables, not HDFS tables, by running `SET iceberg.mr.catalog=hive`.
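
For example, a minimal Hive session would look something like this (reusing the runtime jar you already added; adjust the path for your install):

add jar /path/to/iceberg-hive-runtime.jar;
-- load tables through the Hive catalog instead of by location
SET iceberg.mr.catalog=hive;
select * from mydb.mytable3;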

On Wed, Aug 11, 2021 at 3:51 PM Lian Jiang <jiangok2...@gmail.com> wrote:

> hive> describe formatted mytable3;
> OK
> # col_name            data_type    comment
> value                 int
>
> # Detailed Table Information
> Database:             mydb
> OwnerType:            USER
> Owner:                root
> CreateTime:           Wed Aug 11 20:02:14 UTC 2021
> LastAccessTime:       Sun Jan 11 15:25:29 UTC 1970
> Retention:            2147483647
> Location:             hdfs://namenode:8020/tmp/test3.ice
> Table Type:           EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL              TRUE
>   metadata_location     hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
>   numFiles              8
>   storage_handler       org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
>   table_type            ICEBERG
>   totalSize             12577
>   transient_lastDdlTime 1628712134
>
> # Storage Information
> SerDe Library:        org.apache.iceberg.mr.hive.HiveIcebergSerDe
> InputFormat:          org.apache.iceberg.mr.hive.HiveIcebergInputFormat
> OutputFormat:         org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
> Compressed:           No
> Num Buckets:          0
> Bucket Columns:       []
> Sort Columns:         []
> Time taken: 0.319 seconds, Fetched: 29 row(s)
>
> hive> select * from mytable3;
> FAILED: SemanticException Table does not exist at location:
> hdfs://namenode:8020/tmp/test3.ice
>
> Thanks!
>
> On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if
>> there is a storage handler set up for it.
>>
>> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks guys. tableProperty("location", ...) works.
>>>
>>> I am having trouble making Hive query an Iceberg table by following
>>> https://iceberg.apache.org/hive/.
>>>
>>> I have done the following:
>>> * in the Hive shell, ran `add jar /path/to/iceberg-hive-runtime.jar;`
>>> * in hive-site.xml, set hive.vectorized.execution.enabled=false and
>>>   iceberg.engine.hive.enabled=true.
>>> The same hive-site.xml is used by both the Hive server and Spark.
>>>
>>> This is my code:
>>> val table = "hive_test.mydb.mytable3"
>>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>>> df.writeTo(table)
>>>   .tableProperty("write.format.default", "parquet")
>>>   .tableProperty("location", filePath)
>>>   .createOrReplace()
>>>
>>> The Iceberg table is created in the specified location and can be
>>> queried in Spark SQL:
>>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>>> Found 2 items
>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata
>>>
>>> The Hive table is created but cannot be queried:
>>> hive> select * from mytable3;
>>> FAILED: SemanticException Table does not exist at location:
>>> hdfs://namenode:8020/tmp/test3.ice
>>>
>>> I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am
>>> very close to having a happy path for migrating Parquet to Iceberg.
>>> Thanks.
>>>
>>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> The problem for #3 is how Spark handles the options. The option
>>>> method sets write options, not table properties, and write options
>>>> aren't passed when creating the table. Instead, you should use
>>>> tableProperty("location", ...).
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer
>>>> <russell.spit...@gmail.com> wrote:
>>>>
>>>>> 2) Hive cannot read Iceberg tables without configuring the Hive MR
>>>>> integration from Iceberg.
>>>>> So you shouldn't see it in Hive unless you have configured that;
>>>>> see https://iceberg.apache.org/hive/.
>>>>>
>>>>> 3)
>>>>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>>> I would check what properties are set on the table to see why that
>>>>> wasn't set, but "location" is the correct property for setting the
>>>>> table location, unless the property is being ignored by Spark. I'm
>>>>> assuming you are using the latest build of Spark possible: there is
>>>>> a bug in Spark 3.0 that sometimes ignores options passed to the V2
>>>>> API (https://issues.apache.org/jira/browse/SPARK-32592), which is
>>>>> fixed in 3.1.
>>>>>
>>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>> Any help is highly appreciated!
>>>>>
>>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Russell.
>>>>>>
>>>>>> I tried:
>>>>>>
>>>>>> /spark/bin/spark-shell --packages
>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>>>
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>
>>>>>> val values = List(1,2,3,4,5)
>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>> import spark.implicits._
>>>>>> val df = values.toDF()
>>>>>>
>>>>>> val table = "hive_test.mydb.mytable3"
>>>>>> df.writeTo(table)
>>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>>>   .createOrReplace()
>>>>>>
>>>>>> spark.table(table).show()
>>>>>>
>>>>>> Observations:
>>>>>> 1. spark.table(table).show() does show the table correctly.
>>>>>> +-----+
>>>>>> |value|
>>>>>> +-----+
>>>>>> |    1|
>>>>>> |    2|
>>>>>> |    3|
>>>>>> |    4|
>>>>>> |    5|
>>>>>> +-----+
>>>>>>
>>>>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>>>>> hive> select * from mytable3;
>>>>>> OK
>>>>>> Time taken: 0.158 seconds
>>>>>>
>>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>>
>>>>>> Any ideas about 2 and 3? Thanks very much.
>>>>>>
>>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer
>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> Specify a property of "location" when creating the table. Just
>>>>>>> add .option("location", "path").
>>>>>>>
>>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks Russell. This helps a lot.
>>>>>>>
>>>>>>> I want to specify an HDFS location when creating an Iceberg
>>>>>>> dataset using the DataFrame API. All the examples that use a
>>>>>>> warehouse location are SQL. Do you have an example for the
>>>>>>> DataFrame API? For example, how would I specify an HDFS/S3
>>>>>>> location in the query below? The reason I ask is that my current
>>>>>>> code all uses the Spark API. It will be much easier if I can use
>>>>>>> the Spark API when migrating Parquet to Iceberg. Hope it makes
>>>>>>> sense.
>>>>>>>
>>>>>>> data.writeTo("prod.db.table")
>>>>>>>     .tableProperty("write.format.default", "orc")
>>>>>>>     .partitionBy($"level", days($"ts"))
>>>>>>>     .createOrReplace()
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer
>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>>>> reference it you need to either "use hive_prod" or refer to the
>>>>>>>> table with the catalog identifier: "CREATE TABLE
>>>>>>>> hive_prod.default.mytable".
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ryan.
>>>>>>>>>
>>>>>>>>> Using this command (the uri is omitted because it is in
>>>>>>>>> hive-site.xml):
>>>>>>>>> spark-shell --conf
>>>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
>>>>>>>>> --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>>>>
>>>>>>>>> this statement:
>>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>>>
>>>>>>>>> caused the warning:
>>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe
>>>>>>>>> for data source provider iceberg.
>>>>>>>>>
>>>>>>>>> I tried:
>>>>>>>>> * the solution (putting iceberg-hive-runtime.jar and
>>>>>>>>>   iceberg-spark3-runtime.jar into spark/jars) mentioned in
>>>>>>>>>   https://github.com/apache/iceberg/issues/2260
>>>>>>>>> * using --packages
>>>>>>>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>>
>>>>>>>>> but they did not help. This warning blocks inserting any data
>>>>>>>>> into the table. Any ideas are appreciated!
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Lian,
>>>>>>>>>>
>>>>>>>>>> I think we should improve the docs for catalogs since this
>>>>>>>>>> isn't clear. We have a few configuration pages that are
>>>>>>>>>> helpful, but it looks like they assume you already know what
>>>>>>>>>> your options are. Take a look at the Spark docs for catalogs,
>>>>>>>>>> which are the closest we have right now:
>>>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>>
>>>>>>>>>> What you'll want to do is configure a catalog like the first
>>>>>>>>>> example:
>>>>>>>>>>
>>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris
>>>>>>>>>> # in hive-site.xml
>>>>>>>>>>
>>>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark,
>>>>>>>>>> which is why it is only available through SQL. This is
>>>>>>>>>> something that should probably be added to Spark rather than
>>>>>>>>>> Iceberg, since it is just a different way to build the same
>>>>>>>>>> underlying Spark plan.
>>>>>>>>>>
>>>>>>>>>> To your question about DataFrames vs SQL, I highly recommend
>>>>>>>>>> SQL over DataFrames so that you don't end up needing to use
>>>>>>>>>> jars produced by compiling Scala code. I think it's easier to
>>>>>>>>>> just use SQL. But Iceberg should support both, because
>>>>>>>>>> DataFrames are useful for customization in some cases. As far
>>>>>>>>>> as Iceberg is concerned, it really should be up to you and
>>>>>>>>>> what you want to use.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>>
>>>>>>>>>>> I use Spark on a K8S cluster to write Parquet to S3 and then
>>>>>>>>>>> add an external table in the Hive metastore for that Parquet.
>>>>>>>>>>> In the future, when using Iceberg, I would prefer the Hive
>>>>>>>>>>> metastore since it is my centralized metastore for batch and
>>>>>>>>>>> streaming datasets. I don't see that the Hive metastore is
>>>>>>>>>>> supported in the Iceberg AWS integration on
>>>>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for
>>>>>>>>>>> that?
>>>>>>>>>>>
>>>>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For
>>>>>>>>>>> example, there is no "SQL MERGE INTO"-like support in the
>>>>>>>>>>> Spark API. Is Spark SQL preferred over the Spark
>>>>>>>>>>> DataFrame/Dataset API in Iceberg? If so, could you clarify
>>>>>>>>>>> the rationale behind that? I personally feel the Spark API is
>>>>>>>>>>> more dev friendly and scalable. Thanks very much!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lian,
>>>>>>>>>>>>
>>>>>>>>>>>> Iceberg tables work great in S3. When creating the table,
>>>>>>>>>>>> just pass the `LOCATION` clause with an S3 path, or set your
>>>>>>>>>>>> catalog's warehouse location to S3 so tables are
>>>>>>>>>>>> automatically created there.
>>>>>>>>>>>>
>>>>>>>>>>>> The only restriction for S3 is that you need a metastore to
>>>>>>>>>>>> track the table metadata location, because S3 doesn't have a
>>>>>>>>>>>> way to implement a metadata commit. For a metastore, there
>>>>>>>>>>>> are implementations backed by the Hive MetaStore,
>>>>>>>>>>>> Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>>>>>>>> support for DynamoDB without Glue, and for JDBC.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner
>>>>>>>>>>>> <edu...@dremio.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/.
>>>>>>>>>>>>> It should contain all the info that you need. The codebase
>>>>>>>>>>>>> contains an S3FileIO class, which is a FileIO
>>>>>>>>>>>>> implementation backed by S3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang
>>>>>>>>>>>>> <jiangok2...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am reading
>>>>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>>>>> wondering if it is possible to create an Iceberg table on
>>>>>>>>>>>>>> S3. The guide seems to say it only writes to a Hive table
>>>>>>>>>>>>>> (backed by HDFS, if I understand correctly). Hudi and
>>>>>>>>>>>>>> Delta can write to S3 with a specified S3 path. How can I
>>>>>>>>>>>>>> do the same using Iceberg? Thanks for any clarification.

--
Ryan Blue
Tabular