Thanks guys. tableProperty("location", ...) works. Now I am having trouble getting Hive to query an Iceberg table by following https://iceberg.apache.org/hive/.

I have done:
* in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
* in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true. The same hive-site.xml is used by both the Hive server and Spark.

This is my code:

val table = "hive_test.mydb.mytable3"
val filePath = "hdfs://namenode:8020/tmp/test3.ice"
df.writeTo(table)
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", filePath)
  .createOrReplace()

The Iceberg table is created in the specified location and can be queried in Spark SQL:

root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
Found 2 items
drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata

The Hive table is created but cannot be queried:

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very close to having a happy path for migrating Parquet to Iceberg. Thanks.
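One thing worth double-checking, per the Hive integration docs linked above: Iceberg's Hive support can also be enabled per table rather than only through the global iceberg.engine.hive.enabled flag in hive-site.xml. A minimal sketch, reusing the df, table, and location from this message; this is a hedged suggestion to try, not a confirmed fix:

// sketch only: per the Iceberg Hive docs, engine.hive.enabled can be set as a
// table property, which is intended to make the metastore entry readable by
// Hive even if the global flag is not picked up by the Spark process that
// creates the table
df.writeTo("hive_test.mydb.mytable3")
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", "hdfs://namenode:8020/tmp/test3.ice")
  .tableProperty("engine.hive.enabled", "true")
  .createOrReplace()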
On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:

> The problem for #3 is how Spark handles the options. The option method
> sets write options, not table properties, and the write options aren't passed
> when creating the table. Instead, you should use tableProperty("location", ...).
>
> Ryan
>
> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>> integration from Iceberg, so you shouldn't see the data in Hive unless you
>> have configured that; see https://iceberg.apache.org/hive/.
>>
>> 3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>> I would check what properties are set on the table to see why the location
>> wasn't set. "location" would be the correct way of setting it, unless the
>> property is being ignored by Spark. I'm assuming you are using the latest
>> possible build of Spark: there is a bug in Spark 3.0 that sometimes ignores
>> options passed to the V2 API (https://issues.apache.org/jira/browse/SPARK-32592),
>> which is fixed in 3.1.
>>
>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>> Any help is highly appreciated!
>>
>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks Russell.
>>>
>>> I tried:
>>>
>>> /spark/bin/spark-shell --packages
>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>   --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>   --conf spark.sql.catalog.hive_test.type=hive
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1, 2, 3, 4, 5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val table = "hive_test.mydb.mytable3"
>>> df.writeTo(table)
>>>   .tableProperty("write.format.default", "parquet")
>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>   .createOrReplace()
>>>
>>> spark.table(table).show()
>>>
>>> Observations:
>>>
>>> 1. spark.table(table).show() does show the table correctly:
>>> +-----+
>>> |value|
>>> +-----+
>>> |    1|
>>> |    2|
>>> |    3|
>>> |    4|
>>> |    5|
>>> +-----+
>>>
>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>> hive> select * from mytable3;
>>> OK
>>> Time taken: 0.158 seconds
>>>
>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>
>>> Any idea about 2 and 3? Thanks very much.
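As Russell suggests above, checking which properties actually landed on the table can help narrow down observations 2 and 3. A minimal sketch of one way to do that from the same spark-shell session; the table name is the one used in this thread, and how much these standard Spark SQL commands report for an Iceberg (v2) table depends on the Spark version, so treat this as a suggestion only:

// see whether "location" was recorded as a table property or silently dropped
spark.sql("SHOW TBLPROPERTIES hive_test.mydb.mytable3").show(truncate = false)
// the extended description also reports the table's resolved location and provider
spark.sql("DESCRIBE TABLE EXTENDED hive_test.mydb.mytable3").show(truncate = false)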
>>>
>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> Specify a property of "location" when creating the table. Just add
>>>> .option("location", "path").
>>>>
>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>> Thanks Russell. This helps a lot.
>>>>
>>>> I want to specify an HDFS location when creating an Iceberg dataset
>>>> using the DataFrame API. All of the examples that use a warehouse location
>>>> are SQL. Do you have an example for the DataFrame API? For example, how do
>>>> I specify an HDFS/S3 location in the query below? The reason I ask is that
>>>> my current code all uses the Spark API; it will be much easier if I can
>>>> keep using the Spark API when migrating Parquet to Iceberg. Hope it makes
>>>> sense.
>>>>
>>>> data.writeTo("prod.db.table")
>>>>   .tableProperty("write.format.default", "orc")
>>>>   .partitionedBy($"level", days($"ts"))
>>>>   .createOrReplace()
>>>>
>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>> the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
>>>>>
>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ryan.
>>>>>>
>>>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>>>>
>>>>>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
>>>>>>   --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>
>>>>>> this statement:
>>>>>>
>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>
>>>>>> caused the warning:
>>>>>>
>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>>> data source provider iceberg.
>>>>>>
>>>>>> I tried:
>>>>>> * the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar
>>>>>>   into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
>>>>>> * using --packages
>>>>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>
>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>> this table. Any ideas are appreciated!
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Lian,
>>>>>>>
>>>>>>> I think we should improve the docs for catalogs since this isn't
>>>>>>> clear. We have a few configuration pages that are helpful, but they assume
>>>>>>> you already know what your options are. Take a look at the Spark docs for
>>>>>>> catalogs, which are the closest we have right now:
>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>
>>>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>>>
>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>
>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>>>> why it can only be used through SQL. This is something that should probably
>>>>>>> be added to Spark rather than Iceberg, since it is just a different way to
>>>>>>> build the same underlying Spark plan.
>>>>>>>
>>>>>>> To your question about DataFrames vs SQL, I highly recommend SQL over
>>>>>>> DataFrames so that you don't end up needing to use jars produced by
>>>>>>> compiling Scala code. I think it's easier to just use SQL. But Iceberg
>>>>>>> should support both, because DataFrames are useful for customization in
>>>>>>> some cases. It really should be up to you and what you want to use, as far
>>>>>>> as Iceberg is concerned.
>>>>>>>
>>>>>>> Ryan
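A sketch of the same catalog settings from the example above, wired through a SparkSession builder instead of spark-shell --conf flags, which may be convenient for the compiled Spark API code discussed in this thread. The catalog name hive_prod matches Ryan's example; the metastore host and port are placeholders, and as noted above the uri setting can be dropped to fall back to hive.metastore.uris from hive-site.xml:

import org.apache.spark.sql.SparkSession

// hive_prod is an Iceberg SparkCatalog backed by the Hive Metastore
val spark = SparkSession.builder()
  .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hive_prod.type", "hive")
  .config("spark.sql.catalog.hive_prod.uri", "thrift://metastore-host:9083") // placeholder host:port
  .getOrCreate()

// reference the table through the catalog name, as Russell points out above
spark.sql("CREATE TABLE hive_prod.default.mytable (uuid string) USING iceberg")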
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>
>>>>>>>> I use Spark on a K8S cluster to write Parquet to S3 and then add an
>>>>>>>> external table in the Hive metastore for that Parquet data. In the future,
>>>>>>>> when using Iceberg, I would prefer the Hive metastore since it is my
>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>> the Hive metastore is supported in the Iceberg AWS integration on
>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>
>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For
>>>>>>>> example, there is no "MERGE INTO"-like support in the Spark API. Is Spark
>>>>>>>> SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could
>>>>>>>> you clarify the rationale? I personally feel the Spark API is more
>>>>>>>> dev-friendly and scalable. Thanks very much!
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Lian,
>>>>>>>>>
>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>
>>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>>> the table metadata location, because S3 doesn't have a way to implement a
>>>>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie, and the upcoming release adds
>>>>>>>>> support for DynamoDB without Glue, as well as JDBC.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>>>
>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>>> should contain all the info that you need. The codebase contains an
>>>>>>>>>> S3FileIO class, which is an implementation that is backed by S3.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>>>>>> and wondering whether it is possible to create an Iceberg table on S3. The
>>>>>>>>>>> guide seems to say you can only write to a Hive table (backed by HDFS, if
>>>>>>>>>>> I understand correctly). Hudi and Delta can write to S3 with a specified
>>>>>>>>>>> S3 path. How can I do that with Iceberg? Thanks for any clarification.
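Tying the thread together: once a metastore-backed catalog is configured, the same DataFrame pattern from the top of the thread should accept an S3 path for the location property. A short sketch under assumptions; the catalog, database, table, and bucket names are placeholders, and the S3 credentials/S3FileIO setup described at https://iceberg.apache.org/aws/ is assumed to be in place:

// same pattern as the HDFS example above, but pointing the table at S3
df.writeTo("hive_prod.mydb.mytable_s3")  // placeholder catalog.db.table
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", "s3://my-bucket/warehouse/mydb/mytable_s3")  // placeholder bucket/path
  .createOrReplace()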