Thanks Russell. I tried:
/spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_test.type=hive

import org.apache.spark.sql.SparkSession

val values = List(1, 2, 3, 4, 5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
val table = "hive_test.mydb.mytable3"
df.writeTo(table)
  .tableProperty("write.format.default", "parquet")
  .option("location", "hdfs://namenode:8020/tmp/test.ice")
  .createOrReplace()
spark.table(table).show()

*Observations*:

1. spark.table(table).show() does show the table correctly:

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+

2. mydb.mytable3 is created in Hive, but it is empty:

hive> select * from mytable3;
OK
Time taken: 0.158 seconds

3. test.ice is not generated in the HDFS folder /tmp.

Any idea about 2 and 3? Thanks very much.
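(One way to dig into 2 and 3 is to ask the table itself where its files went. The snippet below is only a sketch against the names used above; it relies on Iceberg's queryable metadata tables (files, snapshots), which can be selected from through the same catalog.)

// Sketch: same spark-shell session and hive_test catalog as above.
// Iceberg exposes metadata tables that can be queried like normal tables.

// Data files and their full paths (shows where the Parquet files actually live):
spark.sql("SELECT file_path, record_count FROM hive_test.mydb.mytable3.files").show(false)

// Snapshots and the manifest-list files they point to:
spark.sql("SELECT snapshot_id, manifest_list FROM hive_test.mydb.mytable3.snapshots").show(false)

// Spark's view of the table, including the properties it reports:
spark.sql("DESCRIBE TABLE EXTENDED hive_test.mydb.mytable3").show(false)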
On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Specify a property of "location" when creating the table. Just add a ".option("location", "path")".
>
> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>
> Thanks Russell. This helps a lot.
>
> I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API? For example, how do I supply an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API. It will be much easier if I can use the Spark API when migrating from Parquet to Iceberg. Hope it makes sense.
>
> data.writeTo("prod.db.table")
>     .tableProperty("write.format.default", "orc")
>     .partitionedBy($"level", days($"ts"))
>     .createOrReplace()
>
> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
>>
>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks Ryan.
>>>
>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_prod.type=hive
>>>
>>> This statement:
>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>
>>> caused the warning:
>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>>>
>>> I tried:
>>> * the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
>>> * using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>
>>> but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!
>>>
>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Lian,
>>>>
>>>> I think we should improve the docs for catalogs since it isn't clear. We have a few configuration pages that are helpful, but it looks like they assume you know what your options are already. Take a look at the Spark docs for catalogs, which is the closest we have right now:
>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>
>>>> What you'll want to do is to configure a catalog like the first example:
>>>>
>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>> spark.sql.catalog.hive_prod.type = hive
>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>
>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark and not Iceberg, since it is just a different way to build the same underlying Spark plan.
>>>>
>>>> To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Eduard and Ryan.
>>>>>
>>>>> I use Spark on a K8S cluster to write Parquet on S3 and then add an external table in the Hive metastore for that Parquet data. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration on https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>
>>>>> Most of the examples use Spark SQL to write/read Iceberg. For example, there is no "SQL MERGE INTO"-like support in the Spark API. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind it? I personally feel the Spark API is more dev friendly and scalable. Thanks very much!
>>>>>
>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> Lian,
>>>>>>
>>>>>> Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.
>>>>>>
>>>>>> The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue, and for JDBC.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>
>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an *S3FileIO* class, which is an implementation that is backed by S3.
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering if it is possible to create an Iceberg table on S3.
This >>>>>>>> guide seems to say only write to a hive table (backed up by HDFS if I >>>>>>>> understand correctly). Hudi and Delta can write to s3 with a specified >>>>>>>> S3 >>>>>>>> path. How can I do it using iceberg? Thanks for any clarification. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Create your own email signature >>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592> >>>>> >>>> >>>> >>>> -- >>>> Ryan Blue >>>> Tabular >>>> >>> >>> >>> -- >>> >>> Create your own email signature >>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592> >>> >> > > -- > > Create your own email signature > <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592> > > > -- Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>