Any help is highly appreciated!

On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:

> Thanks Russell.
>
> I tried:
>
> /spark/bin/spark-shell \
>   --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
>   --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>   --conf spark.sql.catalog.hive_test.type=hive
>
> import org.apache.spark.sql.SparkSession
>
> val values = List(1, 2, 3, 4, 5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val table = "hive_test.mydb.mytable3"
> df.writeTo(table)
>   .tableProperty("write.format.default", "parquet")
>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>   .createOrReplace()
>
> spark.table(table).show()
>
> Observations:
>
> 1. spark.table(table).show() does show the table correctly:
>
>    +-----+
>    |value|
>    +-----+
>    |    1|
>    |    2|
>    |    3|
>    |    4|
>    |    5|
>    +-----+
>
> 2. mydb.mytable3 is created in Hive but it is empty:
>
>    hive> select * from mytable3;
>    OK
>    Time taken: 0.158 seconds
>
> 3. test.ice is not generated in the HDFS folder /tmp.
>
> Any idea about 2 and 3? Thanks very much.
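A quick way to check observation 3 from the same shell is to list the target directory with the Hadoop FileSystem API already on the Spark classpath. A minimal sketch; the namenode URI is the one from the example above, and note that Iceberg treats "location" as the table's root directory, so any output would appear under subdirectories such as metadata/ rather than as a single test.ice file:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // List everything under /tmp on the cluster's HDFS to see what,
    // if anything, the write created there.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path("/tmp")).foreach(s => println(s.getPath))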
> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> Specify a property of "location" when creating the table. Just add a
>> .option("location", "path")
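The same idea in SQL form, for comparison: a minimal sketch assuming the hive_test catalog configured above (the path and table name are illustrative, and Iceberg uses the location as a table directory, not a single file):

    // Sketch only: SQL equivalent of the DataFrame write, with an explicit
    // table location. Iceberg writes metadata/ and data/ under this path.
    spark.sql("""
      CREATE TABLE hive_test.mydb.mytable3 (value int)
      USING iceberg
      LOCATION 'hdfs://namenode:8020/tmp/mytable3'
      TBLPROPERTIES ('write.format.default' = 'parquet')
    """)
    spark.sql("INSERT INTO hive_test.mydb.mytable3 VALUES (1), (2), (3), (4), (5)")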
>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>> Thanks Russell. This helps a lot.
>>
>> I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API? For example, how do I specify an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API; it will be much easier if I can keep using the Spark API when migrating Parquet to Iceberg. Hope it makes sense.
>>
>> data.writeTo("prod.db.table")
>>   .tableProperty("write.format.default", "orc")
>>   .partitionBy($"level", days($"ts"))
>>   .createOrReplace()
>>
>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
>>>
>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>>
>>>> spark-shell \
>>>>   --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
>>>>   --conf spark.sql.catalog.hive_prod.type=hive
>>>>
>>>> this statement:
>>>>
>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>
>>>> caused the warning:
>>>>
>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>>>>
>>>> I tried:
>>>>
>>>> * the solution mentioned in https://github.com/apache/iceberg/issues/2260 (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars)
>>>> * using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>
>>>> but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!
>>>>
>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Lian,
>>>>>
>>>>> I think we should improve the docs for catalogs since it isn't clear. We have a few configuration pages that are helpful, but it looks like they assume you already know what your options are. Take a look at the Spark docs for catalogs, which is the closest we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>
>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>
>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>
>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark and not Iceberg, since it is just a different way to build the same underlying Spark plan.
>>>>>
>>>>> To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
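For reference, a minimal sketch of the SQL-only MERGE INTO that Ryan mentions, run against the hive_prod catalog configured above (table and column names are illustrative):

    // Sketch only: MERGE INTO is available through Spark SQL against
    // Iceberg tables; there is no DataFrame equivalent in Spark.
    spark.sql("""
      MERGE INTO hive_prod.db.target t
      USING hive_prod.db.updates u
      ON t.id = u.id
      WHEN MATCHED THEN UPDATE SET t.data = u.data
      WHEN NOT MATCHED THEN INSERT *
    """)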
>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Eduard and Ryan.
>>>>>>
>>>>>> I use Spark on a K8S cluster to write Parquet on S3 and then add an external table in the Hive metastore for this Parquet. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration on https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>
>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For example, there is no "SQL MERGE INTO"-like support in the Spark API. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind that? I personally feel the Spark API is more dev friendly and scalable. Thanks very much!
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Lian,
>>>>>>>
>>>>>>> Iceberg tables work great in S3. When creating the table, just pass the LOCATION clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.
>>>>>>>
>>>>>>> The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue, and for JDBC.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>
>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an S3FileIO class, which is a FileIO implementation backed by S3.
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering if it is possible to create an Iceberg table on S3. This guide seems to describe writing only to a Hive table (backed by HDFS, if I understand correctly). Hudi and Delta can write to S3 with a specified S3 path. How can I do it using Iceberg? Thanks for any clarification.
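Putting Ryan's and Eduard's pointers together, a minimal sketch of creating an Iceberg table at an S3 location through a Hive catalog with S3FileIO. The bucket and table names are illustrative, and the iceberg-aws classes plus the AWS SDK must be on the classpath (see https://iceberg.apache.org/aws/ for the exact dependencies):

    /spark/bin/spark-shell \
      --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
      --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.hive_prod.type=hive \
      --conf spark.sql.catalog.hive_prod.io-impl=org.apache.iceberg.aws.s3.S3FileIO

    // Sketch only: the LOCATION clause points the table root at S3;
    // Iceberg writes metadata/ and data/ under it, and the Hive
    // metastore tracks the current metadata pointer.
    spark.sql("""
      CREATE TABLE hive_prod.db.s3_table (id bigint, data string)
      USING iceberg
      LOCATION 's3://my-bucket/warehouse/db/s3_table'
    """)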