Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is a storage handler set up for it.
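
For example (the table name is taken from your snippet below):

hive> DESCRIBE FORMATTED mydb.mytable3;

If the Hive integration is set up, the table parameters in the output should
include a storage_handler entry pointing at
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.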

On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:

> Thanks guys. tableProperty("location", ...) works.
>
> I am now having trouble getting Hive to query an Iceberg table by following
> https://iceberg.apache.org/hive/.
>
> I have done:
> * in the Hive shell, `add jar /path/to/iceberg-hive-runtime.jar;`
> * in hive-site.xml, added hive.vectorized.execution.enabled=false and
>   iceberg.engine.hive.enabled=true.
> The same hive-site.xml is used by both the Hive server and Spark.
>
> This is my code:
>
> val table = "hive_test.mydb.mytable3"
> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
> df.writeTo(table)
>   .tableProperty("write.format.default", "parquet")
>   .tableProperty("location", filePath)
>   .createOrReplace()
>
> The Iceberg files are created in the specified location, and the table can
> be queried in Spark SQL:
>
> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
> Found 2 items
> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata
>
> The Hive table is created but cannot be queried:
>
> hive> select * from mytable3;
> FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice
>
> I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very
> close to having a happy path for migrating parquet to Iceberg. Thanks.
>
> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:
>
>> The problem for #3 is how Spark handles the options. The option method
>> sets write options, not table properties, and write options aren't passed
>> when creating the table. Instead, you should use tableProperty("location", ...).
>>
>> Ryan
>>
>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>>> integration from Iceberg, so you shouldn't see the table in Hive unless
>>> you have configured that. See https://iceberg.apache.org/hive/.
>>>
>>> 3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>> I would check what properties are set on the table to see why the
>>> location wasn't set, but "location" is the correct way of setting it,
>>> unless the property is being ignored by Spark. I'm assuming you are using
>>> the latest Spark build possible; there is a bug in Spark 3.0 that
>>> sometimes ignores options passed to the V2 API,
>>> https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.
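>>>
>>> For Hive-catalog tables the storage handler should be registered for you
>>> when iceberg.engine.hive.enabled is true. If the table is tracked outside
>>> the Hive catalog, the Hive docs show creating an external table overlay
>>> by hand. A rough sketch (the jar path and location are placeholders taken
>>> from your earlier mail):
>>>
>>> ADD JAR /path/to/iceberg-hive-runtime.jar;
>>> CREATE EXTERNAL TABLE mytable3
>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>> LOCATION 'hdfs://namenode:8020/tmp/test.ice';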
>>>
>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>> Any help is highly appreciated!
>>>
>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>>> Thanks Russell.
>>>>
>>>> I tried:
>>>>
>>>> /spark/bin/spark-shell --packages
>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> val values = List(1, 2, 3, 4, 5)
>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>> import spark.implicits._
>>>> val df = values.toDF()
>>>>
>>>> val table = "hive_test.mydb.mytable3"
>>>> df.writeTo(table)
>>>>   .tableProperty("write.format.default", "parquet")
>>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>   .createOrReplace()
>>>>
>>>> spark.table(table).show()
>>>>
>>>> Observations:
>>>>
>>>> 1. spark.table(table).show() does show the table correctly:
>>>>
>>>> +-----+
>>>> |value|
>>>> +-----+
>>>> |    1|
>>>> |    2|
>>>> |    3|
>>>> |    4|
>>>> |    5|
>>>> +-----+
>>>>
>>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>>>
>>>> hive> select * from mytable3;
>>>> OK
>>>> Time taken: 0.158 seconds
>>>>
>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>
>>>> Any ideas about 2 and 3? Thanks very much.
>>>>
>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Specify a property of "location" when creating the table. Just add
>>>>> .option("location", "path").
>>>>>
>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>> Thanks Russell. This helps a lot.
>>>>>
>>>>> I want to specify an HDFS location when creating an Iceberg dataset
>>>>> using the DataFrame API. All the examples that use a warehouse location
>>>>> are SQL. Do you have an example for the DataFrame API? For example, how
>>>>> would I set an HDFS/S3 location in the query below? The reason I ask is
>>>>> that my current code all uses the Spark API, so it will be much easier
>>>>> if I can keep using it when migrating parquet to Iceberg. Hope it makes
>>>>> sense.
>>>>>
>>>>> data.writeTo("prod.db.table")
>>>>>   .tableProperty("write.format.default", "orc")
>>>>>   .partitionedBy($"level", days($"ts"))
>>>>>   .createOrReplace()
>>>>>
>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>> reference it you need to either "use hive_prod" or refer to the table
>>>>>> with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
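>>>>>>
>>>>>> That is, roughly (a sketch using the table from your example):
>>>>>>
>>>>>> USE hive_prod;
>>>>>> CREATE TABLE default.mytable (uuid string) USING iceberg;
>>>>>>
>>>>>> -- or, fully qualified, without switching catalogs:
>>>>>> CREATE TABLE hive_prod.default.mytable (uuid string) USING iceberg;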
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Ryan.
>>>>>>>
>>>>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>>>>>
>>>>>>> spark-shell --conf
>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
>>>>>>> --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>>
>>>>>>> this statement:
>>>>>>>
>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>
>>>>>>> caused the warning:
>>>>>>>
>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>>>> data source provider iceberg.
>>>>>>>
>>>>>>> I tried:
>>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>>>   iceberg-spark3-runtime.jar into spark/jars) mentioned in
>>>>>>>   https://github.com/apache/iceberg/issues/2260
>>>>>>> * using --packages
>>>>>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>
>>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>>> this table. Any ideas are appreciated!
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Lian,
>>>>>>>>
>>>>>>>> I think we should improve the docs for catalogs since this isn't
>>>>>>>> clear. We have a few configuration pages that are helpful, but it
>>>>>>>> looks like they assume you already know what your options are. Take
>>>>>>>> a look at the Spark docs for catalogs, which are the closest we have
>>>>>>>> right now:
>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>
>>>>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>>>>
>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>
>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>>>>> why it can only be used through SQL. This is something that should
>>>>>>>> probably be added to Spark, not Iceberg, since it is just a different
>>>>>>>> way to build the same underlying Spark plan.
>>>>>>>>
>>>>>>>> To your question about DataFrames vs SQL, I highly recommend SQL
>>>>>>>> over DataFrames so that you don't end up needing to use jars produced
>>>>>>>> by compiling Scala code. I think it's easier to just use SQL. But
>>>>>>>> Iceberg should support both, because DataFrames are useful for
>>>>>>>> customization in some cases. It really should be up to you and what
>>>>>>>> you want to use, as far as Iceberg is concerned.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>
>>>>>>>>> I use Spark on a K8S cluster to write parquet to S3 and then add an
>>>>>>>>> external table in the Hive metastore for that parquet. In the
>>>>>>>>> future, when using Iceberg, I would prefer the Hive metastore since
>>>>>>>>> it is my centralized metastore for batch and streaming datasets. I
>>>>>>>>> don't see that the Hive metastore is supported in the Iceberg AWS
>>>>>>>>> integration on https://iceberg.apache.org/aws/. Is there another
>>>>>>>>> link for that?
>>>>>>>>>
>>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For
>>>>>>>>> example, there is no "SQL MERGE INTO"-like support in the Spark API.
>>>>>>>>> Is Spark SQL preferred over the Spark DataFrame/Dataset API in
>>>>>>>>> Iceberg? If so, could you clarify the rationale behind that? I
>>>>>>>>> personally find the Spark API more dev-friendly and scalable.
>>>>>>>>> Thanks very much!
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Lian,
>>>>>>>>>>
>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just pass
>>>>>>>>>> the `LOCATION` clause with an S3 path, or set your catalog's
>>>>>>>>>> warehouse location to S3 so tables are automatically created there.
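>>>>>>>>>>
>>>>>>>>>> For example (a sketch; the catalog name and bucket are placeholders):
>>>>>>>>>>
>>>>>>>>>> CREATE TABLE hive_prod.db.sample (id bigint, data string)
>>>>>>>>>> USING iceberg
>>>>>>>>>> LOCATION 's3://my-bucket/warehouse/db/sample';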
>>>>>>>>>>
>>>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>>>> the table metadata location, because S3 doesn't have a way to
>>>>>>>>>> implement a metadata commit. For a metastore, there are
>>>>>>>>>> implementations backed by the Hive MetaStore, Glue/DynamoDB, and
>>>>>>>>>> Nessie, and the upcoming release adds support for DynamoDB without
>>>>>>>>>> Glue, and for JDBC.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>>>> should contain all the info that you need. The codebase contains
>>>>>>>>>>> an S3FileIO class, which is an implementation backed by S3.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>>>>>>> and wondering if it is possible to create an Iceberg table on S3.
>>>>>>>>>>>> This guide seems to only cover writing to a Hive table (backed by
>>>>>>>>>>>> HDFS, if I understand correctly). Hudi and Delta can write to S3
>>>>>>>>>>>> with a specified S3 path. How can I do it using Iceberg? Thanks
>>>>>>>>>>>> for any clarification.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular