`SET iceberg.mr.catalog=hive` works!!! Thanks Ryan, you rock!!! You may consider adding the notes below to the Iceberg documentation to help other newcomers:
* Add `SET iceberg.mr.catalog=hive` to https://iceberg.apache.org/hive/.
* Add `.tableProperty("location", filePath)` to https://iceberg.apache.org/spark-writes/.

On Wed, Aug 11, 2021 at 3:56 PM Ryan Blue <b...@tabular.io> wrote:

Looks like the table is set up correctly. I think the problem might be how Hive is configured. I think by default it will try to load tables by location in 0.11.1. You need to tell it to load tables as metastore tables, not HDFS tables, by running `SET iceberg.mr.catalog=hive`.

On Wed, Aug 11, 2021 at 3:51 PM Lian Jiang <jiangok2...@gmail.com> wrote:

hive> describe formatted mytable3;
OK
# col_name              data_type               comment
value                   int

# Detailed Table Information
Database:               mydb
OwnerType:              USER
Owner:                  root
CreateTime:             Wed Aug 11 20:02:14 UTC 2021
LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
Retention:              2147483647
Location:               hdfs://namenode:8020/tmp/test3.ice
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        metadata_location       hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
        numFiles                8
        storage_handler         org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
        table_type              ICEBERG
        totalSize               12577
        transient_lastDdlTime   1628712134

# Storage Information
SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat
OutputFormat:           org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.319 seconds, Fetched: 29 row(s)

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

Thanks!

On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <b...@tabular.io> wrote:

Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is a storage handler set up for it.

On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks guys. tableProperty("location", ...) works.

I have trouble making Hive query an Iceberg table by following https://iceberg.apache.org/hive/.

I have done:
* in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
* in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
The same hive-site.xml is used by both the Hive server and Spark.

This is my code:

    val table = "hive_test.mydb.mytable3"
    val filePath = "hdfs://namenode:8020/tmp/test3.ice"
    df.writeTo(table)
      .tableProperty("write.format.default", "parquet")
      .tableProperty("location", filePath)
      .createOrReplace()

The Iceberg table is created at the specified location and can be queried in Spark SQL:

    root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
    Found 2 items
    drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
    drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata

The Hive table is created but cannot be queried:

    hive> select * from mytable3;
    FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very close to having a happy path for migrating parquet to Iceberg. Thanks.
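To summarize the working Hive-side recipe from this thread: a minimal sketch of the session that succeeds once `iceberg.mr.catalog` is set (the jar path is a placeholder, and hive-site.xml is assumed to already carry the two properties listed above):

    hive> add jar /path/to/iceberg-hive-runtime.jar;
    hive> SET iceberg.mr.catalog=hive;
    hive> select * from mydb.mytable3;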
On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:

The problem for #3 is how Spark handles the options. The option method sets write options, not table properties, and the write options aren't passed when creating the table. Instead, you should use tableProperty("location", ...).

Ryan

On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

2) Hive cannot read Iceberg tables without configuring the MR Hive integration from Iceberg, so you shouldn't see the data in Hive unless you have configured that; see https://iceberg.apache.org/hive/.

3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
I would check what properties are set in the table to see why that wasn't set, but "location" would be the correct way of setting the table location, unless the property is being ignored by Spark. I'm assuming you are using the latest possible build of Spark; there is a bug in Spark 3.0 which sometimes ignores options passed to the V2 API, https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.

On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:

Any help is highly appreciated!

On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Russell.

I tried:

    /spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
      --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.hive_test.type=hive

    import org.apache.spark.sql.SparkSession
    val values = List(1,2,3,4,5)

    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._
    val df = values.toDF()

    val table = "hive_test.mydb.mytable3"
    df.writeTo(table)
      .tableProperty("write.format.default", "parquet")
      *.option("location", "hdfs://namenode:8020/tmp/test.ice")*
      .createOrReplace()

    spark.table(table).show()

Observations:

1. spark.table(table).show() does show the table correctly:

    +-----+
    |value|
    +-----+
    |    1|
    |    2|
    |    3|
    |    4|
    |    5|
    +-----+

2. mydb.mytable3 is created in Hive but it is empty:

    hive> select * from mytable3;
    OK
    Time taken: 0.158 seconds

3. test.ice is not generated in the HDFS folder /tmp.

Any idea about 2 and 3? Thanks very much.

On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

Specify a property of "location" when creating the table. Just add a ".option("location", "path")".

On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Russell. This helps a lot.

I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API? For example, how do I support an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API; it will be much easier if I can keep using the Spark API when migrating parquet to Iceberg. Hope it makes sense.

    data.writeTo("prod.db.table")
        .tableProperty("write.format.default", "orc")
        .partitionedBy($"level", days($"ts"))
        .createOrReplace()
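For completeness, here is what the DataFrame-API answer to this question ends up looking like once the resolution from the top of the thread (tableProperty, not option) is applied; a sketch reusing the docs example above, with the imports it needs (assumes a `spark` session and a `data` DataFrame are in scope):

    import org.apache.spark.sql.functions.days
    import spark.implicits._ // for the $"..." column syntax

    data.writeTo("prod.db.table")
      .tableProperty("write.format.default", "orc")
      // a table property, not a write option, so it is applied at table creation
      .tableProperty("location", "hdfs://namenode:8020/tmp/test.ice")
      .partitionedBy($"level", days($"ts"))
      .createOrReplace()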
On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".

On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Ryan.

Using this command (uri is omitted because the URI is in hive-site.xml):

    spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.hive_prod.type=hive

this statement:

    spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")

caused the warning:

    WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.

I tried:
* the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
* using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1

but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!

On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:

Lian,

I think we should improve the docs for catalogs since this isn't clear. We have a few configuration pages that are helpful, but it looks like they assume you already know what your options are. Take a look at the Spark docs for catalogs, which is the closest we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration

What you'll want to do is to configure a catalog like the first example:

    spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.hive_prod.type = hive
    spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
    # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml

For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark, not Iceberg, since it is just a different way to build the same underlying Spark plan.

To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.

Ryan
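If it helps to set those catalog properties in code rather than on the spark-shell command line, a minimal sketch using the same keys as Ryan's example (the thrift URI is a placeholder; as noted above, it can be omitted to fall back to hive.metastore.uris from hive-site.xml):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local")
      .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hive_prod.type", "hive")
      // placeholder URI; drop this line to use hive.metastore.uris from hive-site.xml
      .config("spark.sql.catalog.hive_prod.uri", "thrift://metastore-host:port")
      .getOrCreate()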
On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Eduard and Ryan.

I use Spark on a K8S cluster to write parquet on S3 and then add an external table in the Hive metastore for this parquet. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration on https://iceberg.apache.org/aws/. Is there another link for that?

Most of the examples use Spark SQL to write/read Iceberg. For example, there is no "SQL MERGE INTO"-like support in the Spark API. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind that? I personally feel the Spark API is more dev friendly and scalable. Thanks very much!

On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:

Lian,

Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.

The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue, and JDBC.

Ryan

On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:

Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an S3FileIO class, which is an implementation that is backed by S3.

On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:

I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering if it is possible to create an Iceberg table on S3. This guide seems to only write to a Hive table (backed by HDFS, if I understand correctly). Hudi and Delta can write to S3 with a specified S3 path. How can I do it using Iceberg? Thanks for any clarification.
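Closing the loop on this original question with the answer from above: a minimal sketch of creating an Iceberg table at an explicit S3 path (the bucket name is a placeholder, and a Hive-backed catalog such as hive_prod is assumed to be configured as shown earlier, since S3 still needs a metastore for the metadata commit):

    // Explicit S3 location via the LOCATION clause (Ryan's first option).
    spark.sql("""
      CREATE TABLE hive_prod.default.mytable (uuid string)
      USING iceberg
      LOCATION 's3://my-bucket/warehouse/mytable'
    """)

    // Alternatively (the second option), point the catalog's warehouse at S3
    // so new tables land there automatically:
    //   --conf spark.sql.catalog.hive_prod.warehouse=s3://my-bucket/warehouse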