Thanks guys. tableProperty("location", ...) works. Now I am having trouble getting Hive to query an Iceberg table by following https://iceberg.apache.org/hive/.

I have done:
* in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
* in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true. The same hive-site.xml is used by both the Hive server and Spark.

This is my code:

val table = "hive_test.mydb.mytable3"
val filePath = "hdfs://namenode:8020/tmp/test3.ice"
df.writeTo(table)
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", filePath)
  .createOrReplace()

The Iceberg table is created in the specified location and can be queried in Spark SQL:

root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
Found 2 items
drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata

The Hive table is created but cannot be queried:

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very close to having a happy path for migrating Parquet to Iceberg. Thanks.
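One thing worth double-checking, per the Hive integration docs linked above: Iceberg's Hive support can also be enabled per table rather than only through the global iceberg.engine.hive.enabled flag in hive-site.xml. A minimal sketch, reusing the df, table, and location from this message; this is a hedged suggestion to try, not a confirmed fix:

// sketch only: per the Iceberg Hive docs, engine.hive.enabled can be set as a
// table property, which is intended to make the metastore entry readable by
// Hive even if the global flag is not picked up by the Spark process that
// creates the table
df.writeTo("hive_test.mydb.mytable3")
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", "hdfs://namenode:8020/tmp/test3.ice")
  .tableProperty("engine.hive.enabled", "true")
  .createOrReplace()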
On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:

> The problem for #3 is how Spark handles the options. The option method
> sets write options, not table properties, and the write options aren't passed
> when creating the table. Instead, you should use tableProperty("location", ...).
>
> Ryan
>
> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>> integration from Iceberg, so you shouldn't see the data in Hive unless you
>> have configured that; see https://iceberg.apache.org/hive/.
>>
>> 3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>> I would check what properties are set on the table to see why the location
>> wasn't set. "location" would be the correct way of setting it, unless the
>> property is being ignored by Spark. I'm assuming you are using the latest
>> possible build of Spark: there is a bug in Spark 3.0 that sometimes ignores
>> options passed to the V2 API (https://issues.apache.org/jira/browse/SPARK-32592),
>> which is fixed in 3.1.
>>
>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>> Any help is highly appreciated!
>>
>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks Russell.
>>>
>>> I tried:
>>>
>>> /spark/bin/spark-shell --packages
>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>   --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>   --conf spark.sql.catalog.hive_test.type=hive
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1, 2, 3, 4, 5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val table = "hive_test.mydb.mytable3"
>>> df.writeTo(table)
>>>   .tableProperty("write.format.default", "parquet")
>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>   .createOrReplace()
>>>
>>> spark.table(table).show()
>>>
>>> Observations:
>>>
>>> 1. spark.table(table).show() does show the table correctly:
>>> +-----+
>>> |value|
>>> +-----+
>>> |    1|
>>> |    2|
>>> |    3|
>>> |    4|
>>> |    5|
>>> +-----+
>>>
>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>> hive> select * from mytable3;
>>> OK
>>> Time taken: 0.158 seconds
>>>
>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>
>>> Any idea about 2 and 3? Thanks very much.
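As Russell suggests above, checking which properties actually landed on the table can help narrow down observations 2 and 3. A minimal sketch of one way to do that from the same spark-shell session; the table name is the one used in this thread, and how much these standard Spark SQL commands report for an Iceberg (v2) table depends on the Spark version, so treat this as a suggestion only:

// see whether "location" was recorded as a table property or silently dropped
spark.sql("SHOW TBLPROPERTIES hive_test.mydb.mytable3").show(truncate = false)
// the extended description also reports the table's resolved location and provider
spark.sql("DESCRIBE TABLE EXTENDED hive_test.mydb.mytable3").show(truncate = false)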
>>>
>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> Specify a property of "location" when creating the table. Just add
>>>> .option("location", "path").
>>>>
>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>> Thanks Russell. This helps a lot.
>>>>
>>>> I want to specify an HDFS location when creating an Iceberg dataset
>>>> using the DataFrame API. All of the examples that use a warehouse location
>>>> are SQL. Do you have an example for the DataFrame API? For example, how do
>>>> I specify an HDFS/S3 location in the query below? The reason I ask is that
>>>> my current code all uses the Spark API; it will be much easier if I can
>>>> keep using the Spark API when migrating Parquet to Iceberg. Hope it makes
>>>> sense.
>>>>
>>>> data.writeTo("prod.db.table")
>>>>   .tableProperty("write.format.default", "orc")
>>>>   .partitionedBy($"level", days($"ts"))
>>>>   .createOrReplace()
>>>>
>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>> reference it you need to either "use hive_prod" or refer to the table with
>>>>> the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
>>>>>
>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Ryan.
>>>>>>
>>>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>>>>
>>>>>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
>>>>>>   --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>
>>>>>> this statement:
>>>>>>
>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>
>>>>>> caused the warning:
>>>>>>
>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>>> data source provider iceberg.
>>>>>>
>>>>>> I tried:
>>>>>> * the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar
>>>>>>   into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
>>>>>> * using --packages
>>>>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>
>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>> this table. Any ideas are appreciated!
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Lian,
>>>>>>>
>>>>>>> I think we should improve the docs for catalogs since this isn't
>>>>>>> clear. We have a few configuration pages that are helpful, but they assume
>>>>>>> you already know what your options are. Take a look at the Spark docs for
>>>>>>> catalogs, which are the closest we have right now:
>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>
>>>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>>>
>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>
>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>>>> why it can only be used through SQL. This is something that should probably
>>>>>>> be added to Spark rather than Iceberg, since it is just a different way to
>>>>>>> build the same underlying Spark plan.
>>>>>>>
>>>>>>> To your question about DataFrames vs SQL, I highly recommend SQL over
>>>>>>> DataFrames so that you don't end up needing to use jars produced by
>>>>>>> compiling Scala code. I think it's easier to just use SQL. But Iceberg
>>>>>>> should support both, because DataFrames are useful for customization in
>>>>>>> some cases. It really should be up to you and what you want to use, as far
>>>>>>> as Iceberg is concerned.
>>>>>>>
>>>>>>> Ryan
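A sketch of the same catalog settings from the example above, wired through a SparkSession builder instead of spark-shell --conf flags, which may be convenient for the compiled Spark API code discussed in this thread. The catalog name hive_prod matches Ryan's example; the metastore host and port are placeholders, and as noted above the uri setting can be dropped to fall back to hive.metastore.uris from hive-site.xml:

import org.apache.spark.sql.SparkSession

// hive_prod is an Iceberg SparkCatalog backed by the Hive Metastore
val spark = SparkSession.builder()
  .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hive_prod.type", "hive")
  .config("spark.sql.catalog.hive_prod.uri", "thrift://metastore-host:9083") // placeholder host:port
  .getOrCreate()

// reference the table through the catalog name, as Russell points out above
spark.sql("CREATE TABLE hive_prod.default.mytable (uuid string) USING iceberg")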
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>
>>>>>>>> I use Spark on a K8S cluster to write Parquet to S3 and then add an
>>>>>>>> external table in the Hive metastore for that Parquet data. In the future,
>>>>>>>> when using Iceberg, I would prefer the Hive metastore since it is my
>>>>>>>> centralized metastore for batch and streaming datasets. I don't see that
>>>>>>>> the Hive metastore is supported in the Iceberg AWS integration on
>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>
>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For
>>>>>>>> example, there is no "MERGE INTO"-like support in the Spark API. Is Spark
>>>>>>>> SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could
>>>>>>>> you clarify the rationale? I personally feel the Spark API is more
>>>>>>>> dev-friendly and scalable. Thanks very much!
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Lian,
>>>>>>>>>
>>>>>>>>> Iceberg tables work great in S3. When creating the table, just
>>>>>>>>> pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse
>>>>>>>>> location to S3 so tables are automatically created there.
>>>>>>>>>
>>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>>> the table metadata location, because S3 doesn't have a way to implement a
>>>>>>>>> metadata commit. For a metastore, there are implementations backed by the
>>>>>>>>> Hive MetaStore, Glue/DynamoDB, and Nessie, and the upcoming release adds
>>>>>>>>> support for DynamoDB without Glue, as well as JDBC.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>>>
>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>>> should contain all the info that you need. The codebase contains an
>>>>>>>>>> S3FileIO class, which is an implementation that is backed by S3.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>>>>>> and wondering whether it is possible to create an Iceberg table on S3. The
>>>>>>>>>>> guide seems to say you can only write to a Hive table (backed by HDFS, if
>>>>>>>>>>> I understand correctly). Hudi and Delta can write to S3 with a specified
>>>>>>>>>>> S3 path. How can I do that with Iceberg? Thanks for any clarification.
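Tying the thread together: once a metastore-backed catalog is configured, the same DataFrame pattern from the top of the thread should accept an S3 path for the location property. A short sketch under assumptions; the catalog, database, table, and bucket names are placeholders, and the S3 credentials/S3FileIO setup described at https://iceberg.apache.org/aws/ is assumed to be in place:

// same pattern as the HDFS example above, but pointing the table at S3
df.writeTo("hive_prod.mydb.mytable_s3")  // placeholder catalog.db.table
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", "s3://my-bucket/warehouse/mydb/mytable_s3")  // placeholder bucket/path
  .createOrReplace()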