Thanks Russell. I tried:
/spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_test.type=hive

import org.apache.spark.sql.SparkSession

val values = List(1, 2, 3, 4, 5)
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()
val table = "hive_test.mydb.mytable3"
df.writeTo(table)
  .tableProperty("write.format.default", "parquet")
  .option("location", "hdfs://namenode:8020/tmp/test.ice")
  .createOrReplace()
spark.table(table).show()

*Observations*:

1. spark.table(table).show() does show the table correctly:

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+

2. mydb.mytable3 is created in Hive, but it is empty:

hive> select * from mytable3;
OK
Time taken: 0.158 seconds

3. test.ice is not generated in the HDFS folder /tmp.

Any idea about 2 and 3? Thanks very much.
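(One way to dig into 2 and 3 is to ask the table itself where its files went. The snippet below is only a sketch against the names used above; it relies on Iceberg's queryable metadata tables (files, snapshots), which can be selected from through the same catalog.)

// Sketch: same spark-shell session and hive_test catalog as above.
// Iceberg exposes metadata tables that can be queried like normal tables.

// Data files and their full paths (shows where the Parquet files actually live):
spark.sql("SELECT file_path, record_count FROM hive_test.mydb.mytable3.files").show(false)

// Snapshots and the manifest-list files they point to:
spark.sql("SELECT snapshot_id, manifest_list FROM hive_test.mydb.mytable3.snapshots").show(false)

// Spark's view of the table, including the properties it reports:
spark.sql("DESCRIBE TABLE EXTENDED hive_test.mydb.mytable3").show(false)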
On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Specify a property of "location" when creating the table. Just add a ".option("location", "path")".
>
> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>
> Thanks Russell. This helps a lot.
>
> I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API? For example, how do I supply an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API. It will be much easier if I can use the Spark API when migrating from Parquet to Iceberg. Hope it makes sense.
>
> data.writeTo("prod.db.table")
>     .tableProperty("write.format.default", "orc")
>     .partitionedBy($"level", days($"ts"))
>     .createOrReplace()
>
> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
>>
>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks Ryan.
>>>
>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_prod.type=hive
>>>
>>> This statement:
>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>
>>> caused the warning:
>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>>>
>>> I tried:
>>> * the solution (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
>>> * using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>
>>> but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!
>>>
>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Lian,
>>>>
>>>> I think we should improve the docs for catalogs since it isn't clear. We have a few configuration pages that are helpful, but it looks like they assume you know what your options are already. Take a look at the Spark docs for catalogs, which is the closest we have right now:
>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>
>>>> What you'll want to do is to configure a catalog like the first example:
>>>>
>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>> spark.sql.catalog.hive_prod.type = hive
>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>
>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark and not Iceberg, since it is just a different way to build the same underlying Spark plan.
>>>>
>>>> To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Eduard and Ryan.
>>>>>
>>>>> I use Spark on a K8S cluster to write Parquet on S3 and then add an external table in the Hive metastore for that Parquet data. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration on https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>
>>>>> Most of the examples use Spark SQL to write/read Iceberg. For example, there is no "SQL MERGE INTO"-like support in the Spark API. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind it? I personally feel the Spark API is more dev friendly and scalable. Thanks very much!
>>>>>
>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> Lian,
>>>>>>
>>>>>> Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.
>>>>>>
>>>>>> The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue, and for JDBC.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>
>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an *S3FileIO* class, which is an implementation that is backed by S3.
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering if it is possible to create an Iceberg table on S3.
This >>>>>>>> guide seems to say only write to a hive table (backed up by HDFS if I >>>>>>>> understand correctly). Hudi and Delta can write to s3 with a specified >>>>>>>> S3 >>>>>>>> path. How can I do it using iceberg? Thanks for any clarification. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> -- >>>>>> Ryan Blue >>>>>> Tabular >>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Create your own email signature >>>>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592> >>>>> >>>> >>>> >>>> -- >>>> Ryan Blue >>>> Tabular >>>> >>> >>> >>> -- >>> >>> Create your own email signature >>> <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592> >>> >> > > -- > > Create your own email signature > <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592> > > > -- Create your own email signature <https://www.wisestamp.com/signature-in-email/?utm_source=promotion&utm_medium=signature&utm_campaign=create_your_own&srcid=5234462839406592>