Any help is highly appreciated!

On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:

> Thanks Russell.
>
> I tried:
>
> /spark/bin/spark-shell \
>   --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
>   --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>   --conf spark.sql.catalog.hive_test.type=hive
>
> import org.apache.spark.sql.SparkSession
>
> val values = List(1, 2, 3, 4, 5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val table = "hive_test.mydb.mytable3"
> df.writeTo(table)
>   .tableProperty("write.format.default", "parquet")
>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>   .createOrReplace()
>
> spark.table(table).show()
>
> Observations:
>
> 1. spark.table(table).show() does show the table correctly:
>
>    +-----+
>    |value|
>    +-----+
>    |    1|
>    |    2|
>    |    3|
>    |    4|
>    |    5|
>    +-----+
>
> 2. mydb.mytable3 is created in Hive but it is empty:
>
>    hive> select * from mytable3;
>    OK
>    Time taken: 0.158 seconds
>
> 3. test.ice is not generated in the HDFS folder /tmp.
>
> Any idea about 2 and 3? Thanks very much.
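A quick way to check observation 3 from the same shell is to list the target directory with the Hadoop FileSystem API already on the Spark classpath. A minimal sketch; the namenode URI is the one from the example above, and note that Iceberg treats "location" as the table's root directory, so any output would appear under subdirectories such as metadata/ rather than as a single test.ice file:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // List everything under /tmp on the cluster's HDFS to see what,
    // if anything, the write created there.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path("/tmp")).foreach(s => println(s.getPath))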
> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> Specify a property of "location" when creating the table. Just add a
>> .option("location", "path")
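The same idea in SQL form, for comparison: a minimal sketch assuming the hive_test catalog configured above (the path and table name are illustrative, and Iceberg uses the location as a table directory, not a single file):

    // Sketch only: SQL equivalent of the DataFrame write, with an explicit
    // table location. Iceberg writes metadata/ and data/ under this path.
    spark.sql("""
      CREATE TABLE hive_test.mydb.mytable3 (value int)
      USING iceberg
      LOCATION 'hdfs://namenode:8020/tmp/mytable3'
      TBLPROPERTIES ('write.format.default' = 'parquet')
    """)
    spark.sql("INSERT INTO hive_test.mydb.mytable3 VALUES (1), (2), (3), (4), (5)")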
>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>> Thanks Russell. This helps a lot.
>>
>> I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API? For example, how do I specify an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API; it will be much easier if I can keep using the Spark API when migrating Parquet to Iceberg. Hope it makes sense.
>>
>> data.writeTo("prod.db.table")
>>   .tableProperty("write.format.default", "orc")
>>   .partitionBy($"level", days($"ts"))
>>   .createOrReplace()
>>
>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "use hive_prod" or refer to the table with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
>>>
>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>>> Thanks Ryan.
>>>>
>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>>
>>>> spark-shell \
>>>>   --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
>>>>   --conf spark.sql.catalog.hive_prod.type=hive
>>>>
>>>> this statement:
>>>>
>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>
>>>> caused the warning:
>>>>
>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>>>>
>>>> I tried:
>>>>
>>>> * the solution mentioned in https://github.com/apache/iceberg/issues/2260 (put iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars)
>>>> * using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>
>>>> but they did not help. This warning blocks inserting any data into this table. Any ideas are appreciated!
>>>>
>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Lian,
>>>>>
>>>>> I think we should improve the docs for catalogs since it isn't clear. We have a few configuration pages that are helpful, but it looks like they assume you already know what your options are. Take a look at the Spark docs for catalogs, which is the closest we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>
>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>
>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>
>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark and not Iceberg, since it is just a different way to build the same underlying Spark plan.
>>>>>
>>>>> To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
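For reference, a minimal sketch of the SQL-only MERGE INTO that Ryan mentions, run against the hive_prod catalog configured above (table and column names are illustrative):

    // Sketch only: MERGE INTO is available through Spark SQL against
    // Iceberg tables; there is no DataFrame equivalent in Spark.
    spark.sql("""
      MERGE INTO hive_prod.db.target t
      USING hive_prod.db.updates u
      ON t.id = u.id
      WHEN MATCHED THEN UPDATE SET t.data = u.data
      WHEN NOT MATCHED THEN INSERT *
    """)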
>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Eduard and Ryan.
>>>>>>
>>>>>> I use Spark on a K8S cluster to write Parquet on S3 and then add an external table in the Hive metastore for this Parquet. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration on https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>
>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For example, there is no "SQL MERGE INTO"-like support in the Spark API. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind that? I personally feel the Spark API is more dev friendly and scalable. Thanks very much!
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Lian,
>>>>>>>
>>>>>>> Iceberg tables work great in S3. When creating the table, just pass the LOCATION clause with an S3 path, or set your catalog's warehouse location to S3 so tables are automatically created there.
>>>>>>>
>>>>>>> The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie. And the upcoming release adds support for DynamoDB without Glue, and for JDBC.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>
>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an S3FileIO class, which is a FileIO implementation backed by S3.
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering if it is possible to create an Iceberg table on S3. This guide seems to describe writing only to a Hive table (backed by HDFS, if I understand correctly). Hudi and Delta can write to S3 with a specified S3 path. How can I do it using Iceberg? Thanks for any clarification.
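Putting Ryan's and Eduard's pointers together, a minimal sketch of creating an Iceberg table at an S3 location through a Hive catalog with S3FileIO. The bucket and table names are illustrative, and the iceberg-aws classes plus the AWS SDK must be on the classpath (see https://iceberg.apache.org/aws/ for the exact dependencies):

    /spark/bin/spark-shell \
      --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.1 \
      --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
      --conf spark.sql.catalog.hive_prod.type=hive \
      --conf spark.sql.catalog.hive_prod.io-impl=org.apache.iceberg.aws.s3.S3FileIO

    // Sketch only: the LOCATION clause points the table root at S3;
    // Iceberg writes metadata/ and data/ under it, and the Hive
    // metastore tracks the current metadata pointer.
    spark.sql("""
      CREATE TABLE hive_prod.db.s3_table (id bigint, data string)
      USING iceberg
      LOCATION 's3://my-bucket/warehouse/db/s3_table'
    """)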