Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is a storage handler set up for it.
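
For example (the table name is taken from your snippet below):

hive> DESCRIBE FORMATTED mydb.mytable3;

If the Hive integration is set up, the table parameters in the output should
include a storage_handler entry pointing at
org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.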

On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:

> Thanks guys. tableProperty("location", ...) works.
>
> I am now having trouble getting Hive to query an Iceberg table by following
> https://iceberg.apache.org/hive/.
>
> I have done:
> * in the Hive shell, `add jar /path/to/iceberg-hive-runtime.jar;`
> * in hive-site.xml, added hive.vectorized.execution.enabled=false and
>   iceberg.engine.hive.enabled=true.
> The same hive-site.xml is used by both the Hive server and Spark.
>
> This is my code:
>
> val table = "hive_test.mydb.mytable3"
> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
> df.writeTo(table)
>   .tableProperty("write.format.default", "parquet")
>   .tableProperty("location", filePath)
>   .createOrReplace()
>
> The Iceberg files are created in the specified location, and the table can
> be queried in Spark SQL:
>
> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
> Found 2 items
> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata
>
> The Hive table is created but cannot be queried:
>
> hive> select * from mytable3;
> FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice
>
> I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very
> close to having a happy path for migrating parquet to Iceberg. Thanks.
>
> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:
>
>> The problem for #3 is how Spark handles the options. The option method
>> sets write options, not table properties, and write options aren't passed
>> when creating the table. Instead, you should use tableProperty("location", ...).
>>
>> Ryan
>>
>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> 2) Hive cannot read Iceberg tables without configuring the MR Hive
>>> integration from Iceberg, so you shouldn't see the table in Hive unless
>>> you have configured that. See https://iceberg.apache.org/hive/.
>>>
>>> 3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>> I would check what properties are set on the table to see why the
>>> location wasn't set, but "location" is the correct way of setting it,
>>> unless the property is being ignored by Spark. I'm assuming you are using
>>> the latest Spark build possible; there is a bug in Spark 3.0 that
>>> sometimes ignores options passed to the V2 API,
>>> https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.
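>>>
>>> For Hive-catalog tables the storage handler should be registered for you
>>> when iceberg.engine.hive.enabled is true. If the table is tracked outside
>>> the Hive catalog, the Hive docs show creating an external table overlay
>>> by hand. A rough sketch (the jar path and location are placeholders taken
>>> from your earlier mail):
>>>
>>> ADD JAR /path/to/iceberg-hive-runtime.jar;
>>> CREATE EXTERNAL TABLE mytable3
>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>> LOCATION 'hdfs://namenode:8020/tmp/test.ice';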
>>>
>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>> Any help is highly appreciated!
>>>
>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>>> Thanks Russell.
>>>>
>>>> I tried:
>>>>
>>>> /spark/bin/spark-shell --packages
>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> val values = List(1, 2, 3, 4, 5)
>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>> import spark.implicits._
>>>> val df = values.toDF()
>>>>
>>>> val table = "hive_test.mydb.mytable3"
>>>> df.writeTo(table)
>>>>   .tableProperty("write.format.default", "parquet")
>>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>   .createOrReplace()
>>>>
>>>> spark.table(table).show()
>>>>
>>>> Observations:
>>>>
>>>> 1. spark.table(table).show() does show the table correctly:
>>>>
>>>> +-----+
>>>> |value|
>>>> +-----+
>>>> |    1|
>>>> |    2|
>>>> |    3|
>>>> |    4|
>>>> |    5|
>>>> +-----+
>>>>
>>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>>>
>>>> hive> select * from mytable3;
>>>> OK
>>>> Time taken: 0.158 seconds
>>>>
>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>
>>>> Any ideas about 2 and 3? Thanks very much.
>>>>
>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Specify a property of "location" when creating the table. Just add
>>>>> .option("location", "path").
>>>>>
>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>> Thanks Russell. This helps a lot.
>>>>>
>>>>> I want to specify an HDFS location when creating an Iceberg dataset
>>>>> using the DataFrame API. All the examples that use a warehouse location
>>>>> are SQL. Do you have an example for the DataFrame API? For example, how
>>>>> would I set an HDFS/S3 location in the query below? The reason I ask is
>>>>> that my current code all uses the Spark API, so it will be much easier
>>>>> if I can keep using it when migrating parquet to Iceberg. Hope it makes
>>>>> sense.
>>>>>
>>>>> data.writeTo("prod.db.table")
>>>>>   .tableProperty("write.format.default", "orc")
>>>>>   .partitionedBy($"level", days($"ts"))
>>>>>   .createOrReplace()
>>>>>
>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>> reference it you need to either "use hive_prod" or refer to the table
>>>>>> with the catalog identifier: "CREATE TABLE hive_prod.default.mytable".
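>>>>>>
>>>>>> That is, roughly (a sketch using the table from your example):
>>>>>>
>>>>>> USE hive_prod;
>>>>>> CREATE TABLE default.mytable (uuid string) USING iceberg;
>>>>>>
>>>>>> -- or, fully qualified, without switching catalogs:
>>>>>> CREATE TABLE hive_prod.default.mytable (uuid string) USING iceberg;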
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Ryan.
>>>>>>>
>>>>>>> Using this command (uri is omitted because the uri is in hive-site.xml):
>>>>>>>
>>>>>>> spark-shell --conf
>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
>>>>>>> --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>>
>>>>>>> this statement:
>>>>>>>
>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>
>>>>>>> caused the warning:
>>>>>>>
>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for
>>>>>>> data source provider iceberg.
>>>>>>>
>>>>>>> I tried:
>>>>>>> * the solution (put iceberg-hive-runtime.jar and
>>>>>>>   iceberg-spark3-runtime.jar into spark/jars) mentioned in
>>>>>>>   https://github.com/apache/iceberg/issues/2260
>>>>>>> * using --packages
>>>>>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>
>>>>>>> but they did not help. This warning blocks inserting any data into
>>>>>>> this table. Any ideas are appreciated!
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Lian,
>>>>>>>>
>>>>>>>> I think we should improve the docs for catalogs since this isn't
>>>>>>>> clear. We have a few configuration pages that are helpful, but it
>>>>>>>> looks like they assume you already know what your options are. Take
>>>>>>>> a look at the Spark docs for catalogs, which are the closest we have
>>>>>>>> right now:
>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>
>>>>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>>>>
>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>
>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is
>>>>>>>> why it can only be used through SQL. This is something that should
>>>>>>>> probably be added to Spark, not Iceberg, since it is just a different
>>>>>>>> way to build the same underlying Spark plan.
>>>>>>>>
>>>>>>>> To your question about DataFrames vs SQL, I highly recommend SQL
>>>>>>>> over DataFrames so that you don't end up needing to use jars produced
>>>>>>>> by compiling Scala code. I think it's easier to just use SQL. But
>>>>>>>> Iceberg should support both, because DataFrames are useful for
>>>>>>>> customization in some cases. It really should be up to you and what
>>>>>>>> you want to use, as far as Iceberg is concerned.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>
>>>>>>>>> I use Spark on a K8S cluster to write parquet to S3 and then add an
>>>>>>>>> external table in the Hive metastore for that parquet. In the
>>>>>>>>> future, when using Iceberg, I would prefer the Hive metastore since
>>>>>>>>> it is my centralized metastore for batch and streaming datasets. I
>>>>>>>>> don't see that the Hive metastore is supported in the Iceberg AWS
>>>>>>>>> integration on https://iceberg.apache.org/aws/. Is there another
>>>>>>>>> link for that?
>>>>>>>>>
>>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For
>>>>>>>>> example, there is no "SQL MERGE INTO"-like support in the Spark API.
>>>>>>>>> Is Spark SQL preferred over the Spark DataFrame/Dataset API in
>>>>>>>>> Iceberg? If so, could you clarify the rationale behind that? I
>>>>>>>>> personally find the Spark API more dev-friendly and scalable.
>>>>>>>>> Thanks very much!
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Lian,
>>>>>>>>>>
>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just pass
>>>>>>>>>> the `LOCATION` clause with an S3 path, or set your catalog's
>>>>>>>>>> warehouse location to S3 so tables are automatically created there.
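>>>>>>>>>>
>>>>>>>>>> For example (a sketch; the catalog name and bucket are placeholders):
>>>>>>>>>>
>>>>>>>>>> CREATE TABLE hive_prod.db.sample (id bigint, data string)
>>>>>>>>>> USING iceberg
>>>>>>>>>> LOCATION 's3://my-bucket/warehouse/db/sample';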
>>>>>>>>>>
>>>>>>>>>> The only restriction for S3 is that you need a metastore to track
>>>>>>>>>> the table metadata location, because S3 doesn't have a way to
>>>>>>>>>> implement a metadata commit. For a metastore, there are
>>>>>>>>>> implementations backed by the Hive MetaStore, Glue/DynamoDB, and
>>>>>>>>>> Nessie, and the upcoming release adds support for DynamoDB without
>>>>>>>>>> Glue, and for JDBC.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It
>>>>>>>>>>> should contain all the info that you need. The codebase contains
>>>>>>>>>>> an S3FileIO class, which is an implementation backed by S3.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes
>>>>>>>>>>>> and wondering if it is possible to create an Iceberg table on S3.
>>>>>>>>>>>> This guide seems to only cover writing to a Hive table (backed by
>>>>>>>>>>>> HDFS, if I understand correctly). Hudi and Delta can write to S3
>>>>>>>>>>>> with a specified S3 path. How can I do it using Iceberg? Thanks
>>>>>>>>>>>> for any clarification.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular