hive> describe formatted mytable3;
OK
# col_name              data_type               comment
value                   int
# Detailed Table Information
Database:               mydb
OwnerType:              USER
Owner:                  root
CreateTime:             Wed Aug 11 20:02:14 UTC 2021
LastAccessTime:         Sun Jan 11 15:25:29 UTC 1970
Retention:              2147483647
Location:               hdfs://namenode:8020/tmp/test3.ice
Table Type:             EXTERNAL_TABLE
Table Parameters:
    EXTERNAL                TRUE
    metadata_location       hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
    numFiles                8
    storage_handler         org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
    table_type              ICEBERG
    totalSize               12577
    transient_lastDdlTime   1628712134

# Storage Information
SerDe Library:          org.apache.iceberg.mr.hive.HiveIcebergSerDe
InputFormat:            org.apache.iceberg.mr.hive.HiveIcebergInputFormat
OutputFormat:           org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Time taken: 0.319 seconds, Fetched: 29 row(s)

hive> select * from mytable3;
FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice

Thanks!

On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <b...@tabular.io> wrote:

> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if there is a storage handler set up for it.
>
> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Thanks guys. tableProperty("location", ...) works.
>>
>> I am having trouble making Hive query an Iceberg table by following https://iceberg.apache.org/hive/.
>>
>> I have done the following:
>> * in the Hive shell, run `add jar /path/to/iceberg-hive-runtime.jar;`
>> * in hive-site.xml, set hive.vectorized.execution.enabled=false and iceberg.engine.hive.enabled=true.
>> The same hive-site.xml is used by both the Hive server and Spark.
>>
>> This is my code:
>>
>> val table = "hive_test.mydb.mytable3"
>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>> df.writeTo(table)
>>   .tableProperty("write.format.default", "parquet")
>>   .tableProperty("location", filePath)
>>   .createOrReplace()
>>
>> The Iceberg files are created at the specified location, and the table can be queried with Spark SQL:
>>
>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>> Found 2 items
>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata
>>
>> The Hive table is created but cannot be queried:
>>
>> hive> select * from mytable3;
>> FAILED: SemanticException Table does not exist at location: hdfs://namenode:8020/tmp/test3.ice
>>
>> I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am very close to having a happy path for migrating Parquet to Iceberg. Thanks.
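For anyone following along, here is the Hive-side setup from this thread collected in one place. A minimal session sketch, assuming the table was created by Spark as above; the jar path is a placeholder:

    -- in the Hive shell, before querying the Iceberg table
    hive> ADD JAR /path/to/iceberg-hive-runtime.jar;
    hive> SET hive.vectorized.execution.enabled=false;  -- the Hive integration docs call for disabling vectorized execution
    hive> SELECT * FROM mydb.mytable3;

iceberg.engine.hive.enabled=true is the property that matters on the Spark side: it makes Spark register Hive-readable metadata (the storage handler and SerDe visible in the DESCRIBE output above) when it creates the table, so it belongs in the Hadoop/Hive configuration that Spark uses, as was done here.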
>>
>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> The problem for #3 is how Spark handles the options. The option method sets write options, not table properties, and write options aren't passed when creating the table. Instead, you should use tableProperty("location", ...).
>>>
>>> Ryan
>>>
>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> 2) Hive cannot read Iceberg tables without configuring Iceberg's MR Hive integration, so you shouldn't see the data in Hive unless you have configured that; see https://iceberg.apache.org/hive/.
>>>>
>>>> 3) https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>> I would check what properties are set on the table to see why the location wasn't set, but "location" would be the correct way of setting it. That is, unless the property is being ignored by Spark; I'm assuming you are using the latest available build of Spark. There is a bug in Spark 3.0 that sometimes ignores options passed to the V2 API, https://issues.apache.org/jira/browse/SPARK-32592, which is fixed in 3.1.
>>>>
>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>> Any help is highly appreciated!
>>>>
>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Russell.
>>>>>
>>>>> I tried:
>>>>>
>>>>> /spark/bin/spark-shell --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1 --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_test.type=hive
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>> val values = List(1,2,3,4,5)
>>>>>
>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>> import spark.implicits._
>>>>> val df = values.toDF()
>>>>>
>>>>> val table = "hive_test.mydb.mytable3"
>>>>> df.writeTo(table)
>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>>   .createOrReplace()
>>>>>
>>>>> spark.table(table).show()
>>>>>
>>>>> Observations:
>>>>>
>>>>> 1. spark.table(table).show() does show the table correctly:
>>>>> +-----+
>>>>> |value|
>>>>> +-----+
>>>>> |    1|
>>>>> |    2|
>>>>> |    3|
>>>>> |    4|
>>>>> |    5|
>>>>> +-----+
>>>>>
>>>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>>>> hive> select * from mytable3;
>>>>> OK
>>>>> Time taken: 0.158 seconds
>>>>>
>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>
>>>>> Any ideas about 2 and 3? Thanks very much.
>>>>>
>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Specify a property of "location" when creating the table. Just add .option("location", "path").
>>>>>>
>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks Russell. This helps a lot.
>>>>>>
>>>>>> I want to specify an HDFS location when creating an Iceberg dataset using the DataFrame API. All the examples that use a warehouse location are SQL. Do you have an example for the DataFrame API, for example, how to specify an HDFS/S3 location in the query below? The reason I ask is that my current code all uses the Spark API; it will be much easier if I can keep using the Spark API when migrating Parquet to Iceberg. Hope that makes sense.
>>>>>>
>>>>>> data.writeTo("prod.db.table")
>>>>>>   .tableProperty("write.format.default", "orc")
>>>>>>   .partitionedBy($"level", days($"ts"))
>>>>>>   .createOrReplace()
>>>>>>
>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> The config you used specified a catalog named "hive_prod", so to reference it you need to either "USE hive_prod" or refer to the table with the catalog identifier, e.g. "CREATE TABLE hive_prod.default.mytable".
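To make that concrete, a minimal Spark SQL sketch of Russell's two options (the table name is from the message below):

    -- either make hive_prod the current catalog first
    USE hive_prod;
    CREATE TABLE default.mytable (uuid string) USING iceberg;

    -- or qualify the table name with the catalog
    CREATE TABLE hive_prod.default.mytable (uuid string) USING iceberg;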
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Ryan.
>>>>>>>>
>>>>>>>> I am using this command (the uri is omitted because it is in hive-site.xml):
>>>>>>>>
>>>>>>>> spark-shell --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>>>
>>>>>>>> This statement:
>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>>
>>>>>>>> causes this warning:
>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider iceberg.
>>>>>>>>
>>>>>>>> I tried:
>>>>>>>> * the solution (putting iceberg-hive-runtime.jar and iceberg-spark3-runtime.jar into spark/jars) mentioned in https://github.com/apache/iceberg/issues/2260
>>>>>>>> * using --packages org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>
>>>>>>>> but they did not help. This warning blocks inserting any data into the table. Any ideas are appreciated!
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Lian,
>>>>>>>>>
>>>>>>>>> I think we should improve the docs for catalogs since this isn't clear. We have a few configuration pages that are helpful, but they seem to assume you already know what your options are. Take a look at the Spark docs for catalogs, which are the closest thing we have right now: https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>
>>>>>>>>> What you'll want to do is configure a catalog like the first example:
>>>>>>>>>
>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
>>>>>>>>>
>>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark, which is why it can only be used through SQL. This is something that should probably be added to Spark rather than Iceberg, since it is just a different way to build the same underlying Spark plan.
>>>>>>>>>
>>>>>>>>> To your question about DataFrames vs SQL, I highly recommend SQL over DataFrames so that you don't end up needing to use jars produced by compiling Scala code. I think it's easier to just use SQL. But Iceberg should support both, because DataFrames are useful for customization in some cases. It really should be up to you and what you want to use, as far as Iceberg is concerned.
>>>>>>>>>
>>>>>>>>> Ryan
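Since MERGE INTO comes up a couple of times here, a minimal Spark SQL sketch of the SQL-only form (table and column names are hypothetical):

    MERGE INTO hive_prod.db.target t
    USING hive_prod.db.updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.value = u.value
    WHEN NOT MATCHED THEN INSERT *;

Note that MERGE INTO needs the Iceberg SQL extensions enabled (spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions).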
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>
>>>>>>>>>> I use Spark on a K8s cluster to write Parquet to S3 and then add an external table in the Hive metastore for that Parquet data. In the future, when using Iceberg, I would prefer the Hive metastore since it is my centralized metastore for batch and streaming datasets. I don't see that the Hive metastore is supported in the Iceberg AWS integration at https://iceberg.apache.org/aws/. Is there another link for that?
>>>>>>>>>>
>>>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg; for example, there is no Spark API equivalent of SQL's MERGE INTO. Is Spark SQL preferred over the Spark DataFrame/Dataset API in Iceberg? If so, could you clarify the rationale behind that? I personally find the Spark API more developer-friendly and scalable. Thanks very much!
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lian,
>>>>>>>>>>>
>>>>>>>>>>> Iceberg tables work great in S3. When creating the table, just pass the `LOCATION` clause with an S3 path, or set your catalog's warehouse location to S3 so that tables are automatically created there.
>>>>>>>>>>>
>>>>>>>>>>> The only restriction for S3 is that you need a metastore to track the table metadata location, because S3 doesn't have a way to implement a metadata commit. For a metastore, there are implementations backed by the Hive MetaStore, Glue/DynamoDB, and Nessie, and the upcoming release adds support for DynamoDB without Glue and for JDBC.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner <edu...@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/. It should contain all the info that you need. The codebase contains an S3FileIO class, which is a FileIO implementation backed by S3.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am reading https://iceberg.apache.org/spark-writes/#spark-writes and wondering whether it is possible to create an Iceberg table on S3. This guide seems to say you can only write to a Hive table (backed by HDFS, if I understand correctly). Hudi and Delta can write to S3 with a specified S3 path. How can I do the same using Iceberg? Thanks for any clarification.
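Putting Ryan's and Eduard's advice together, a hedged end-to-end sketch of creating an Iceberg table on S3 through a Hive-metastore catalog. The catalog name, bucket, and paths are illustrative; S3FileIO requires the iceberg-aws module and the AWS SDK on the classpath; and for a Hive-type catalog the warehouse default may instead come from the metastore configuration:

    # catalog configuration (spark-defaults.conf or --conf flags)
    spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.hive_prod.type = hive
    spark.sql.catalog.hive_prod.warehouse = s3://my-bucket/warehouse
    spark.sql.catalog.hive_prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO

    -- then create the table with an explicit S3 location, or omit LOCATION to use the warehouse default
    CREATE TABLE hive_prod.db.sample (id bigint, data string)
    USING iceberg
    LOCATION 's3://my-bucket/path/sample';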