It looks like the table is set up correctly, so I think the problem is in how Hive is configured: by default, 0.11.1 tries to load tables by location. You need to tell it to load tables as metastore tables, not HDFS tables, by running `SET iceberg.mr.catalog=hive`.
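
For example, a minimal Hive session would look something like this (reusing the runtime jar you already added; adjust the path for your install):

add jar /path/to/iceberg-hive-runtime.jar;
-- load tables through the Hive catalog instead of by location
SET iceberg.mr.catalog=hive;
select * from mydb.mytable3;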

On Wed, Aug 11, 2021 at 3:51 PM Lian Jiang <jiangok2...@gmail.com> wrote:

> hive> describe formatted mytable3;
> OK
> # col_name            data_type    comment
> value                 int
>
> # Detailed Table Information
> Database:             mydb
> OwnerType:            USER
> Owner:                root
> CreateTime:           Wed Aug 11 20:02:14 UTC 2021
> LastAccessTime:       Sun Jan 11 15:25:29 UTC 1970
> Retention:            2147483647
> Location:             hdfs://namenode:8020/tmp/test3.ice
> Table Type:           EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL              TRUE
>   metadata_location     hdfs://namenode:8020/tmp/test3.ice/metadata/00000-0918c08e-16b0-4484-87f3-3c263f0e7d55.metadata.json
>   numFiles              8
>   storage_handler       org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
>   table_type            ICEBERG
>   totalSize             12577
>   transient_lastDdlTime 1628712134
>
> # Storage Information
> SerDe Library:        org.apache.iceberg.mr.hive.HiveIcebergSerDe
> InputFormat:          org.apache.iceberg.mr.hive.HiveIcebergInputFormat
> OutputFormat:         org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
> Compressed:           No
> Num Buckets:          0
> Bucket Columns:       []
> Sort Columns:         []
> Time taken: 0.319 seconds, Fetched: 29 row(s)
>
> hive> select * from mytable3;
> FAILED: SemanticException Table does not exist at location:
> hdfs://namenode:8020/tmp/test3.ice
>
> Thanks!
>
> On Wed, Aug 11, 2021 at 2:00 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Can you run `DESCRIBE FORMATTED` for the table? Then we can see if
>> there is a storage handler set up for it.
>>
>> On Wed, Aug 11, 2021 at 1:46 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks guys. tableProperty("location", ...) works.
>>>
>>> I am having trouble making Hive query an Iceberg table by following
>>> https://iceberg.apache.org/hive/.
>>>
>>> I have done the following:
>>> * in the Hive shell, ran `add jar /path/to/iceberg-hive-runtime.jar;`
>>> * in hive-site.xml, set hive.vectorized.execution.enabled=false and
>>>   iceberg.engine.hive.enabled=true.
>>> The same hive-site.xml is used by both the Hive server and Spark.
>>>
>>> This is my code:
>>> val table = "hive_test.mydb.mytable3"
>>> val filePath = "hdfs://namenode:8020/tmp/test3.ice"
>>> df.writeTo(table)
>>>   .tableProperty("write.format.default", "parquet")
>>>   .tableProperty("location", filePath)
>>>   .createOrReplace()
>>>
>>> The Iceberg table is created in the specified location and can be
>>> queried in Spark SQL:
>>> root@datanode:/# hdfs dfs -ls /tmp/test3.ice/
>>> Found 2 items
>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/data
>>> drwxrwxr-x   - root supergroup          0 2021-08-11 20:02 /tmp/test3.ice/metadata
>>>
>>> The Hive table is created but cannot be queried:
>>> hive> select * from mytable3;
>>> FAILED: SemanticException Table does not exist at location:
>>> hdfs://namenode:8020/tmp/test3.ice
>>>
>>> I am using Spark 3.1.1 and Hive 3.1.2. What else am I missing? I am
>>> very close to having a happy path for migrating Parquet to Iceberg.
>>> Thanks.
>>>
>>> On Wed, Aug 11, 2021 at 12:40 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> The problem for #3 is how Spark handles the options. The option
>>>> method sets write options, not table properties, and write options
>>>> aren't passed when creating the table. Instead, you should use
>>>> tableProperty("location", ...).
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Aug 11, 2021 at 9:17 AM Russell Spitzer
>>>> <russell.spit...@gmail.com> wrote:
>>>>
>>>>> 2) Hive cannot read Iceberg tables without configuring the Hive MR
>>>>> integration from Iceberg.
>>>>> So you shouldn't see it in Hive unless you have configured that;
>>>>> see https://iceberg.apache.org/hive/.
>>>>>
>>>>> 3)
>>>>> https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L137
>>>>> I would check what properties are set on the table to see why that
>>>>> wasn't set, but "location" is the correct property for setting the
>>>>> table location, unless the property is being ignored by Spark. I'm
>>>>> assuming you are using the latest build of Spark possible: there is
>>>>> a bug in Spark 3.0 that sometimes ignores options passed to the V2
>>>>> API (https://issues.apache.org/jira/browse/SPARK-32592), which is
>>>>> fixed in 3.1.
>>>>>
>>>>> On Aug 11, 2021, at 11:00 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>> Any help is highly appreciated!
>>>>>
>>>>> On Tue, Aug 10, 2021 at 11:06 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Russell.
>>>>>>
>>>>>> I tried:
>>>>>>
>>>>>> /spark/bin/spark-shell --packages
>>>>>> org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>> --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog
>>>>>> --conf spark.sql.catalog.hive_test.type=hive
>>>>>>
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>
>>>>>> val values = List(1,2,3,4,5)
>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>> import spark.implicits._
>>>>>> val df = values.toDF()
>>>>>>
>>>>>> val table = "hive_test.mydb.mytable3"
>>>>>> df.writeTo(table)
>>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>>   .option("location", "hdfs://namenode:8020/tmp/test.ice")
>>>>>>   .createOrReplace()
>>>>>>
>>>>>> spark.table(table).show()
>>>>>>
>>>>>> Observations:
>>>>>> 1. spark.table(table).show() does show the table correctly.
>>>>>> +-----+
>>>>>> |value|
>>>>>> +-----+
>>>>>> |    1|
>>>>>> |    2|
>>>>>> |    3|
>>>>>> |    4|
>>>>>> |    5|
>>>>>> +-----+
>>>>>>
>>>>>> 2. mydb.mytable3 is created in Hive but it is empty:
>>>>>> hive> select * from mytable3;
>>>>>> OK
>>>>>> Time taken: 0.158 seconds
>>>>>>
>>>>>> 3. test.ice is not generated in the HDFS folder /tmp.
>>>>>>
>>>>>> Any ideas about 2 and 3? Thanks very much.
>>>>>>
>>>>>> On Tue, Aug 10, 2021 at 9:38 AM Russell Spitzer
>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> Specify a property of "location" when creating the table. Just
>>>>>>> add .option("location", "path").
>>>>>>>
>>>>>>> On Aug 10, 2021, at 11:15 AM, Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks Russell. This helps a lot.
>>>>>>>
>>>>>>> I want to specify an HDFS location when creating an Iceberg
>>>>>>> dataset using the DataFrame API. All the examples that use a
>>>>>>> warehouse location are SQL. Do you have an example for the
>>>>>>> DataFrame API? For example, how would I specify an HDFS/S3
>>>>>>> location in the query below? The reason I ask is that my current
>>>>>>> code all uses the Spark API. It will be much easier if I can use
>>>>>>> the Spark API when migrating Parquet to Iceberg. Hope it makes
>>>>>>> sense.
>>>>>>>
>>>>>>> data.writeTo("prod.db.table")
>>>>>>>     .tableProperty("write.format.default", "orc")
>>>>>>>     .partitionBy($"level", days($"ts"))
>>>>>>>     .createOrReplace()
>>>>>>>
>>>>>>> On Mon, Aug 9, 2021 at 4:22 PM Russell Spitzer
>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The config you used specified a catalog named "hive_prod", so to
>>>>>>>> reference it you need to either "use hive_prod" or refer to the
>>>>>>>> table with the catalog identifier: "CREATE TABLE
>>>>>>>> hive_prod.default.mytable".
>>>>>>>>
>>>>>>>> On Mon, Aug 9, 2021 at 6:15 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ryan.
>>>>>>>>>
>>>>>>>>> Using this command (the uri is omitted because it is in
>>>>>>>>> hive-site.xml):
>>>>>>>>> spark-shell --conf
>>>>>>>>> spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog
>>>>>>>>> --conf spark.sql.catalog.hive_prod.type=hive
>>>>>>>>>
>>>>>>>>> this statement:
>>>>>>>>> spark.sql("CREATE TABLE default.mytable (uuid string) USING iceberg")
>>>>>>>>>
>>>>>>>>> caused the warning:
>>>>>>>>> WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe
>>>>>>>>> for data source provider iceberg.
>>>>>>>>>
>>>>>>>>> I tried:
>>>>>>>>> * the solution (putting iceberg-hive-runtime.jar and
>>>>>>>>>   iceberg-spark3-runtime.jar into spark/jars) mentioned in
>>>>>>>>>   https://github.com/apache/iceberg/issues/2260
>>>>>>>>> * using --packages
>>>>>>>>>   org.apache.iceberg:iceberg-hive-runtime:0.11.1,org.apache.iceberg:iceberg-spark3-runtime:0.11.1
>>>>>>>>>
>>>>>>>>> but they did not help. This warning blocks inserting any data
>>>>>>>>> into the table. Any ideas are appreciated!
>>>>>>>>>
>>>>>>>>> On Mon, Aug 9, 2021 at 10:15 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Lian,
>>>>>>>>>>
>>>>>>>>>> I think we should improve the docs for catalogs since this
>>>>>>>>>> isn't clear. We have a few configuration pages that are
>>>>>>>>>> helpful, but it looks like they assume you already know what
>>>>>>>>>> your options are. Take a look at the Spark docs for catalogs,
>>>>>>>>>> which are the closest we have right now:
>>>>>>>>>> https://iceberg.apache.org/spark-configuration/#catalog-configuration
>>>>>>>>>>
>>>>>>>>>> What you'll want to do is configure a catalog like the first
>>>>>>>>>> example:
>>>>>>>>>>
>>>>>>>>>> spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
>>>>>>>>>> spark.sql.catalog.hive_prod.type = hive
>>>>>>>>>> spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
>>>>>>>>>> # omit uri to use the same URI as Spark: hive.metastore.uris
>>>>>>>>>> # in hive-site.xml
>>>>>>>>>>
>>>>>>>>>> For MERGE INTO, the DataFrame API is not present in Spark,
>>>>>>>>>> which is why it is only available through SQL. This is
>>>>>>>>>> something that should probably be added to Spark rather than
>>>>>>>>>> Iceberg, since it is just a different way to build the same
>>>>>>>>>> underlying Spark plan.
>>>>>>>>>>
>>>>>>>>>> To your question about DataFrames vs SQL, I highly recommend
>>>>>>>>>> SQL over DataFrames so that you don't end up needing to use
>>>>>>>>>> jars produced by compiling Scala code. I think it's easier to
>>>>>>>>>> just use SQL. But Iceberg should support both, because
>>>>>>>>>> DataFrames are useful for customization in some cases. As far
>>>>>>>>>> as Iceberg is concerned, it really should be up to you and
>>>>>>>>>> what you want to use.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 9, 2021 at 9:31 AM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Eduard and Ryan.
>>>>>>>>>>>
>>>>>>>>>>> I use Spark on a K8S cluster to write Parquet to S3 and then
>>>>>>>>>>> add an external table in the Hive metastore for that Parquet.
>>>>>>>>>>> In the future, when using Iceberg, I would prefer the Hive
>>>>>>>>>>> metastore since it is my centralized metastore for batch and
>>>>>>>>>>> streaming datasets. I don't see that the Hive metastore is
>>>>>>>>>>> supported in the Iceberg AWS integration on
>>>>>>>>>>> https://iceberg.apache.org/aws/. Is there another link for
>>>>>>>>>>> that?
>>>>>>>>>>>
>>>>>>>>>>> Most of the examples use Spark SQL to write/read Iceberg. For
>>>>>>>>>>> example, there is no "SQL MERGE INTO"-like support in the
>>>>>>>>>>> Spark API. Is Spark SQL preferred over the Spark
>>>>>>>>>>> DataFrame/Dataset API in Iceberg? If so, could you clarify
>>>>>>>>>>> the rationale behind that? I personally feel the Spark API is
>>>>>>>>>>> more dev friendly and scalable. Thanks very much!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 9, 2021 at 8:53 AM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Lian,
>>>>>>>>>>>>
>>>>>>>>>>>> Iceberg tables work great in S3. When creating the table,
>>>>>>>>>>>> just pass the `LOCATION` clause with an S3 path, or set your
>>>>>>>>>>>> catalog's warehouse location to S3 so tables are
>>>>>>>>>>>> automatically created there.
>>>>>>>>>>>>
>>>>>>>>>>>> The only restriction for S3 is that you need a metastore to
>>>>>>>>>>>> track the table metadata location, because S3 doesn't have a
>>>>>>>>>>>> way to implement a metadata commit. For a metastore, there
>>>>>>>>>>>> are implementations backed by the Hive MetaStore,
>>>>>>>>>>>> Glue/DynamoDB, and Nessie. And the upcoming release adds
>>>>>>>>>>>> support for DynamoDB without Glue, and for JDBC.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Aug 9, 2021 at 2:24 AM Eduard Tudenhoefner
>>>>>>>>>>>> <edu...@dremio.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Lian, you can have a look at https://iceberg.apache.org/aws/.
>>>>>>>>>>>>> It should contain all the info that you need. The codebase
>>>>>>>>>>>>> contains an S3FileIO class, which is a FileIO
>>>>>>>>>>>>> implementation backed by S3.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 9, 2021 at 7:37 AM Lian Jiang
>>>>>>>>>>>>> <jiangok2...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am reading
>>>>>>>>>>>>>> https://iceberg.apache.org/spark-writes/#spark-writes and
>>>>>>>>>>>>>> wondering if it is possible to create an Iceberg table on
>>>>>>>>>>>>>> S3. The guide seems to say it only writes to a Hive table
>>>>>>>>>>>>>> (backed by HDFS, if I understand correctly). Hudi and
>>>>>>>>>>>>>> Delta can write to S3 with a specified S3 path. How can I
>>>>>>>>>>>>>> do the same using Iceberg? Thanks for any clarification.

--
Ryan Blue
Tabular