#general


@sangarshanan: Hi all! Just wanted to know if it is possible to create and update Iceberg tables without using Spark for compute, using just the Java APIs?
  @dweeks: Yes, you can use the java APIs directly to update tables. You still need a catalog implementation for handling the commits, but everything should be possible through the APIs.
  @dweeks: Could you share your use case? Just wondering if you're trying to simply append existing files or you're trying to integrate something else.
  @sangarshanan: My use case is to build a service that can perform CRUD on Iceberg tables backed by Hive as the catalog, and provide an abstraction for users to run it without the overhead of Spark
  @sangarshanan: This is roughly what I came up with, but I got hit with some bugs that I am trying to solve:
  ```
  Configuration conf = new Configuration();
  conf.set("hive.metastore.uris", "");
  Catalog catalog = new HiveCatalog(conf);
  TableIdentifier name = TableIdentifier.of("logging", "logs");
  Table table = catalog.createTable(name, schema, spec);
  ```
  Just wanted to know if I am moving in the right direction, since I could not find any examples that used the Java APIs to directly create and update Iceberg tables
  @dweeks: That all looks right. From the table you should be able to use the APIs to manipulate the table. Those are the same APIs that Spark/Trino/Flink use, so you might be able to find examples of the API use there.
  @dweeks: The tests are also a good place to look since they exercise all of these code paths.
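For reference, a minimal sketch of this flow using only the Java API (no Spark). It assumes Iceberg ~0.12 with iceberg-core, iceberg-hive-metastore, and the Hadoop/Hive client jars on the classpath, and uses the initialize-based HiveCatalog setup rather than the deprecated HiveCatalog(conf) constructor; the metastore URI, warehouse location, schema, data file path, and metrics are all placeholders:
```
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.types.Types;

public class IcebergWithoutSpark {
  public static void main(String[] args) {
    // Catalog setup via initialize(), rather than the deprecated HiveCatalog(conf) constructor
    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(new Configuration());
    Map<String, String> properties = new HashMap<>();
    properties.put(CatalogProperties.URI, "thrift://metastore-host:9083");          // placeholder
    properties.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://bucket/warehouse");  // placeholder
    catalog.initialize("hive", properties);

    // Create a table (schema and spec are illustrative)
    Schema schema = new Schema(
        Types.NestedField.required(1, "level", Types.StringType.get()),
        Types.NestedField.required(2, "message", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.unpartitioned();
    Table table = catalog.createTable(TableIdentifier.of("logging", "logs"), schema, spec);

    // "Update" the table by committing an append of a data file that was already written
    // out of band; the path and metrics below are placeholders
    DataFile file = DataFiles.builder(spec)
        .withPath("s3://bucket/warehouse/logging/logs/data/00000-0.parquet")
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(1024L)
        .withRecordCount(100L)
        .build();
    table.newAppend().appendFile(file).commit();
  }
}
```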
@aman.rawat: Hi everyone, is there any high-level API/abstraction available to enable row-level deletes as part of delete, update, or merge operations in table format v2?
  @russell.spitzer: Row-level deletes are still not supported in the Spark API, so there is no way to enable them there
  @aman.rawat: Thanks @russell.spitzer for the update.
  @gsreeramkumar: Hi @russell.spitzer, can you please throw some light on: 1. What is the overall state of record-level deletes? Is it supported in any other engine, or is it just implemented in the core API and waiting for engine-specific adoption? 2. Is there a work stream or ongoing work for implementing this in Spark where we can come and contribute? Truly appreciate your inputs.
@mohamed.jouini.pro: Hi everyone. I tried to use Iceberg release 0.12.0 with the DynamoDB catalog, and this is my SparkSession configuration:
```
.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.set("spark.sql.catalog.iceberg_dynamo_poc", "org.apache.iceberg.spark.SparkCatalog")
.set("spark.sql.catalog.iceberg_dynamo_poc.warehouse", "")
.set("spark.sql.catalog.iceberg_dynamo_poc.catalog-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog")
.set("spark.sql.catalog.iceberg_dynamo_poc.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.set("spark.sql.catalog.iceberg_dynamo_poc.dynamodb.table-name", "IcebergCatalog")
```
and I got this error when creating a table:
```
spark.sql("CREATE TABLE iceberg_dynamo_poc.dynamo1 ( \
    id bigint, \
    pathId string \
    ) \
    USING iceberg")
```
```
Py4JJavaError: An error occurred while calling o1836.sql.
: java.lang.NoSuchMethodError: org.apache.iceberg.aws.AwsProperties.dynamoDbTableName()Ljava/lang/String;
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.ensureCatalogTableExistsOrCreate(DynamoDbCatalog.java:537)
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.initialize(DynamoDbCatalog.java:133)
  at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.initialize(DynamoDbCatalog.java:118)
  at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:183)
```
  @russell.spitzer: Looks like you are missing some things on your classpath. Did you make sure to include iceberg-spark3-runtime, the AWS SDK bundle, and the url-connection-client?
  @dweeks: I suspect that the issue is actually due to not having the aws-java-sdk-v2 in your classpath
  @dweeks: Dynamo (and most of the native AWS support) uses sdk v2, which is not part of the bundle
  @russell.spitzer: oh do we need that in the docs then?
  @dweeks: Probably, though maybe we should just bundle it with `iceberg-aws`?
  @russell.spitzer: As long as it isn't versioned with the other aws libs I think that's fine
  @russell.spitzer: I assumed we didn't include the other libs so that patch releases would be easier to incorporate for end users
  @blue: I think the docs cover adding it to the classpath, but it may not be called out very clearly
  @russell.spitzer: This is what the docs list in the startup instructions:
  ```
  # add Iceberg dependency
  ICEBERG_VERSION=0.12.0
  DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"

  # add AWS dependency
  AWS_SDK_VERSION=2.15.40
  AWS_MAVEN_GROUP=software.amazon.awssdk
  AWS_PACKAGES=(
      "bundle"
      "url-connection-client"
  )
  for pkg in "${AWS_PACKAGES[@]}"; do
      DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
  done
  ```
  @russell.spitzer: bundle, url-connection, spark3-runtime
  @blue: Yeah, but does that require reading a shell script or is it stated that you need bundle and url-connection-client?
  @russell.spitzer: oh sorry I thought you meant it listed aws-java-sdk-v2
  @dweeks: no, the sdk v2 is `software.amazon.awssdk`
  @russell.spitzer: I thought Daniel was noting that a 4th dependency was also required that is currently not a part of that script
  @dweeks: Looks like all that's required is there, but this step may have been missed or it didn't get into the classpath correctly
  @russell.spitzer: ah so it is just those 2 other libs, then yes I agree we should pull those out and not write this as a shell script
  @dweeks: Yeah, even in the examples it requires a pull from maven, which is not ideal
  @russell.spitzer: I still understand though if we want to support changing the patch version
  @russell.spitzer: but in this example I would just enumerate everything explicitly
  @blue: For 3 packages, having a loop doesn't make a ton of sense to me
  @russell.spitzer: the loop is just for 2 of them as well :slightly_smiling_face:
  @blue: It's a good script for setting up EMR, but probably better to be simple for CLI use
  @dweeks: I guess we should probably have @mohamed.jouini.pro verify that this is the issue though
  @mohamed.jouini.pro: Please find the entire `pyspark` code
  ```
  from pyspark.sql import SparkSession

  conf = (sc.getConf()
      .set("spark.jars", "iceberg-spark-runtime-0.12.0.jar,bundle-2.15.40.jar,url-connection-client-2.15.40.jar")
      .set("spark.jars.packages", "org.apache.spark:spark-avro_2.12:4.0.0")
      .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
      .set("spark.sql.catalog.iceberg_dynamo_poc", "org.apache.iceberg.spark.SparkCatalog")
      .set("spark.sql.catalog.iceberg_dynamo_poc.warehouse", "")
      .set("spark.sql.catalog.iceberg_dynamo_poc.catalog-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog")
      .set("spark.sql.catalog.iceberg_dynamo_poc.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .set("spark.sql.catalog.iceberg_dynamo_poc.dynamodb.table-name", "IcebergCatalog")
      .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=./log4j.properties")
  )
  spark.stop()
  spark = SparkSession.builder.config(conf=conf).getOrCreate()
  ```
  @blue: @mohamed.jouini.pro, thanks for that. Have you downloaded the AWS SDK jars as well? I don't think that Spark will download them for you
  @russell.spitzer: Yeah if you set it in "jars" it will not download dependencies of those libraries
  @dweeks: Hmm, so `bundle-2.15.40.jar,url-connection-client-2.15.40.jar` these are the jars in question.
  @russell.spitzer: "Packages" will download the library and all their dependencies
  @dweeks: Since they're alongside the iceberg runtime, maybe they were already downloaded?
  @russell.spitzer: but if they have secondary deps those would be missing, and I also think this is one of those cases where Spark only warns if the jars are missing
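A minimal sketch of the `spark.jars.packages` approach mentioned above (shown in Java, but the same config keys apply from PySpark); the coordinates are taken from the docs snippet earlier in the thread, and unlike `spark.jars`, artifacts listed here are resolved from Maven together with their transitive dependencies:
```
import org.apache.spark.sql.SparkSession;

public class IcebergAwsDeps {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-dynamodb-poc")  // placeholder app name
        // spark.jars.packages pulls each artifact and its transitive deps from Maven,
        // whereas spark.jars only adds the listed local jars to the classpath
        .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark3-runtime:0.12.0,"
                + "software.amazon.awssdk:bundle:2.15.40,"
                + "software.amazon.awssdk:url-connection-client:2.15.40")
        .getOrCreate();
  }
}
```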
  @mohamed.jouini.pro: The jars are already in the EMR local lib path
  ```
  [hadoop@emr ~]$ ls -ltr /usr/share/aws/aws-java-sdk/iceberg*
  -rw-r--r-- 1 root root  19201102 Aug 10 15:49 /usr/share/aws/aws-java-sdk/iceberg-spark3-runtime-0.11.1.jar
  -rw-r--r-- 1 root root  25809685 Aug 26 13:51 /usr/share/aws/aws-java-sdk/iceberg-spark-runtime-0.12.0.jar
  [hadoop@emr ~]$ ls -ltr /usr/share/aws/aws-java-sdk/url-connection*
  -rw-r--r-- 1 root root     21027 Aug 10 15:49 /usr/share/aws/aws-java-sdk/url-connection-client-2.15.40.jar
  [hadoop@emr ~]$ ls -ltr /usr/share/aws/aws-java-sdk/bundle*
  -rw-r--r-- 1 root root 257939967 Aug 10 15:49 /usr/share/aws/aws-java-sdk/bundle-2.15.40.jar
  ```
  @mohamed.jouini.pro: I can create a table using the AWS Glue catalog, but the problem seems to be related to `DynamoDbCatalog`
  @russell.spitzer: Neither of those libs have runtime deps according to maven, although it is amazing to me that bundle is 246 MB!
  @mohamed.jouini.pro: Reading the error message, it looks like Iceberg can't convert table-name to a `String` or something like that. According to the documentation, Iceberg can use a default DynamoDB table name, `iceberg`, if it's not set, but I still get the same output
  @russell.spitzer: The error message here ```java.lang.NoSuchMethodError: org.apache.iceberg.aws.AwsProperties.dynamoDbTableName()Ljava/lang/String```
  @russell.spitzer: Says it's looking for a method called dynamoDbTableName() which returns a String and is not finding it on the classpath
  @russell.spitzer: Specifically this one
  @russell.spitzer: Is EMR going to put that older "iceberg-spark3-runtime" jar on the classpath even if you don't add it to jars? Because that could cause the issue, since the method doesn't exist in 0.11.1
  @russell.spitzer: ```-rw-r--r-- 1 root root 19201102 Aug 10 15:49 /usr/share/aws/aws-java-sdk/iceberg-spark3-runtime-0.11.1.jar << That one```
  @russell.spitzer: Also, shouldn't you be using the 0.12 iceberg-spark3-runtime jar, not iceberg-spark-runtime?
  @mohamed.jouini.pro: @russell.spitzer, the same errors when using spark3 runtime
  @russell.spitzer: yes but are you sure the 0.11.1 jar isn't on the classpath?
  @russell.spitzer: That would cause the error
  @mohamed.jouini.pro: Ahh, let me remove the old one from aws classpath
  @mohamed.jouini.pro: Thx @russell.spitzer. After removing the old runtime jars, I see that Iceberg can create the DynamoDB catalog table, but I can't create a table
  ```
  spark.sql("CREATE TABLE iceberg_dynamo_poc.iceberg_dynamo_poc.dynamo1 ( \
      id bigint, \
      pathId string \
      ) \
      USING iceberg")
  ```
  ```
  org.apache.iceberg.exceptions.NoSuchNamespaceException: Cannot find default warehouse location: namespace iceberg_dynamo_poc does not exist
    at org.apache.iceberg.aws.dynamodb.DynamoDbCatalog.defaultWarehouseLocation(DynamoDbCatalog.java:158)
    at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:211)
    at org.apache.iceberg.CachingCatalog$CachingTableBuilder.lambda$create$0(CachingCatalog.java:212)
    at org.apache.iceberg.shaded.com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2344)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
  ```
  @russell.spitzer: ?
  @mohamed.jouini.pro: Iceberg has to create the namespace `iceberg_dynamo_poc` if doesn't exist, right ?
  @russell.spitzer: You would need to make it if it doesn't exist
  @russell.spitzer: looking briefly at the docs, it seems like they don't use a namespace in their table references
  @russell.spitzer: just catalog.tableName
  @mohamed.jouini.pro: ```
  spark.sql("CREATE TABLE iceberg_dynamo_poc.dynamo1 ( \
      id bigint, \
      pathId string \
      ) \
      USING iceberg")
  ```
  ```
  Py4JJavaError: An error occurred while calling o554.sql.
  : org.apache.iceberg.exceptions.ValidationException: Table namespace must not be empty: dynamo1
  ```
  @russell.spitzer: i believe that is because you didn't set a location? @jackye ^
  @mohamed.jouini.pro: I think it should create a namespace, please
  @mohamed.jouini.pro: But it should use the `catalog.warehouse` as the default table location
  @russell.spitzer: I don't know how they set this up, but in most systems I would imagine you would have to first CREATE DATABASE before putting a table in it, but I don't know how the DynamoCatalog was configured
  @dweeks: @russell.spitzer is correct
  @jackye: reading the threads now
  @dweeks: The namespace needs to exist via a `create database <catalog>.<database>`
  @jackye: +1 for what Daniel says, a database needs to be created before table creation in Spark.
  @dweeks: I think there is some strangeness in spark in terms of catalogs and databases as well.
  @jackye: for ```AWS_PACKAGES=( "bundle" "url-connection-client" )``` the intention was that `bundle` includes all AWS service packages and works if you just want to test on EMR with a bootstrap script, but for a production-ready script you can replace `bundle` with the list of AWS service packages you actually use.
  @dweeks: For example, I believe the `use` command can be in context of catalog or database
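A short sketch of the sequence described above (Java here, but the SQL is identical from PySpark); it assumes the `iceberg_dynamo_poc` catalog is configured as earlier in the thread, and the warehouse path is a placeholder:
```
import org.apache.spark.sql.SparkSession;

public class DynamoCatalogNamespace {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .config("spark.sql.catalog.iceberg_dynamo_poc", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.iceberg_dynamo_poc.catalog-impl",
            "org.apache.iceberg.aws.dynamodb.DynamoDbCatalog")
        .config("spark.sql.catalog.iceberg_dynamo_poc.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.iceberg_dynamo_poc.dynamodb.table-name", "IcebergCatalog")
        .config("spark.sql.catalog.iceberg_dynamo_poc.warehouse", "s3://my-bucket/warehouse")  // placeholder
        .getOrCreate();

    // The namespace (database) must exist before a table can be created in it
    spark.sql("CREATE DATABASE IF NOT EXISTS iceberg_dynamo_poc.db1");
    spark.sql("CREATE TABLE iceberg_dynamo_poc.db1.dynamo1 (id bigint, pathId string) USING iceberg");
  }
}
```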
  @mohamed.jouini.pro: Thx, it works after creating the database. It's not the same behavior as when working with AWS Glue; I think the namespace is created there if it doesn't exist. Thank you @russell.spitzer, @dweeks, @jackye, @blue
  @jackye: I think you might already have a database of that name in Glue in the past, that’s why you did not need to create it. Otherwise the behavior should be consistent across all Iceberg catalog implementations
  @mohamed.jouini.pro: I will test it again with glue
@gsreeramkumar: Hello folks! Is there a way to specify, while writing a Row to an Iceberg Table from Spark, *that a specific column is non-existent for that Row*? My question is NOT about `null` but about `non-existent`. And if yes, is there a way to consume this in Spark? Truly appreciate any help!
