aokolnychyi commented on a change in pull request #678: Add Java code examples and update site docs URL: https://github.com/apache/incubator-iceberg/pull/678#discussion_r398928391
########## File path: examples/README.md ##########
@@ -0,0 +1,198 @@

# Iceberg Java API Examples (with Spark)

## About
Welcome! :smile:

If you've stumbled across this module, hopefully you're looking for guidance on how to get started with the [Apache Iceberg](https://iceberg.apache.org/) table format. This module collects code examples showing how to use the Iceberg Java API with Spark, along with some extra detail here in the README.

The examples are structured as JUnit tests that you can download and run locally if you want to experiment with Iceberg yourself.

## Using Iceberg
### Maven
If you'd like to try out Iceberg in your own project using Spark, you can use the `iceberg-spark-runtime` dependency:
```xml
<dependency>
  <groupId>org.apache.iceberg</groupId>
  <artifactId>iceberg-spark-runtime</artifactId>
  <version>${iceberg.version}</version>
</dependency>
```

You'll also need `spark-sql`:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>2.4.4</version>
</dependency>
```

### Gradle
To add a dependency on Iceberg in Gradle, add the following to `build.gradle`:
```
dependencies {
  compile 'org.apache.iceberg:iceberg-core:0.7.0-incubating'
}
```

## Key features investigated
The following sections break down the different areas of Iceberg explored in the examples, with links to the code and extra information that may be useful for new users.

### Writing data to tables
There are multiple ways of creating tables with Iceberg, including using the Hive Metastore to keep track of tables ([HiveCatalog](https://iceberg.apache.org/api-quickstart/#using-a-hive-catalog)) or using HDFS / your local file system ([HadoopTables](https://iceberg.incubator.apache.org/api-quickstart/#using-hadoop-tables)) to store the tables.
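As a quick sketch of the `HadoopTables` approach described above, creating a partitioned table might look like the following. The schema, column names, and table path here are illustrative assumptions, not taken from the examples themselves:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class CreateTableSketch {
  public static void main(String[] args) {
    // Define the table schema using the Iceberg type system.
    Schema schema = new Schema(
        Types.NestedField.required(1, "title", Types.StringType.get()),
        Types.NestedField.optional(2, "published_month", Types.StringType.get()));

    // Partition by the published_month column (identity partitioning).
    PartitionSpec spec = PartitionSpec.builderFor(schema)
        .identity("published_month")
        .build();

    // HadoopTables stores table metadata on the local file system (or HDFS).
    // The path is a hypothetical example location.
    HadoopTables tables = new HadoopTables(new Configuration());
    Table table = tables.create(schema, spec, "/tmp/iceberg/books");

    // Data can then be appended with a Spark DataFrameWriter, e.g.:
    //   df.write().format("iceberg").mode("append").save("/tmp/iceberg/books");
  }
}
```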
However, note that directory tables (such as those using `HadoopTables`) don't support all catalog operations, like rename, and therefore use the `Tables` interface instead of the `Catalog` interface.
Hadoop tables _shouldn't_ be used with file systems that do not support atomic rename, as Iceberg depends on atomic rename to synchronize concurrent commits.
To limit complexity, these examples create tables on your local file system using the `HadoopTables` class.

To write Iceberg tables, you use the Iceberg API to create a `Schema` and a `PartitionSpec`, which you then use with a Spark `DataFrameWriter` to create an Iceberg `Table`.

Code examples:
- [Unpartitioned tables](src/test/java/WriteToUnpartitionedTableTest.java)
- [Partitioned tables](src/test/java/WriteToPartitionedTableTest.java)

#### A quick look at file structures
It can be interesting to note that when writing partitioned data, Iceberg lays out your files in a similar manner to Hive:

```
├── data
│   ├── published_month=2017-09
│   │   └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00002.parquet
│   ├── published_month=2018-09
│   │   └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00001.parquet
│   ├── published_month=2018-11
│   │   └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00000.parquet
│   └── published_month=null
│       └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00003.parquet
└── metadata
    └── version-hint.text
```

**WARNING**
Note that you cannot simply drag-and-drop data files into an Iceberg table directory like the one shown above and expect to see your data in the table: Iceberg only reads files that are tracked in its metadata.

Review comment: Good point!
