cmathiesen commented on a change in pull request #678: Add Java code examples and update site docs
URL: https://github.com/apache/incubator-iceberg/pull/678#discussion_r362729972
########## File path: examples/README.md ##########
@@ -0,0 +1,182 @@
+# Iceberg Java API Examples (with Spark)
+
+## About
+Welcome! :smile:
+
+If you've stumbled across this module, hopefully you're looking for some guidance on how to get started with the [Apache Iceberg](https://iceberg.apache.org/) table format. This set of classes collects code examples of how to use the Iceberg Java API with Spark, along with some extra detail here in the README.
+
+The examples are structured as JUnit tests that you can download and run locally if you want to mess around with Iceberg yourself.
+
+## Running Iceberg yourself
+If you'd like to try out Iceberg in your own project, you can use the `iceberg-spark-runtime` dependency:
+```xml
+  <dependency>
+    <groupId>org.apache.iceberg</groupId>
+    <artifactId>iceberg-spark-runtime</artifactId>
+    <version>${iceberg.version}</version>
+  </dependency>
+```
+
+You'll also need `spark-sql`:
+```xml
+  <dependency>
+    <groupId>org.apache.spark</groupId>
+    <artifactId>spark-sql_2.12</artifactId>
+    <version>2.4.4</version>
+  </dependency>
+```
+
+## Key features investigated
+The following sections break down the different areas of Iceberg explored in the examples, with links to the code and extra information that could be useful for new users.
+
+### Writing data to tables
+There are multiple ways of creating tables with Iceberg, including using the Hive Metastore to keep track of tables (`HiveCatalog`), or using HDFS or your local file system to store the tables. To limit complexity, these examples create tables on your local file system using the `HadoopTables` class.
+
+To write Iceberg tables you will need to use the Iceberg API to create a `Schema` and `PartitionSpec`, which you use with a Spark `DataFrameWriter` to create an Iceberg `Table`.
+
+Code examples:
+- [Unpartitioned tables](src/test/java/WriteToUnpartitionedTableTest.java)
+- [Partitioned tables](src/test/java/WriteToPartitionedTableTest.java)
+
+#### A quick look at file structures
+It could be interesting to note that when writing partitioned data, Iceberg will lay out your files in a similar manner to Hive:
+
+```
+├── data
+│   ├── published_month=2017-09
+│   │   └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00002.parquet
+│   ├── published_month=2018-09
+│   │   └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00001.parquet
+│   ├── published_month=2018-11
+│   │   └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00000.parquet
+│   └── published_month=null
+│       └── 00000-1-5cbc72f6-7c1a-45e4-bb26-bc30deaca247-00003.parquet
+└── metadata
+    └── version-hint.text
+```
+
+### Reading data from tables
+Reading Iceberg tables is fairly simple using the Spark `DataFrameReader`.
+
+Code examples:
+- [Unpartitioned table](src/test/java/ReadFromUnpartitionedTableTest.java)
+- [Partitioned table](src/test/java/ReadFromPartitionedTableTest.java)
+
+### A look at the metadata
+This section looks a little closer at the metadata produced by Iceberg tables. Consider an example where you've written a single JSON file to a table. Your metadata folder will look something like this:
+
+```
+├── data
+│   └── ...
+└── metadata
+    ├── 51accd1d-39c7-4a6e-8f35-9e05f7c67864-m0.avro
+    ├── snap-1335014336004891572-1-51accd1d-39c7-4a6e-8f35-9e05f7c67864.avro
+    ├── v1.metadata.json
+    ├── v2.metadata.json
+    └── version-hint.text
+```
+
+The metadata for your table is kept in JSON files (`v1.metadata.json` and `v2.metadata.json`). Version 1 of the metadata is written when your table is first created. It contains things like the table location, the schema and the partition spec:

Review comment:
Actually, apologies, would you be able to expand on what you mean by "table is stored in a tree from on of the metadata files"?
I'd like to make sure I'm understanding you correctly :)
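The "Reading data from tables" section of the proposed README could be sketched as follows. This is a hedged example rather than one of the bundled tests; it assumes an Iceberg table already exists at the hypothetical `/tmp/iceberg/books` location, partitioned by a `published_month` column.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadExampleSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .appName("iceberg-read-example")
        .getOrCreate();

    // Path-based load: point the DataFrameReader at the table's root directory.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .load("/tmp/iceberg/books"); // hypothetical table location

    // Filtering on the partition column lets Iceberg use its metadata to
    // prune files that fall outside the requested partition.
    df.filter("published_month = '2018-09'").show();

    spark.stop();
  }
}
```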
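To illustrate the metadata discussion in the README, here is an abridged, hypothetical sketch of what a `v1.metadata.json` written at table creation might contain. Field names follow Iceberg's table metadata format, but the exact contents vary by Iceberg version, and all values here (location, IDs, timestamps) are made up for illustration:

```json
{
  "format-version" : 1,
  "location" : "/tmp/iceberg/books",
  "last-updated-ms" : 1577836800000,
  "schema" : {
    "type" : "struct",
    "fields" : [ {
      "id" : 1,
      "name" : "title",
      "required" : true,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "published_month",
      "required" : false,
      "type" : "string"
    } ]
  },
  "partition-spec" : [ {
    "name" : "published_month",
    "transform" : "identity",
    "source-id" : 2,
    "field-id" : 1000
  } ],
  "current-snapshot-id" : -1,
  "snapshots" : [ ]
}
```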
