[incubator-iceberg] branch master updated: Fix spelling, grammar, and formatting in the spec (#308)

blue Wed, 24 Jul 2019 12:04:50 -0700

This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-iceberg.git



The following commit(s) were added to refs/heads/master by this push:
     new 66fa048  Fix spelling, grammar, and formatting in the spec (#308)
66fa048 is described below

commit 66fa0482483e459b214acca055a5164ee44db3ac
Author: David Christle <[email protected]>
AuthorDate: Wed Jul 24 12:04:32 2019 -0700

    Fix spelling, grammar, and formatting in the spec (#308)
---
 site/docs/spec.md | 82 +++++++++++++++++++++++++++----------------------------
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/site/docs/spec.md b/site/docs/spec.md
index afed11f..a5e638e 100644
--- a/site/docs/spec.md
+++ b/site/docs/spec.md
@@ -36,25 +36,25 @@ If the snapshot on which an update is based is no longer 
current, the writer mus
 
 Iceberg only requires that file systems support the following operations:
 
-* **In-place write**: files are not moved or altered once they are written
-* **Seekable reads**: data file formats require seek support
-* **Deletes**: tables delete files that are no longer used
+* **In-place write** -- Files are not moved or altered once they are written.
+* **Seekable reads** -- Data file formats require seek support.
+* **Deletes** -- Tables delete files that are no longer used.
 
 These requirements are compatible with object stores, like S3.
 
 Tables do not require random-access writes. Once written, data and metadata 
files are immutable until they are deleted.
 
-Tables do not require rename, except fo rtables that use atomic rename to 
implement the commit operation for new metadata files.
+Tables do not require rename, except for tables that use atomic rename to 
implement the commit operation for new metadata files.
 
 ## Specification
 
 #### Terms
 
-* **Schema** -- names and types of fields in a table
-* **Partition spec** -- a definition of how partition values are derived from 
data fields
-* **Snapshot** -- the state of a table at some point in time, including the 
set of all data files
-* **Manifest** -- a file that lists data files; a subset of a snapshot
-* **Manifest list** -- a file that lists manifest files; one per snapshot
+* **Schema** -- Names and types of fields in a table.
+* **Partition spec** -- A definition of how partition values are derived from 
data fields.
+* **Snapshot** -- The state of a table at some point in time, including the 
set of all data files.
+* **Manifest** -- A file that lists data files; a subset of a snapshot.
+* **Manifest list** -- A file that lists manifest files; one per snapshot.
 
 ### Schemas and Data Types
 
@@ -241,16 +241,16 @@ Notes:
 
 A snapshot consists of the following fields:
 
-*   **`snapshot-id`**: a unique long ID.
-*   **`parent-snapshot-id`**: (optional) the snapshot ID of the snapshot’s 
parent. This field is not present for snapshots that have no parent snapshot, 
such as snapshots created before this field was added or the first snapshot of 
a table.
-*   **`timestamp-ms`**: a timestamp when the snapshot was created. This is 
used when garbage collecting snapshots.
-*   **`manifests`**: a list of manifest file locations. The data files in a 
snapshot are the union of all data files listed in these manifests. (Deprecated 
in favor of `manifest-list`)
-*   **`manifest-list`**: (optional) the location of a manifest list file for 
this snapshot, which contains a list of manifest files with additional 
metadata. If present, the manifests field must be omitted.
-*   **`summary`**: (optional) a summary that encodes the `operation` that 
produced the snapshot and other relevant information specific to that 
operation. This allows some operations like snapshot expiration to skip 
processing some snapshots. Possible values of `operation` are:
-    *   `append`: data files were added and no files were removed.
-    *   `replace`: data files were rewritten with the same data; i.e., 
compaction, changing the data file format, or relocating data files.
-    *   `overwrite`: data files were deleted and added in a logical overwrite 
operation.
-    *   `delete`: data files were removed and their contents logically deleted.
+*   **`snapshot-id`** -- A unique long ID.
+*   **`parent-snapshot-id`** -- (Optional) The snapshot ID of the snapshot’s 
parent. This field is not present for snapshots that have no parent snapshot, 
such as snapshots created before this field was added or the first snapshot of 
a table.
+*   **`timestamp-ms`** -- A timestamp when the snapshot was created. This is 
used when garbage collecting snapshots.
+*   **`manifests`** -- A list of manifest file locations. The data files in a 
snapshot are the union of all data files listed in these manifests. (Deprecated 
in favor of `manifest-list`)
+*   **`manifest-list`** -- (Optional) The location of a manifest list file for 
this snapshot, which contains a list of manifest files with additional 
metadata. If present, the manifests field must be omitted.
+*   **`summary`** -- (Optional) A summary that encodes the `operation` that 
produced the snapshot and other relevant information specific to that 
operation. This allows some operations like snapshot expiration to skip 
processing some snapshots. Possible values of `operation` are:
+    *   `append` -- Data files were added and no files were removed.
+    *   `replace` -- Data files were rewritten with the same data; i.e., 
compaction, changing the data file format, or relocating data files.
+    *   `overwrite` -- Data files were deleted and added in a logical 
overwrite operation.
+    *   `delete` -- Data files were removed and their contents logically 
deleted.
 
 Snapshots can be split across more than one manifest. This enables:
 
@@ -265,7 +265,7 @@ Valid snapshots are stored as a list in table metadata. For 
serialization, see A
 
 Scans are planned by reading the manifest files for the current snapshot 
listed in the table metadata. Deleted entries in a manifest are not included in 
the scan.
 
-For each manifest, scan predicates, that filter data rows, are converted to 
partition predicates, that filter data files, and used to select the data files 
in the manifest. This conversion uses the partition spec used to write the 
manifest file.
+For each manifest, scan predicates, which filter data rows, are converted to 
partition predicates, which filter data files. These partition predicates are 
used select the data files in the manifest. This conversion uses the partition 
spec used to write the manifest file.
 
 Scan predicates are converted to partition predicates using an inclusive 
projection: if a scan predicate matches a row, then the partition predicate 
must match that row’s partition. This is an _inclusive projection_ [1] because 
rows that do not match the scan predicate may be included in the scan by the 
partition predicate.
 
@@ -298,7 +298,7 @@ Manifest list files store `manifest_file`, a struct with 
the following fields:
 
 | Field id, name          | Type                    | Description              
                                                                   |
 
|-------------------------|-------------------------|---------------------------------------------------------------------------------------------|
-| **`509 contains_null`** | `boolean`               | Whether the manifest 
contains at least one partition with a null value for the field        |
+| **`509 contains_null`** | `boolean`               | Whether the manifest 
contains at least one partition with a null value for the field.       |
 | **`510 lower_bound`**   | `optional bytes`    [1] | Lower bound for the 
non-null values in the partition field, or null if all values are null. |
 | **`511 upper_bound`**   | `optional bytes`    [1] | Upper bound for the 
non-null values in the partition field, or null if all values are null. |
 
@@ -403,17 +403,17 @@ Maps with non-string keys must use an array 
representation with the `map` logica
 |**`float`**|`float`||
 |**`double`**|`double`||
 |**`decimal(P,S)`**|`{ "type": "fixed",`<br />&nbsp;&nbsp;`"size": 
minBytesRequired(P),`<br />&nbsp;&nbsp;`"logicalType": "decimal",`<br 
/>&nbsp;&nbsp;`"precision": P,`<br />&nbsp;&nbsp;`"scale": S }`|Stored as fixed 
using the minimum number of bytes for the given precision.|
-|**`date`**|`{ "type": "int",`<br />&nbsp;&nbsp;`"logicalType": "date" 
}`|Stores days from the 1970-01-01|
-|**`time`**|`{ "type": "long",`<br />&nbsp;&nbsp;`"logicalType": "time-micros" 
}`|Stores microseconds from midnight|
-|**`timestamp`**|`{ "type": "long",`<br />&nbsp;&nbsp;`"logicalType": 
"timestamp-micros",`<br />&nbsp;&nbsp;`"adjust-to-utc": false }`|Stores 
microseconds from 1970-01-01 00:00:00.000000|
-|**`timestamptz`**|`{ "type": "long",`<br />&nbsp;&nbsp;`"logicalType": 
"timestamp-micros",`<br />&nbsp;&nbsp;`"adjust-to-utc": true }`|Stores 
microseconds from 1970-01-01 00:00:00.000000 UTC|
+|**`date`**|`{ "type": "int",`<br />&nbsp;&nbsp;`"logicalType": "date" 
}`|Stores days from the 1970-01-01.|
+|**`time`**|`{ "type": "long",`<br />&nbsp;&nbsp;`"logicalType": "time-micros" 
}`|Stores microseconds from midnight.|
+|**`timestamp`**|`{ "type": "long",`<br />&nbsp;&nbsp;`"logicalType": 
"timestamp-micros",`<br />&nbsp;&nbsp;`"adjust-to-utc": false }`|Stores 
microseconds from 1970-01-01 00:00:00.000000.|
+|**`timestamptz`**|`{ "type": "long",`<br />&nbsp;&nbsp;`"logicalType": 
"timestamp-micros",`<br />&nbsp;&nbsp;`"adjust-to-utc": true }`|Stores 
microseconds from 1970-01-01 00:00:00.000000 UTC.|
 |**`string`**|`string`||
 |**`uuid`**|`{ "type": "fixed",`<br />&nbsp;&nbsp;`"size": 16,`<br 
/>&nbsp;&nbsp;`"logicalType": "uuid" }`||
 |**`fixed(L)`**|`{ "type": "fixed",`<br />&nbsp;&nbsp;`"size": L }`||
 |**`binary`**|`bytes`||
 |**`struct`**|`record`||
 |**`list`**|`array`||
-|**`map`**|`array` of key-value records, or `map` when keys are strings 
(optional)|Array storage must use logical type name `map` and must store 
elements that are 2-field records. The first field is a non-null key and the 
second field is the value.|
+|**`map`**|`array` of key-value records, or `map` when keys are strings 
(optional).|Array storage must use logical type name `map` and must store 
elements that are 2-field records. The first field is a non-null key and the 
second field is the value.|
 
 
 **Field IDs**
@@ -449,17 +449,17 @@ Lists must use the [3-level 
representation](https://github.com/apache/parquet-fo
 | **`float`**        | `float`                                                 
           |                                             |                      
                                          |
 | **`double`**       | `double`                                                
           |                                             |                      
                                          |
 | **`decimal(P,S)`** | `P <= 9`: `int32`,<br />`P <= 18`: `int64`,<br 
/>`fixed` otherwise | `DECIMAL(P,S)`                              | Fixed must 
use the minimum number of bytes that can store `P`. |
-| **`date`**         | `int32`                                                 
           | `DATE`                                      | Stores days from the 
1970-01-01                                |
-| **`time`**         | `int64`                                                 
           | `TIME_MICROS` with `adjustToUtc=false`      | Stores microseconds 
from midnight                              |
-| **`timestamp`**    | `int64`                                                 
           | `TIMESTAMP_MICROS` with `adjustToUtc=false` | Stores microseconds 
from 1970-01-01 00:00:00.000000            |
-| **`timestamptz`**  | `int64`                                                 
           | `TIMESTAMP_MICROS` with `adjustToUtc=true`  | Stores microseconds 
from 1970-01-01 00:00:00.000000 UTC        |
-| **`string`**       | `binary`                                                
           | `UTF8`                                      | Encoding must be 
UTF-8                                         |
+| **`date`**         | `int32`                                                 
           | `DATE`                                      | Stores days from the 
1970-01-01.                               |
+| **`time`**         | `int64`                                                 
           | `TIME_MICROS` with `adjustToUtc=false`      | Stores microseconds 
from midnight.                             |
+| **`timestamp`**    | `int64`                                                 
           | `TIMESTAMP_MICROS` with `adjustToUtc=false` | Stores microseconds 
from 1970-01-01 00:00:00.000000.           |
+| **`timestamptz`**  | `int64`                                                 
           | `TIMESTAMP_MICROS` with `adjustToUtc=true`  | Stores microseconds 
from 1970-01-01 00:00:00.000000 UTC.       |
+| **`string`**       | `binary`                                                
           | `UTF8`                                      | Encoding must be 
UTF-8.                                        |
 | **`uuid`**         | `fixed_len_byte_array[16]`                              
           | `UUID`                                      |                      
                                          |
 | **`fixed(L)`**     | `fixed_len_byte_array[L]`                               
           |                                             |                      
                                          |
 | **`binary`**       | `binary`                                                
           |                                             |                      
                                          |
 | **`struct`**       | `group`                                                 
           |                                             |                      
                                          |
-| **`list`**         | `3-level list`                                          
           | `LIST`                                      | See Parquet docs for 
3-level representation                    |
-| **`map`**          | `3-level map`                                           
           | `MAP`                                       | See Parquet docs for 
3-level representation                    |
+| **`list`**         | `3-level list`                                          
           | `LIST`                                      | See Parquet docs for 
3-level representation.                   |
+| **`map`**          | `3-level map`                                           
           | `MAP`                                       | See Parquet docs for 
3-level representation.                   |
 
 
 ### ORC
@@ -469,27 +469,27 @@ Lists must use the [3-level 
representation](https://github.com/apache/parquet-fo
 | Type               | ORC type    | Notes                                     
                                              |
 
|--------------------|-------------|-----------------------------------------------------------------------------------------|
 | **`boolean`**      | `boolean`   |                                           
                                              |
-| **`int`**          | `int`       | ORC tinyint and smallint would map to int 
also.                                         |
+| **`int`**          | `int`       | ORC `tinyint` and `smallint` would also 
map to **`int`**.                               |
 | **`long`**         | `long`      |                                           
                                              |
 | **`float`**        | `float`     |                                           
                                              |
 | **`double`**       | `double`    |                                           
                                              |
 | **`decimal(P,S)`** | `decimal`   |                                           
                                              |
 | **`date`**         | `date`      |                                           
                                              |
-| **`time`**         | `int`       | Stores microseconds from midnight         
                                              |
+| **`time`**         | `int`       | Stores microseconds from midnight.        
                                              |
 | **`timestamp`**    | `timestamp` |                                           
                                              |
-| **`timestamptz`**  | `struct`    | We should add this to ORC’s type model. 
(ORC-294)                                       |
-| **`string`**       | `string`    | ORC varchar and char would map to Iceberg 
string too.                                   |
+| **`timestamptz`**  | `struct`    | We should add this to ORC’s type model 
(ORC-294).                                       |
+| **`string`**       | `string`    | ORC `varchar` and `char` would also map 
to **`string`**.                                |
 | **`uuid`**         | `binary`    |                                           
                                              |
-| **`fixed(L)`**     | `binary`    | The length would not be checked by the 
ORC reader and should be checked by the adaptor. |
+| **`fixed(L)`**     | `binary`    | The length would not be checked by the 
ORC reader and should be checked by the adapter. |
 | **`binary`**       | `binary`    |                                           
                                              |
-| **`struct`**       | `struct`    | ORC uniontype would map to struct also.   
                                              |
+| **`struct`**       | `struct`    | ORC `uniontype` would also map to 
**`struct`**.                                         |
 | **`list`**         | `array`     |                                           
                                              |
 | **`map`**          | `map`       |                                           
                                              |
 
 
-One of the interesting challenges with this is how to map Iceberg’s schema 
evolution (id based) on to ORC’s (name based). In theory we could use Iceberg’s 
column ids as the column and field names, but that would suck from a user’s 
point of view. 
+One of the interesting challenges with this is how to map Iceberg’s schema 
evolution (id based) on to ORC’s (name based). In theory, we could use 
Iceberg’s column ids as the column and field names, but that would suck from a 
user’s point of view. 
 
-The column ids would be stored in ORC’s user metadata as “iceberg.column.id” 
with a comma separated list of the ids.
+The column ids would be stored in ORC’s user metadata as `iceberg.column.id` 
with a comma separated list of the ids.
 
 Iceberg would build the desired reader schema with their schema evolution 
rules and pass that down to the ORC reader, which would then use its schema 
evolution to map that to the writer’s schema. Basically, Iceberg would need to 
change the names of columns and fields to get the desired mapping.
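
Note on the `iceberg.column.id` paragraph at the end of the diff: the Python sketch below is not part of the commit. It shows one way a comma-separated id list could be written to and read back from a key-value metadata map. The helper names `encode_column_ids` and `decode_column_ids`, the use of a plain dict in place of ORC user metadata, and the ordering of the ids are assumptions for illustration. The spec text only says that the ids are stored under `iceberg.column.id` as a comma-separated list.

```python
from typing import Dict, List

# Metadata key named in the spec text above.
COLUMN_ID_KEY = "iceberg.column.id"


def encode_column_ids(ids: List[int]) -> Dict[str, str]:
    """Return a metadata entry holding the ids as a comma-separated string.

    Hypothetical helper: a plain dict stands in for ORC user metadata.
    """
    return {COLUMN_ID_KEY: ",".join(str(i) for i in ids)}


def decode_column_ids(metadata: Dict[str, str]) -> List[int]:
    """Parse the comma-separated id list back into integers.

    Returns an empty list when the key is missing or the value is empty.
    """
    raw = metadata.get(COLUMN_ID_KEY, "")
    return [int(part) for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    meta = encode_column_ids([1, 2, 5, 6])
    print(meta)
    print(decode_column_ids(meta))
```

A reader built along these lines could look up each column's id in this list rather than rely on column names. How the ids line up with ORC columns and nested fields is not specified in the text above.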

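Note on the scan planning hunk: the Python sketch below, again not part of the commit, illustrates the inclusive projection property quoted there. If a scan predicate matches a row, the projected partition predicate must match that row's partition. The day-granularity partition transform, the `ts >= lower` predicate shape, and all function names are assumptions chosen to keep the example small.

```python
from typing import Callable

MICROS_PER_DAY = 86_400_000_000


def day_transform(ts_micros: int) -> int:
    """Hypothetical partition transform: microseconds since epoch -> day ordinal."""
    return ts_micros // MICROS_PER_DAY


def scan_predicate_gte(lower_ts_micros: int) -> Callable[[int], bool]:
    """Row-level predicate: ts >= lower."""
    return lambda ts_micros: ts_micros >= lower_ts_micros


def project_inclusive_gte(lower_ts_micros: int) -> Callable[[int], bool]:
    """Partition-level predicate derived from ts >= lower.

    Because day_transform is monotonic, any row with ts >= lower has
    day_transform(ts) >= day_transform(lower), so no matching row is lost.
    """
    lower_day = day_transform(lower_ts_micros)
    return lambda partition_day: partition_day >= lower_day


if __name__ == "__main__":
    lower = 10 * MICROS_PER_DAY + 5
    rows = [lower - 1, lower, lower + MICROS_PER_DAY, 9 * MICROS_PER_DAY]
    row_pred = scan_predicate_gte(lower)
    part_pred = project_inclusive_gte(lower)
    for ts in rows:
        print(ts, row_pred(ts), part_pred(day_transform(ts)))
```

The row at `lower - 1` fails the scan predicate but sits in a partition that the projected predicate selects. This is why the projection is inclusive: selected files can contain non-matching rows, so the row-level filter still has to run.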