Repository: parquet-format Updated Branches: refs/heads/master 84460c5a1 -> 3b04d86a2
PARQUET-1031: Fix spelling errors, whitespace, GitHub urls rebased pull request https://github.com/apache/parquet-format/pull/39, fixed minor spelling mistake and the travis-ci URLs (which also pointed to the Parquet/parquet-format one). @Mistobaan please let me know if you would like to reclaim the original pull request. Author: Fabrizio (Misto) Milo <[email protected]> Author: Anna Szonyi <[email protected]> Closes #59 from commanderofthegrey/parquet-1031 and squashes the following commits: e61c3b5 [Anna Szonyi] add back uncompressed_page_size 1cb8163 [Anna Szonyi] PARQUET-1031: Fix spelling errors, whitespace, GitHub urls 67f0064 [Fabrizio (Misto) Milo] explicit that the length has no sign e901ded [Fabrizio (Misto) Milo] fix misspells ceda268 [Fabrizio (Misto) Milo] remove spaces e1f9479 [Fabrizio (Misto) Milo] fix mispell Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/3b04d86a Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/3b04d86a Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/3b04d86a Branch: refs/heads/master Commit: 3b04d86a251fc9409b65df80c79c24de24298e1a Parents: 84460c5 Author: Fabrizio (Misto) Milo <[email protected]> Authored: Wed Oct 11 08:53:21 2017 -0700 Committer: Ryan Blue <[email protected]> Committed: Wed Oct 11 08:53:21 2017 -0700 ---------------------------------------------------------------------- Encodings.md | 42 ++++++++-------- README.md | 99 +++++++++++++++++++------------------ src/main/thrift/parquet.thrift | 40 +++++++-------- 3 files changed, 91 insertions(+), 90 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/parquet-format/blob/3b04d86a/Encodings.md ---------------------------------------------------------------------- diff --git a/Encodings.md b/Encodings.md index 7cf880e..0450588 100644 --- a/Encodings.md +++ b/Encodings.md @@ -27,9 +27,9 @@ This file contains the specification of all supported encodings. Supported Types: all This is the plain encoding that must be supported for types. It is -intended to be the simplest encoding. Values are encoded back to back. +intended to be the simplest encoding. Values are encoded back to back. -The plain encoding is used whenever a more efficient encoding can not be used. It +The plain encoding is used whenever a more efficient encoding can not be used. It stores the data in the following format: - BOOLEAN: [Bit Packed](#RLE), LSB first - INT32: 4 bytes little endian @@ -41,16 +41,16 @@ stores the data in the following format: - FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this outputs the data as little endian. Floating - point types are encoded in IEEE. + point types are encoded in IEEE. For the byte array type, it encodes the length as a 4 byte little endian, followed by the bytes. ### Dictionary Encoding (PLAIN_DICTIONARY = 2) -The dictionary encoding builds a dictionary of values encountered in a given column. The +The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the [RLE/Bit-Packing Hybrid](#RLE) encoding. If the dictionary grows too big, whether in size -or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is +or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk. Dictionary page format: the entries in the dictionary - in dictionary order - using the [plain](#PLAIN) encoding. @@ -64,16 +64,16 @@ This encoding uses a combination of bit-packing and run length encoding to more The grammar for this encoding looks like this, given a fixed bit-width known in advance: ``` rle-bit-packed-hybrid: <length> <encoded-data> -length := length of the <encoded-data> in bytes stored as 4 bytes little endian +length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32) encoded-data := <run>* -run := <bit-packed-run> | <rle-run> -bit-packed-run := <bit-packed-header> <bit-packed-values> -bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1) -// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8 -bit-pack-count := (number of values in this run) / 8 -bit-packed-values := *see 1 below* -rle-run := <rle-header> <repeated-value> -rle-header := varint-encode( (number of times repeated) << 1) +run := <bit-packed-run> | <rle-run> +bit-packed-run := <bit-packed-header> <bit-packed-values> +bit-packed-header := varint-encode(<bit-pack-count> << 1 | 1) +// we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8 +bit-pack-count := (number of values in this run) / 8 +bit-packed-values := *see 1 below* +rle-run := <rle-header> <repeated-value> +rle-header := varint-encode( (number of times repeated) << 1) repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width) ``` @@ -82,14 +82,14 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex though the order of the bits in each value remains in the usual order of most significant to least significant. For example, to pack the same values as the example in the deprecated encoding above: - The numbers 1 through 7 using bit width 3: + The numbers 1 through 7 using bit width 3: ``` dec value: 0 1 2 3 4 5 6 7 bit value: 000 001 010 011 100 101 110 111 bit label: ABC DEF GHI JKL MNO PQR STU VWX ``` - - would be encoded like this where spaces mark byte boundaries (3 bytes): + + would be encoded like this where spaces mark byte boundaries (3 bytes): ``` bit value: 10001000 11000110 11111010 bit label: HIDEFABC RMNOJKLG VWXSTUPQ @@ -114,13 +114,13 @@ This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the [RLE/bit-packing](#RLE) hybrid. -For example, the numbers 1 through 7 using bit width 3: +For example, the numbers 1 through 7 using bit width 3: ``` dec value: 0 1 2 3 4 5 6 7 bit value: 000 001 010 011 100 101 110 111 bit label: ABC DEF GHI JKL MNO PQR STU VWX ``` -would be encoded like this where spaces mark byte boundaries (3 bytes): +would be encoded like this where spaces mark byte boundaries (3 bytes): ``` bit value: 00000101 00111001 01110111 bit label: ABCDEFGH IJKLMNOP QRSTUVWX @@ -141,7 +141,7 @@ The header is defined as follows: * the total value count is stored as a VLQ int * the first value is stored as a zigzag VLQ int -Each block contains +Each block contains ``` <min delta> <list of bitwidths of miniblocks> <miniblocks> ``` @@ -233,4 +233,4 @@ sequence of strings, store the prefix length of the previous entry plus the suff For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding. This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by -the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY). +the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY). http://git-wip-us.apache.org/repos/asf/parquet-format/blob/3b04d86a/README.md ---------------------------------------------------------------------- diff --git a/README.md b/README.md index 786cddc..c01aec3 100644 --- a/README.md +++ b/README.md @@ -50,11 +50,11 @@ Java resources can be build using `mvn package`. The current stable version shou C++ thrift resources can be generated via make. -Thrift can be also code-genned into any other thrift-supported language. +Thrift can be also code-generated into any other thrift-supported language. ## Glossary - - Block (HDFS block): This means a block in HDFS and the meaning is - unchanged for describing this file format. The file format is + - Block (HDFS block): This means a block in HDFS and the meaning is + unchanged for describing this file format. The file format is designed to work well on top of HDFS. - File: A HDFS file that must include the metadata for the file. @@ -73,7 +73,7 @@ Thrift can be also code-genned into any other thrift-supported language. Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or -more pages. +more pages. ## Unit of parallelization - MapReduce - File/Row Group @@ -101,14 +101,14 @@ This file and the [thrift definition](src/main/thrift/parquet.thrift) should be 4-byte length in bytes of file metadata (little endian) 4-byte magic number "PAR1" -In the above example, there are N columns in this table, split into M row -groups. The file metadata contains the locations of all the column metadata -start locations. More details on what is contained in the metadata can be found +In the above example, there are N columns in this table, split into M row +groups. The file metadata contains the locations of all the column metadata +start locations. More details on what is contained in the metadata can be found in the thrift definition. Metadata is written after the data to allow for single pass writing. -Readers are expected to first read the file metadata to find all the column +Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.  @@ -146,36 +146,37 @@ documented in [logical-types]: LogicalTypes.md ## Nested Encoding -To encode nested columns, Parquet uses the Dremel encoding with definition and -repetition levels. Definition levels specify how many optional fields in the +To encode nested columns, Parquet uses the Dremel encoding with definition and +repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at what repeated field in the path has the value repeated. The max definition and repetition levels can be computed from the schema (i.e. how much nesting there is). This defines the maximum number of bits required to store the levels (levels are defined for all -values in the column). +values in the column). Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED. ## Nulls -Nullity is encoded in the definition levels (which is run-length encoded). NULL values -are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs +Nullity is encoded in the definition levels (which is run-length encoded). NULL values +are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and -nothing else. +nothing else. ## Data Pages For data pages, the 3 pieces of information are encoded back to back, after the page -header. We have the - - repetition levels data, - - definition levels data, - - encoded values. +header. +In order we have: + 1. repetition levels data + 1. definition levels data + 1. encoded values -The value of `uncompressed_page_size` specified in the header is for all 3 pieces combined. +The value of `uncompressed_page_size` specified in the header is for all the 3 pieces combined. -The data for the data page is always required. The definition and repetition levels +The encoded values for the data page is always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (it would always have the value 1). For data that is required, the definition levels are -skipped (if encoded, it will always have the value of the max definition level). +skipped (if encoded, it will always have the value of the max definition level). For example, in the case where the column is non-nested and required, the data in the page is only the encoded values. @@ -183,53 +184,53 @@ page is only the encoded values. The supported encodings are described in [Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md) ## Column chunks -Column chunks are composed of pages written back to back. The pages share a common -header and readers can skip over pages they are not interested in. The data for the -page follows the header and can be compressed and/or encoded. The compression and +Column chunks are composed of pages written back to back. The pages share a common +header and readers can skip over pages they are not interested in. The data for the +page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata. ## Checksumming -Data pages can be individually checksummed. This allows disabling of checksums at the +Data pages can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. ## Error recovery -If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, -that column chunk is lost (but column chunks for this column in other row groups are -okay). If a page header is corrupt, the remaining pages in that chunk are lost. If -the data within a page is corrupt, that page is lost. The file will be more +If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, +that column chunk is lost (but column chunks for this column in other row groups are +okay). If a page header is corrupt, the remaining pages in that chunk are lost. If +the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups. -Potential extension: With smaller row groups, the biggest issue is placing the file -metadata at the end. If an error happens while writing the file metadata, all the -data written will be unreadable. This can be fixed by writing the file metadata -every Nth row group. -Each file metadata would be cumulative and include all the row groups written so -far. Combining this with the strategy used for rc or avro files using sync markers, -a reader could recover partially written files. +Potential extension: With smaller row groups, the biggest issue is placing the file +metadata at the end. If an error happens while writing the file metadata, all the +data written will be unreadable. This can be fixed by writing the file metadata +every Nth row group. +Each file metadata would be cumulative and include all the row groups written so +far. Combining this with the strategy used for rc or avro files using sync markers, +a reader could recover partially written files. ## Separating metadata and column data. The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata -file reference multiple parquet files. +file reference multiple parquet files. ## Configurations -- Row group size: Larger row groups allow for larger column chunks which makes it -possible to do larger sequential IO. Larger groups also require more buffering in -the write path (or a two pass write). We recommend large row groups (512MB - 1GB). -Since an entire row group might need to be read, we want it to completely fit on -one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An -optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block +- Row group size: Larger row groups allow for larger column chunks which makes it +possible to do larger sequential IO. Larger groups also require more buffering in +the write path (or a two pass write). We recommend large row groups (512MB - 1GB). +Since an entire row group might need to be read, we want it to completely fit on +one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An +optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. -- Data page size: Data pages should be considered indivisible so smaller data pages -allow for more fine grained reading (e.g. single row lookup). Larger page sizes -incur less space overhead (less page headers) and potentially less parsing overhead -(processing headers). Note: for sequential scans, it is not expected to read a page +- Data page size: Data pages should be considered indivisible so smaller data pages +allow for more fine grained reading (e.g. single row lookup). Larger page sizes +incur less space overhead (less page headers) and potentially less parsing overhead +(processing headers). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes. ## Extensibility There are many places in the format for compatible extensions: - File Version: The file metadata contains a version. -- Encodings: Encodings are specified by enum and more can be added in the future. +- Encodings: Encodings are specified by enum and more can be added in the future. - Page types: Additional page types can be added and safely skipped. ## Contributing @@ -238,7 +239,7 @@ Changes to this core format definition are proposed and discussed in depth on th ## Code of Conduct -We hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): <https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md>. +We hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): <https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md>. ## License Copyright 2013 Twitter, Cloudera and other contributors. http://git-wip-us.apache.org/repos/asf/parquet-format/blob/3b04d86a/src/main/thrift/parquet.thrift ---------------------------------------------------------------------- diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 38cddc7..f955347 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -84,12 +84,12 @@ enum ConvertedType { * Stored as days since Unix epoch, encoded as the INT32 physical type. * */ - DATE = 6; + DATE = 6; - /** - * A time + /** + * A time * - * The total number of milliseconds since midnight. The value is stored + * The total number of milliseconds since midnight. The value is stored * as an INT32 physical type. */ TIME_MILLIS = 7; @@ -104,11 +104,11 @@ enum ConvertedType { /** * A date/time combination - * + * * Date and time recorded as milliseconds since the Unix epoch. Recorded as * a physical type of INT64. */ - TIMESTAMP_MILLIS = 9; + TIMESTAMP_MILLIS = 9; /** * A date/time combination @@ -119,11 +119,11 @@ enum ConvertedType { TIMESTAMP_MICROS = 10; - /** - * An unsigned integer value. - * - * The number describes the maximum number of meainful data bits in - * the stored value. 8, 16 and 32 bit values are stored using the + /** + * An unsigned integer value. + * + * The number describes the maximum number of meainful data bits in + * the stored value. 8, 16 and 32 bit values are stored using the * INT32 physical type. 64 bit values are stored using the INT64 * physical type. * @@ -147,29 +147,29 @@ enum ConvertedType { INT_32 = 17; INT_64 = 18; - /** + /** * An embedded JSON document - * + * * A JSON document embedded within a single UTF8 column. */ JSON = 19; - /** + /** * An embedded BSON document - * - * A BSON document embedded within a single BINARY column. + * + * A BSON document embedded within a single BINARY column. */ BSON = 20; /** * An interval of time - * + * * This type annotates data stored as a FIXED_LEN_BYTE_ARRAY of length 12 * This data is composed of three separate little endian unsigned * integers. Each stores a component of a duration of time. The first * integer identifies the number of months associated with the duration, * the second identifies the number of days associated with the duration - * and the third identifies the number of milliseconds associated with + * and the third identifies the number of milliseconds associated with * the provided duration. This duration of time is independent of any * particular timezone or date. */ @@ -419,7 +419,7 @@ enum Encoding { */ PLAIN_DICTIONARY = 2; - /** Group packed run length encoding. Usable for definition/reptition levels + /** Group packed run length encoding. Usable for definition/repetition levels * encoding and Booleans (on one bit: 0 is false; 1 is true.) */ RLE = 3; @@ -508,7 +508,7 @@ struct DictionaryPageHeader { } /** - * New page format alowing reading levels without decompressing the data + * New page format allowing reading levels without decompressing the data * Repetition and definition levels are uncompressed * The remaining section containing the data is compressed if is_compressed is true **/
