[
https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roger Meier resolved THRIFT-111.
--------------------------------
Resolution: Won't Fix
This issue is too old. Please reopen it, or create a new issue with a patch, if
you need this.
See http://thrift.apache.org/docs/HowToContribute/
> TRecordStream: a robust transport for writing records with (optional)
> CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: THRIFT-111
> URL: https://issues.apache.org/jira/browse/THRIFT-111
> Project: Thrift
> Issue Type: New Feature
> Reporter: Pete Wyckoff
> Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc
> circulated on the public thrift lists under the name TRobustOfflineStream in
> May 08 with the addition of the requirement of handling small synchronous
> writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves the following design goals:
> - Be self-describing and extensible. A file containing a TRecordStream
> must contain enough metadata for an application to read it with no other
> context. It should be possible to add new features without breaking
> backwards and forwards compatibility. It should be possible to completely
> change the format without confusing old or new programs.
> - Be robust against disk corruption. All data and metadata must (optionally)
> be checksummed. It must be possible to recover and continue reading
> uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable. TRecordStream will also be used for
> plain-text, line-oriented, human-readable data. Allowing a plain-text,
> line-oriented, human-readable header format will be advantageous for this
> use case.
> - Support asynchronous file I/O. This feature will not be implemented in the
> first version of TRecordStream, but the implementation must support
> the eventual inclusion of this feature.
> - Be performant. No significant sacrifice of speed should be made in order to
> achieve any of the other design goals.
> - Support small synchronous writes.
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer. The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption. If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
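
The binary version prefix described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and raising ValueError on a mismatch is a stand-in for the recovery procedure the design leaves as TODO.

```python
import struct

# The design caps the version at 2^24 - 1 so the first byte is always 0.
MAX_VERSION = 2**24 - 1


def encode_version_prefix(version: int) -> bytes:
    """Return the version as a 32-bit big-endian integer, written twice."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version out of range")
    word = struct.pack(">I", version)
    return word + word


def decode_version_prefix(data: bytes) -> int:
    """Read both copies of the version and reject the stream if they differ."""
    if len(data) < 8:
        raise ValueError("truncated version prefix")
    first, second = struct.unpack(">II", data[:8])
    # Assumption: a mismatch or an out-of-range value is reported as an
    # error here; the design defers actual recovery to a later section.
    if first != second or first > MAX_VERSION:
        raise ValueError("corrupt version prefix")
    return first
```

Because the first byte of a binary stream is always 0 and a plain-text stream starts with "T", a reader could use that byte to tell the two formats apart.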
> Plain-text streams shall begin with the ASCII string "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character. The full
> version line shall be repeated.
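
A minimal Python sketch of the plain-text version lines, assuming (as the binary format does) that a mismatch between the two copies means corruption. The function names are hypothetical.

```python
from typing import Tuple

MAGIC = b"TROS: "


def encode_text_version(version: int) -> bytes:
    """Return the 'TROS: <version>' line, written twice."""
    if version < 0:
        raise ValueError("version must be non-negative")
    line = MAGIC + str(version).encode("ascii") + b"\n"
    return line + line


def decode_text_version(data: bytes) -> Tuple[int, int]:
    """Return (version, number of bytes consumed by both version lines)."""
    end = data.find(b"\n")
    if end < 0 or not data.startswith(MAGIC):
        raise ValueError("missing version line")
    line = data[:end + 1]
    # Assumption: the repeated line must match the first one byte for byte.
    if data[len(line):2 * len(line)] != line:
        raise ValueError("version lines do not match")
    digits = line[len(MAGIC):-1]
    if not digits.isdigit():
        raise ValueError("version is not a decimal number")
    return int(digits), 2 * len(line)
```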
> This document describes version 1 of the format. Version 1 streams are
> composed of a series of chunks. Variable-length chunks are supported, but their
> use is discouraged because they make recovering from corrupt chunk headers
> difficult. Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
> structure (whose exact structure is defined in
> robust_offline_stream.thrift).
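
The binary header framing above can be sketched in Python as follows. Several details are assumptions rather than part of the design: the CRC-32 is taken to be the one zlib computes, it is written big-endian like the length, and "header length + header data" is read as the 4-byte length field concatenated with the header bytes. The header is treated as opaque bytes; serializing it with TBinaryProtocol is out of scope here.

```python
import struct
import zlib


def frame_header(header_bytes: bytes) -> bytes:
    """Return CRC-32, then the 32-bit length, then the header itself."""
    length_field = struct.pack(">I", len(header_bytes))
    # Assumption: the checksum covers the length field and the header data.
    crc = zlib.crc32(length_field + header_bytes) & 0xFFFFFFFF
    return struct.pack(">I", crc) + length_field + header_bytes


def unframe_header(data: bytes) -> bytes:
    """Verify the checksum and return the serialized header bytes."""
    if len(data) < 8:
        raise ValueError("truncated header frame")
    crc, length = struct.unpack(">II", data[:8])
    header = data[8:8 + length]
    if len(header) != length:
        raise ValueError("truncated header data")
    if zlib.crc32(data[4:8] + header) & 0xFFFFFFFF != crc:
        raise ValueError("header checksum mismatch")
    return header
```

Covering the length field with the checksum means a corrupted length is detected before a reader trusts it to locate the sub-chunks.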
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
> unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
> - An entry name, which is an ASCII string consisting of alphanumeric
> characters, dashes ("-"), underscores, and periods (full-stops).
> - A colon followed by a space.
> - An entry value, which is a printable ASCII string not including any
> linefeeds.
> - A linefeed.
> - A linefeed.
> Header entry names may be repeated. The handling of repeated names is
> dependent on the particular name. Unless otherwise specified, all entries
> with a given name other than the last are ignored.
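
A rough Python sketch of writing and parsing the plain-text header follows. It assumes the checksum covers exactly the entry lines (including their linefeeds, excluding the terminating blank line) and that the CRC-32 is the zlib one; the design does not pin down either point. All function names are hypothetical.

```python
import re
import zlib
from typing import List, Optional, Tuple

PREFIX = b"Header-Checksum: "
# Entry name: alphanumerics, dashes, underscores, periods. Value: printable ASCII.
ENTRY_RE = re.compile(rb"([A-Za-z0-9._-]+): ([\x20-\x7e]*)")


def format_text_header(entries: List[Tuple[str, str]]) -> bytes:
    body = b"".join(
        name.encode("ascii") + b": " + value.encode("ascii") + b"\n"
        for name, value in entries
    )
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return PREFIX + b"%08x\n" % crc + body + b"\n"


def parse_text_header(data: bytes) -> Tuple[List[Tuple[str, str]], int]:
    """Return (entries in order, number of bytes consumed)."""
    line_end = data.find(b"\n")
    if line_end < 0 or not data.startswith(PREFIX):
        raise ValueError("missing Header-Checksum line")
    digits = data[len(PREFIX):line_end]
    if not re.fullmatch(rb"[0-9a-fA-F]{8}", digits):
        raise ValueError("malformed checksum")
    blank = data.find(b"\n\n", line_end)  # values cannot contain linefeeds
    if blank < 0:
        raise ValueError("unterminated header")
    body = data[line_end + 1:blank + 1]
    if zlib.crc32(body) & 0xFFFFFFFF != int(digits.decode("ascii"), 16):
        raise ValueError("header checksum mismatch")
    entries = []
    for line in body.split(b"\n")[:-1]:
        match = ENTRY_RE.fullmatch(line)
        if match is None:
            raise ValueError("malformed header entry")
        entries.append((match.group(1).decode("ascii"),
                        match.group(2).decode("ascii")))
    return entries, blank + 2


def last_value(entries: List[Tuple[str, str]], name: str) -> Optional[str]:
    """Default rule for repeated names: only the last entry counts."""
    values = [value for key, value in entries if key == name]
    return values[-1] if values else None
```

Entries such as "Sub-Chunk", which are meaningful when repeated, would need to read the full ordered list rather than go through last_value.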
> The actual data will be stored in sub-chunks, which may optionally be
> compressed. (The chunk header will define the compression format used.) The
> chunk header will specify the following fields for each sub-chunk:
> - (optional) Offset within the chunk. If omitted, it should be assumed to
> immediately follow the previous sub-chunk.
> - (required) Length of the (optionally) compressed sub-chunk. This is the
> physical number of bytes in the stream taken up by the sub-chunk.
> - (optional) Uncompressed length of the sub-chunk. Used as an optimization
> hint.
> - (optional) CRC-32 of the (optionally compressed) sub-chunk.
> - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
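
The sub-chunk rules above might be applied by a reader roughly as in this Python sketch. The SubChunk class and read_sub_chunks function are hypothetical, the CRC-32 is assumed to be the zlib one, and data_start (where the first sub-chunk begins when it carries no offset) is an assumed parameter, since the design does not say where the data starts relative to the chunk header.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubChunk:
    length: int                                  # required: stored byte count
    offset: Optional[int] = None                 # optional: position in chunk
    checksum: Optional[int] = None               # CRC-32 of the stored bytes
    uncompressed_length: Optional[int] = None    # optimization hint only
    uncompressed_checksum: Optional[int] = None  # CRC-32 after decompression


def _crc_matches(data: bytes, expected: int) -> bool:
    # Checksums are i32 in the IDL, so mask to compare as unsigned values.
    return zlib.crc32(data) & 0xFFFFFFFF == expected & 0xFFFFFFFF


def read_sub_chunks(chunk: bytes, headers: List[SubChunk], data_start: int,
                    compression: str = "raw") -> List[bytes]:
    results = []
    cursor = data_start
    for header in headers:
        # An omitted offset means "immediately after the previous sub-chunk".
        start = cursor if header.offset is None else header.offset
        stored = chunk[start:start + header.length]
        if len(stored) != header.length:
            raise ValueError("sub-chunk extends past the end of the chunk")
        if header.checksum is not None and not _crc_matches(stored, header.checksum):
            raise ValueError("sub-chunk checksum mismatch")
        if compression == "raw":
            data = stored  # raw ignores the uncompressed length and checksum
        elif compression == "zlib":
            data = zlib.decompress(stored)
            if (header.uncompressed_checksum is not None
                    and not _crc_matches(data, header.uncompressed_checksum)):
                raise ValueError("uncompressed checksum mismatch")
        else:
            raise ValueError("unknown compression type")
        results.append(data)
        cursor = start + header.length
    return results
```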
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp facebook.thrift.transport.record_stream
> namespace java com.facebook.thrift.transport.record_stream
> namespace python thrift.transport.record_stream
> /*
> * enums in plain-text headers should be represented as strings, not numbers.
> * Each enum value should specify the string used in plain text.
> */
> enum CompressionType {
> /**
> * "raw": No compression.
> *
> * The data written to the TRecordStream object appears byte-for-byte
> * in the stream. Raw format streams ignore the uncompressed length and
> * uncompressed checksum of the sub-chunks. It is strongly advised to use
> * checksums when writing raw sub-chunks.
> */
> COMPRESSION_RAW = 0,
> /**
> * "zlib": zlib compression.
> *
> * The compressed data is a zlib stream compressed with the "deflate"
> * algorithm. This format is specified by RFCs 1950 and 1951, and is
> * produced by zlib's "compress" or "deflate" functions. Note that this is
> * *not* a raw "deflate" stream nor a gzip file.
> */
> COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
> /**
> * (Absent in plain text.) Unspecified record type.
> */
> RECORD_UNKNOWN = 0,
> /**
> * "struct": Thrift structures, serialized back-to-back.
> */
> RECORD_STRUCT = 1,
> /**
> * "call": Thrift method calls, produced by send_method();
> */
> RECORD_CALL = 2,
> /**
> * "lines": Line-oriented text data.
> */
> RECORD_LINES = 3,
> }
> enum ProtocolType {
> /** (Absent in plain text.) */
> PROTOCOL_UNKNOWN = 0,
> /** "binary" */
> PROTOCOL_BINARY = 1,
> /** "dense" */
> PROTOCOL_DENSE = 2,
> /** "json" */
> PROTOCOL_JSON = 3,
> /** "simple_json" */
> PROTOCOL_SIMPLE_JSON = 4,
> /** "csv" */
> PROTOCOL_CSV = 5,
> }
> /**
> * The structure used to represent metadata about a sub-chunk.
> * In plain text, this structure is included as the value of a "Sub-Chunk"
> * header entry. Each of these fields should be included, represented
> * according to the comment for ChunkHeader. Fields should be in order and
> * separated by a single space. Absent fields should be included as a single
> * dash ("-").
> */
> struct SubChunkHeader {
> 1: optional i32 offset;
> 2: required i32 length;
> 3: optional i32 checksum;
> 4: optional i32 uncompressed_length;
> 5: optional i32 uncompressed_checksum;
> }
> /**
> * This is the top-level structure encoded as the chunk header.
> * Unless otherwise specified, each field will be represented in plain text by
> * uppercasing each word in the field name and replacing underscores with
> * hyphens, producing the field name. Integers should be ASCII-encoded
> * decimal, except for checksums which should be ASCII-encoded hexadecimal
> * unsigned.
> */
> struct ChunkHeader {
> /**
> * Number of bytes per chunk.
> * Recommended to be a power of 2.
> */
> 1: required i32 chunk_size;
> /**
> * Type of compression used for sub-chunks.
> * Assumed to be RAW if absent.
> */
> 3: optional CompressionType compression_type = COMPRESSION_RAW;
> /**
> * Type of records encoded in the sub-chunks.
> * This information is made accessible to applications,
> * but is otherwise uninterpreted by the transport.
> */
> 4: optional RecordType record_type = RECORD_UNKNOWN;
> /**
> * Protocol used for serializing records.
> * This information is made accessible to applications,
> * but is otherwise uninterpreted by the transport.
> */
> 5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
> /**
> * The metadata for the individual sub-chunks,
> * in the order they should be read.
> *
> * In the plain-text format, each of these is written as a separate
> * "Sub-Chunk" header entry, in order.
> */
> 2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}
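
The plain-text conventions in the IDL comments above can be sketched in Python as follows. This reads "uppercasing each word" as capitalizing the first letter (consistent with "Sub-Chunk" and "Header-Checksum"), and it zero-pads checksums to eight hex digits as the header checksum does; the comments do not state either detail for sub-chunk values, so both are assumptions. The function names are hypothetical.

```python
from typing import Dict, Optional

# Field order follows the SubChunkHeader struct (field ids 1 through 5).
FIELDS = ("offset", "length", "checksum",
          "uncompressed_length", "uncompressed_checksum")
HEX_FIELDS = ("checksum", "uncompressed_checksum")


def text_field_name(name: str) -> str:
    """chunk_size -> Chunk-Size"""
    return "-".join(word.capitalize() for word in name.split("_"))


def format_sub_chunk(length: int, offset: Optional[int] = None,
                     checksum: Optional[int] = None,
                     uncompressed_length: Optional[int] = None,
                     uncompressed_checksum: Optional[int] = None) -> str:
    """Build the value of a 'Sub-Chunk' header entry."""
    values = dict(zip(FIELDS, (offset, length, checksum,
                               uncompressed_length, uncompressed_checksum)))
    parts = []
    for name in FIELDS:
        value = values[name]
        if value is None:
            parts.append("-")  # absent fields are written as a single dash
        elif name in HEX_FIELDS:
            parts.append("%08x" % (value & 0xFFFFFFFF))  # unsigned hex
        else:
            parts.append(str(value))  # plain decimal
    return " ".join(parts)


def parse_sub_chunk(value: str) -> Dict[str, Optional[int]]:
    parts = value.split(" ")
    if len(parts) != len(FIELDS):
        raise ValueError("expected five space-separated fields")
    parsed = {}
    for name, text in zip(FIELDS, parts):
        if text == "-":
            parsed[name] = None
        else:
            parsed[name] = int(text, 16 if name in HEX_FIELDS else 10)
    if parsed["length"] is None:
        raise ValueError("length is required")
    return parsed
```

For example, a sub-chunk with only a length of 4096 and a checksum of 0xdeadbeef would be written as "Sub-Chunk: - 4096 deadbeef - -" under these assumptions.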