RobertIndie commented on code in PR #18434: URL: https://github.com/apache/pulsar/pull/18434#discussion_r1041963617
########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. -| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. 
Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| +| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool | +| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte | +| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short | +| `INT32` | A 32-bit signed integer. | int | int | int32 | int32_t | int | +| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long | +| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float | +| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double | +| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` | Review Comment: ```suggestion | `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], ReadOnlySequence<byte> | ``` Looks weird in the preview: <img width="462" alt="image" src="https://user-images.githubusercontent.com/16974619/206140861-5599dec8-87c1-4dc2-86bd-acd61f469217.png"> ########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. 
-| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| Review Comment: Do we also need to explain what `N/A` means here? 
########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. -| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. 
Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| +| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool | +| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte | +| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short | +| `INT32` | A 32-bit signed integer. | int | int | int32 | int32_t | int | +| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long | +| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float | +| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double | +| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` | +| `STRING` | An Unicode character sequence. | string | str | string| std::string | string | +| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value. | java.sql.Timestamp (java.sql.Time, java.util.Date) | N/A | N/A | N/A | DateTime,TimeSpan | +| `INSTANT`| A single instantaneous point on the timeline with nanoseconds precision. | java.time.Instant | N/A | N/A | N/A | N/A | +| `LOCAL_DATE` | An immutable date-time object that represents a date, often viewed as year-month-day. | java.time.LocalDate | N/A | N/A | N/A | N/A | +| `LOCAL_TIME` | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision. 
| java.time.LocalDateTime | N/A | N/A | N/A | N/A | +| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second. | java.time.LocalTime | N/A | N/A | N/A | N/A | + +:::note -For primitive types, Pulsar does not store any schema data in `SchemaInfo`. The `type` in `SchemaInfo` determines how to serialize and deserialize the data. +Pulsar does not store any schema data in `SchemaInfo` for primitive types. Some of the primitive schema implementations can use the `properties` parameter to store implementation-specific tunable settings. For example, a string schema can use `properties` to store the encoding charset to serialize and deserialize strings. -Some of the primitive schema implementations can use `properties` to store implementation-specific tunable settings. For example, a `string` schema can use `properties` to store the encoding charset to serialize and deserialize strings. +::: -For more instructions, see [Construct a string schema](schema-get-started.md#construct-a-string-schema). +For more instructions and examples, see [Construct a string schema](schema-get-started.md#string). ### Complex type -Currently, Pulsar supports the following complex types: +The following table outlines the complex types that Pulsar schema supports: | Complex Type | Description | |---|---| -| `KeyValue` | Represents a complex type of a key/value pair. | -| `Struct` | Handles structured data. It supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. | +| `Keyvalue` | Represents a complex key/value pair. | Review Comment: ```suggestion | `KeyValue` | Represents a complex key/value pair. 
| ``` ########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. -| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. 
Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| +| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool | +| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte | +| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short | +| `INT32` | A 32-bit signed integer. | int | int | int32 | int32_t | int | +| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long | +| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float | +| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double | +| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` | +| `STRING` | An Unicode character sequence. | string | str | string| std::string | string | +| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value. | java.sql.Timestamp (java.sql.Time, java.util.Date) | N/A | N/A | N/A | DateTime,TimeSpan | +| `INSTANT`| A single instantaneous point on the timeline with nanoseconds precision. | java.time.Instant | N/A | N/A | N/A | N/A | +| `LOCAL_DATE` | An immutable date-time object that represents a date, often viewed as year-month-day. | java.time.LocalDate | N/A | N/A | N/A | N/A | +| `LOCAL_TIME` | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision. 
| java.time.LocalDateTime | N/A | N/A | N/A | N/A | +| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second. | java.time.LocalTime | N/A | N/A | N/A | N/A | + +:::note -For primitive types, Pulsar does not store any schema data in `SchemaInfo`. The `type` in `SchemaInfo` determines how to serialize and deserialize the data. +Pulsar does not store any schema data in `SchemaInfo` for primitive types. Some of the primitive schema implementations can use the `properties` parameter to store implementation-specific tunable settings. For example, a string schema can use `properties` to store the encoding charset to serialize and deserialize strings. -Some of the primitive schema implementations can use `properties` to store implementation-specific tunable settings. For example, a `string` schema can use `properties` to store the encoding charset to serialize and deserialize strings. +::: -For more instructions, see [Construct a string schema](schema-get-started.md#construct-a-string-schema). +For more instructions and examples, see [Construct a string schema](schema-get-started.md#string). ### Complex type -Currently, Pulsar supports the following complex types: +The following table outlines the complex types that Pulsar schema supports: | Complex Type | Description | |---|---| -| `KeyValue` | Represents a complex type of a key/value pair. | -| `Struct` | Handles structured data. It supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. | +| `Keyvalue` | Represents a complex key/value pair. | +| `Struct` | Represents structured data, including `AvroBaseStructSchema`, `ProtobufNativeSchema` and `Schema.NATIVE_AVRO`. | #### `KeyValue` schema -`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of key schema and the value schema together. +`KeyValue` schema helps applications define schemas for both key and value. 
Pulsar stores the `SchemaInfo` of the key schema and the value schema together. -You can choose the encoding type when constructing the key/value schema.: +Pulsar provides the following methods to encode a **single** key/value pair in a message: * `INLINE` - Key/value pairs are encoded together in the message payload. Review Comment: ```suggestion * `INLINE` - Key/Value pairs are encoded together in the message payload. ``` Although it's not related to this PR, better to keep it consistent. ########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. -| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. 
<br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| +| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool | +| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte | +| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short | +| `INT32` | A 32-bit signed integer. | int | int | int32 | int32_t | int | +| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long | +| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float | +| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double | +| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` | +| `STRING` | An Unicode character sequence. | string | str | string| std::string | string | +| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value. 
| java.sql.Timestamp (java.sql.Time, java.util.Date) | N/A | N/A | N/A | DateTime,TimeSpan | +| `INSTANT`| A single instantaneous point on the timeline with nanoseconds precision. | java.time.Instant | N/A | N/A | N/A | N/A | +| `LOCAL_DATE` | An immutable date-time object that represents a date, often viewed as year-month-day. | java.time.LocalDate | N/A | N/A | N/A | N/A | +| `LOCAL_TIME` | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision. | java.time.LocalDateTime | N/A | N/A | N/A | N/A | +| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second. | java.time.LocalTime | N/A | N/A | N/A | N/A | + +:::note -For primitive types, Pulsar does not store any schema data in `SchemaInfo`. The `type` in `SchemaInfo` determines how to serialize and deserialize the data. +Pulsar does not store any schema data in `SchemaInfo` for primitive types. Some of the primitive schema implementations can use the `properties` parameter to store implementation-specific tunable settings. For example, a string schema can use `properties` to store the encoding charset to serialize and deserialize strings. -Some of the primitive schema implementations can use `properties` to store implementation-specific tunable settings. For example, a `string` schema can use `properties` to store the encoding charset to serialize and deserialize strings. +::: -For more instructions, see [Construct a string schema](schema-get-started.md#construct-a-string-schema). +For more instructions and examples, see [Construct a string schema](schema-get-started.md#string). ### Complex type -Currently, Pulsar supports the following complex types: +The following table outlines the complex types that Pulsar schema supports: | Complex Type | Description | |---|---| -| `KeyValue` | Represents a complex type of a key/value pair. 
| -| `Struct` | Handles structured data. It supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. | +| `Keyvalue` | Represents a complex key/value pair. | +| `Struct` | Represents structured data, including `AvroBaseStructSchema`, `ProtobufNativeSchema` and `Schema.NATIVE_AVRO`. | #### `KeyValue` schema -`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of key schema and the value schema together. +`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of the key schema and the value schema together. -You can choose the encoding type when constructing the key/value schema.: +Pulsar provides the following methods to encode a **single** key/value pair in a message: * `INLINE` - Key/value pairs are encoded together in the message payload. -* `SEPARATED` - see [Construct a key/value schema](schema-get-started.md#construct-a-keyvalue-schema). +* `SEPARATED` - The Key is stored as a message key, while the value is stored as the message payload. See [Construct a key/value schema](schema-get-started.md#keyvalue) for more details. #### `Struct` schema -`Struct` schema supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. +The following table outlines the `struct` types that Pulsar schema supports: |Type|Description| ---|---| -`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar:<br />- to use the same tools to manage schema definitions<br />- to use different serialization or deserialization methods to handle data| -`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar:<br />- to use native protobuf-v3 to serialize or deserialize data<br />- to use `AutoConsume` to deserialize data. 
+`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar to:<br />- use the same tools to manage schema definitions.<br />- use different serialization or deserialization methods to handle data. | +`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar to:<br />- use native protobuf-v3 to serialize or deserialize data<br />- use `AutoConsume` to deserialize data.| +`Schema.NATIVE_AVRO` | `Schema.NATIVE_AVRO` is used to wrap a native Avro schema type `org.apache.avro.Schema`. The result is a schema instance that accepts a serialized Avro payload without validating it against the wrapped Avro schema. <br /><br />When you migrate or ingest event or messaging data from external systems (such as Kafka and Cassandra), the data is often already serialized in Avro format. The applications producing the data typically have validated the data against their schemas (including compatibility checks) and stored them in a database or a dedicated service (such as schema registry). The schema of each serialized data record is usually retrievable by some metadata attached to that record. In such cases, a Pulsar producer doesn't need to repeat the schema validation when sending the ingested events to a topic. All it needs to do is pass each message or event with its schema to Pulsar. | Pulsar provides the following methods to use the `struct` schema. * `static` * `generic` * `SchemaDefinition` -For more examples, see [Construct a struct schema](schema-get-started.md#construct-a-struct-schema). +This example shows how to construct a `struct` schema with these methods and use it to produce and consume messages. 
+ +````mdx-code-block +<Tabs + defaultValue="static" + values={[{"label":"static","value":"static"},{"label":"generic","value":"generic"},{"label":"SchemaDefinition","value":"SchemaDefinition"}]}> + +<TabItem value="static"> + +You can predefine the `struct` schema, which can be a POJO in Java, a `struct` in Go, or classes generated by Avro or Protobuf tools. + +**Example** + +Pulsar gets the schema definition from the predefined `struct` using an Avro library. The schema definition is the schema data stored as a part of the `SchemaInfo`. + +1. Create the _User_ class to define the messages sent to Pulsar topics. + + ```java + public static class User { + public String name; + public int age; + public User(String name, int age) { + this.name = name; + this.age = age + } + public User() {} + } + ``` + +2. Create a producer with a `struct` schema and send messages. + + ```java + Producer<User> producer = client.newProducer(Schema.AVRO(User.class)).create(); + producer.newMessage().value(new User("pulsar-user", 1)).send(); + ``` + +3. Create a consumer with a `struct` schema and receive messages + + ```java + Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class)).subscribe(); + User user = consumer.receive().getValue(); + ``` + +</TabItem> +<TabItem value="generic"> + +Sometimes applications do not have pre-defined structs, and you can use this method to define schema and access data. + +You can define the `struct` schema using the `GenericSchemaBuilder`, generate a generic struct using `GenericRecordBuilder`, and consume messages into `GenericRecord`. + +**Example** + +1. Use `RecordSchemaBuilder` to build a schema. 
+ + ```java + RecordSchemaBuilder recordSchemaBuilder = SchemaBuilder.record("schemaName"); + recordSchemaBuilder.field("intField").type(SchemaType.INT32); + SchemaInfo schemaInfo = recordSchemaBuilder.build(SchemaType.AVRO); + + Consumer<GenericRecord> consumer = client.newConsumer(Schema.generic(schemaInfo)) + .topic(topicName) + .subscriptionName(subscriptionName) + .subscribe(); + Producer<GenericRecord> producer = client.newProducer(Schema.generic(schemaInfo)) + .topic(topicName) + .create(); + ``` + +2. Use `RecordBuilder` to build the struct records. + + ```java + GenericSchemaImpl schema = GenericAvroSchema.of(schemaInfo); + // send message + GenericRecord record = schema.newRecordBuilder().set("intField", 32).build(); + producer.newMessage().value(record).send(); + // receive message + Message<GenericRecord> msg = consumer.receive(); + + Assert.assertEquals(msg.getValue().getField("intField"), 32); + ``` + +</TabItem> +<TabItem value="SchemaDefinition"> + +You can define the `schemaDefinition` to generate a `struct` schema. + +**Example** + +1. Create the _User_ class to define the messages sent to Pulsar topics. + + ```java + public static class User { + public String name; + public int age; + public User(String name, int age) { + this.name = name; + this.age = age + } + public User() {} + } + ``` + +2. Create a producer with a `SchemaDefinition` and send messages. + + ```java + SchemaDefinition<User> schemaDefinition = SchemaDefinition.<User>builder().withPojo(User.class).build(); + Producer<User> producer = client.newProducer(Schema.AVRO(schemaDefinition)).create(); + producer.newMessage().value(new User ("pulsar-user", 1)).send(); + ``` + +3. Create a consumer with a `SchemaDefinition` schema and receive messages. 
+ + ```java + SchemaDefinition<User> schemaDefinition = SchemaDefinition.<User>builder().withPojo(User.class).build(); + Consumer<User> consumer = client.newConsumer(Schema.AVRO(schemaDefinition)).subscribe(); + User user = consumer.receive().getValue(); + ``` + +</TabItem> +</Tabs> +```` ### Auto Schema -If you don't know the schema type of a Pulsar topic in advance, you can use AUTO schema to produce or consume generic records to or from brokers. +If there is no chance to know the schema type of a Pulsar topic in advance, you can use AUTO schemas to produce/consume generic records to/from brokers. Auto schema contains two categories: -* `AUTO_PRODUCE` transfers data from a producer to a Pulsar topic that has a schema and helps the producer validate whether the out-bound bytes are compatible with the schema of the topic. For more instructions, see [Construct an AUTO_PRODUCE schema](schema-get-started.md#construct-an-auto_produce-schema). -* `AUTO_CONSUME` transfers data from a Pulsar topic that has a schema to a consumer and helps the topic validate whether the out-bound bytes are compatible with the consumer. In other words, the topic deserializes messages into language-specific objects `GenericRecord` using the `SchemaInfo` retrieved from brokers. Currently, `AUTO_CONSUME` supports AVRO, JSON and ProtobufNativeSchema schemas. For more instructions, see [Construct an AUTO_CONSUME schema](schema-get-started.md#construct-an-auto_consume-schema). +* `AUTO_PRODUCE` transfers data from a producer to a Pulsar topic that has a schema and helps the producer validate whether the outbound bytes are compatible with the schema of the topic. For more instructions, see [Construct an AUTO_PRODUCE schema](schema-get-started.md#auto_produce). +* `AUTO_CONSUME` transfers data from a Pulsar topic that has a schema to a consumer and helps the topic validate whether the out-bound bytes are compatible with the consumer. 
In other words, the topic deserializes messages into language-specific objects `GenericRecord` using the `SchemaInfo` retrieved from brokers. For more instructions, see [Construct an AUTO_CONSUME schema](schema-get-started.md#auto_consume). -### Native Avro Schema +## Schema validation -When migrating or ingesting event or message data from external systems (such as Kafka and Cassandra), the events are often already serialized in Avro format. The applications producing the data typically have validated the data against their schemas (including compatibility checks) and stored them in a database or a dedicated service (such as a schema registry). The schema of each serialized data record is usually retrievable by some metadata attached to that record. In such cases, a Pulsar producer doesn't need to repeat the schema validation step when sending the ingested events to a topic. All it needs to do is passing each message or event with its schema to Pulsar. +Schema validation enables brokers to reject producers/consumers without a schema. Review Comment: This would confuse users. I think this section only covers `isSchemaValidationEnforced`, which is used to reject producers/consumers without a schema. But `Schema Validation` means validating schema compatibility, which is covered in the next section. I think we need to rename this section to `Schema Validation Enforced`. ########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. 
-| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| +| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool | +| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte | +| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short | +| `INT32` | A 32-bit signed integer. 
| int | int | int32 | int32_t | int | +| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long | +| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float | +| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double | +| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` | +| `STRING` | An Unicode character sequence. | string | str | string| std::string | string | +| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value. | java.sql.Timestamp (java.sql.Time, java.util.Date) | N/A | N/A | N/A | DateTime,TimeSpan | +| `INSTANT`| A single instantaneous point on the timeline with nanoseconds precision. | java.time.Instant | N/A | N/A | N/A | N/A | +| `LOCAL_DATE` | An immutable date-time object that represents a date, often viewed as year-month-day. | java.time.LocalDate | N/A | N/A | N/A | N/A | +| `LOCAL_TIME` | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision. | java.time.LocalDateTime | N/A | N/A | N/A | N/A | +| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second. | java.time.LocalTime | N/A | N/A | N/A | N/A | + +:::note -For primitive types, Pulsar does not store any schema data in `SchemaInfo`. The `type` in `SchemaInfo` determines how to serialize and deserialize the data. +Pulsar does not store any schema data in `SchemaInfo` for primitive types. Some of the primitive schema implementations can use the `properties` parameter to store implementation-specific tunable settings. 
For example, a string schema can use `properties` to store the encoding charset to serialize and deserialize strings. -Some of the primitive schema implementations can use `properties` to store implementation-specific tunable settings. For example, a `string` schema can use `properties` to store the encoding charset to serialize and deserialize strings. +::: -For more instructions, see [Construct a string schema](schema-get-started.md#construct-a-string-schema). +For more instructions and examples, see [Construct a string schema](schema-get-started.md#string). ### Complex type -Currently, Pulsar supports the following complex types: +The following table outlines the complex types that Pulsar schema supports: | Complex Type | Description | |---|---| -| `KeyValue` | Represents a complex type of a key/value pair. | -| `Struct` | Handles structured data. It supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. | +| `Keyvalue` | Represents a complex key/value pair. | +| `Struct` | Represents structured data, including `AvroBaseStructSchema`, `ProtobufNativeSchema` and `Schema.NATIVE_AVRO`. | #### `KeyValue` schema -`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of key schema and the value schema together. +`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of the key schema and the value schema together. -You can choose the encoding type when constructing the key/value schema.: +Pulsar provides the following methods to encode a **single** key/value pair in a message: * `INLINE` - Key/value pairs are encoded together in the message payload. -* `SEPARATED` - see [Construct a key/value schema](schema-get-started.md#construct-a-keyvalue-schema). +* `SEPARATED` - The Key is stored as a message key, while the value is stored as the message payload. See [Construct a key/value schema](schema-get-started.md#keyvalue) for more details. 
#### `Struct` schema -`Struct` schema supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. +The following table outlines the `struct` types that Pulsar schema supports: |Type|Description| ---|---| -`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar:<br />- to use the same tools to manage schema definitions<br />- to use different serialization or deserialization methods to handle data| -`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar:<br />- to use native protobuf-v3 to serialize or deserialize data<br />- to use `AutoConsume` to deserialize data. +`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar to:<br />- use the same tools to manage schema definitions.<br />- use different serialization or deserialization methods to handle data. | +`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar to:<br />- use native protobuf-v3 to serialize or deserialize data<br />- use `AutoConsume` to deserialize data.| +`Schema.NATIVE_AVRO` | `Schema.NATIVE_AVRO` is used to wrap a native Avro schema type `org.apache.avro.Schema`. The result is a schema instance that accepts a serialized Avro payload without validating it against the wrapped Avro schema. <br /><br />When you migrate or ingest event or messaging data from external systems (such as Kafka and Cassandra), the data is often already serialized in Avro format. 
The applications producing the data typically have validated the data against their schemas (including compatibility checks) and stored them in a database or a dedicated service (such as schema registry). The schema of each serialized data record is usually retrievable by some metadata attached to that record. In such cases, a Pulsar producer doesn't need to repeat the schema validation when sending the ingested events to a topic. All it needs to do is pass each message or event with its schema to Pulsar. | Review Comment: I think it's better to link it to the Get started chapter for users to read more information. `ProtobufNativeSchema` -> `schema-get-started/#protobufnative` `Schema.NATIVE_AVRO` -> `schema-get-started/#native-avro` ########## site2/docs/schema-understand.md: ########## @@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**. -| Primitive Type | Description | Java Type| Python Type | Go Type | -|---|---|---|---|---| -| `BOOLEAN` | A binary value | boolean | bool | bool | -| `INT8` | A 8-bit signed integer | int | | int8 | -| `INT16` | A 16-bit signed integer | int | | int16 | -| `INT32` | A 32-bit signed integer | int | | int32 | -| `INT64` | A 64-bit signed integer | int | | int64 | -| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 | -| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64| -| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte | -| `STRING` | A Unicode character sequence | string | str | string| -| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. 
<br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | | -| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | | -| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | | -| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision.| java.time.LocalDateTime | | -| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | | +| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type| +|---|---|---|---|---|---|---| +| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool | +| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte | +| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short | +| `INT32` | A 32-bit signed integer. | int | int | int32 | int32_t | int | +| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long | +| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float | +| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double | +| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` | +| `STRING` | An Unicode character sequence. | string | str | string| std::string | string | +| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value. 
| java.sql.Timestamp (java.sql.Time, java.util.Date) | N/A | N/A | N/A | DateTime,TimeSpan | +| `INSTANT`| A single instantaneous point on the timeline with nanoseconds precision. | java.time.Instant | N/A | N/A | N/A | N/A | +| `LOCAL_DATE` | An immutable date-time object that represents a date, often viewed as year-month-day. | java.time.LocalDate | N/A | N/A | N/A | N/A | +| `LOCAL_TIME` | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision. | java.time.LocalDateTime | N/A | N/A | N/A | N/A | +| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second. | java.time.LocalTime | N/A | N/A | N/A | N/A | + +:::note -For primitive types, Pulsar does not store any schema data in `SchemaInfo`. The `type` in `SchemaInfo` determines how to serialize and deserialize the data. +Pulsar does not store any schema data in `SchemaInfo` for primitive types. Some of the primitive schema implementations can use the `properties` parameter to store implementation-specific tunable settings. For example, a string schema can use `properties` to store the encoding charset to serialize and deserialize strings. -Some of the primitive schema implementations can use `properties` to store implementation-specific tunable settings. For example, a `string` schema can use `properties` to store the encoding charset to serialize and deserialize strings. +::: -For more instructions, see [Construct a string schema](schema-get-started.md#construct-a-string-schema). +For more instructions and examples, see [Construct a string schema](schema-get-started.md#string). ### Complex type -Currently, Pulsar supports the following complex types: +The following table outlines the complex types that Pulsar schema supports: | Complex Type | Description | |---|---| -| `KeyValue` | Represents a complex type of a key/value pair. 
| -| `Struct` | Handles structured data. It supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. | +| `Keyvalue` | Represents a complex key/value pair. | +| `Struct` | Represents structured data, including `AvroBaseStructSchema`, `ProtobufNativeSchema` and `Schema.NATIVE_AVRO`. | #### `KeyValue` schema -`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of key schema and the value schema together. +`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of the key schema and the value schema together. -You can choose the encoding type when constructing the key/value schema.: +Pulsar provides the following methods to encode a **single** key/value pair in a message: * `INLINE` - Key/value pairs are encoded together in the message payload. -* `SEPARATED` - see [Construct a key/value schema](schema-get-started.md#construct-a-keyvalue-schema). +* `SEPARATED` - The Key is stored as a message key, while the value is stored as the message payload. See [Construct a key/value schema](schema-get-started.md#keyvalue) for more details. #### `Struct` schema -`Struct` schema supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. +The following table outlines the `struct` types that Pulsar schema supports: |Type|Description| ---|---| -`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar:<br />- to use the same tools to manage schema definitions<br />- to use different serialization or deserialization methods to handle data| -`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar:<br />- to use native protobuf-v3 to serialize or deserialize data<br />- to use `AutoConsume` to deserialize data. 
+`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar to:<br />- use the same tools to manage schema definitions.<br />- use different serialization or deserialization methods to handle data. | +`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar to:<br />- use native protobuf-v3 to serialize or deserialize data<br />- use `AutoConsume` to deserialize data.| +`Schema.NATIVE_AVRO` | `Schema.NATIVE_AVRO` is used to wrap a native Avro schema type `org.apache.avro.Schema`. The result is a schema instance that accepts a serialized Avro payload without validating it against the wrapped Avro schema. <br /><br />When you migrate or ingest event or messaging data from external systems (such as Kafka and Cassandra), the data is often already serialized in Avro format. The applications producing the data typically have validated the data against their schemas (including compatibility checks) and stored them in a database or a dedicated service (such as schema registry). The schema of each serialized data record is usually retrievable by some metadata attached to that record. In such cases, a Pulsar producer doesn't need to repeat the schema validation when sending the ingested events to a topic. All it needs to do is pass each message or event with its schema to Pulsar. | Review Comment: We can use `NativeAvroBytesSchema` instead of `Schema.NATIVE_AVRO` to make it consistent.
