[
https://issues.apache.org/jira/browse/NIFI-14628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Handermann updated NIFI-14628:
------------------------------------
Summary: Add Amazon Glue Schema Reference Reader (was: Amazon Glue message
deserialization)
> Add Amazon Glue Schema Reference Reader
> ---------------------------------------
>
> Key: NIFI-14628
> URL: https://issues.apache.org/jira/browse/NIFI-14628
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Affects Versions: 2.4.0
> Reporter: Dariusz Seweryn
> Priority: Major
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> h1. Context
> Using an (Avro)Reader with Schema Access Strategy = Schema Reference Reader,
> which means the schema reference is retrieved from the message itself,
> requires both the Schema Registry and corresponding Schema Reference Reader.
> There is an available ConsumeKinesisStream processor. The Kinesis messages
> can use Amazon Glue Schema Registry. There is an AmazonGlueSchemaRegistry
> service, but there is no corresponding AmazonGlueEncodedSchemaReferenceReader
> service. As a result, Glue-encoded Kinesis messages cannot be handled
> "the NiFi way".
> By comparison, Confluent has both services: ConfluentSchemaRegistry and
> ConfluentEncodedSchemaReferenceReader.
> h1. Solution
> Introduce AmazonGlueEncodedSchemaReferenceReader.
> h1. Discussion Points
> h2. Schema Version ID vs SchemaIdentifier
> The SchemaReferenceReader contract requires returning a SchemaIdentifier.
> This is somewhat problematic because the Glue-encoded message header contains
> a Header Version byte, a Header Compression byte, and the UUID (16 bytes) of
> the "Schema Version ID", which is a unique, opaque identifier of both the
> schema name and its version. The Amazon API allows retrieving a schema by
> (Schema Name + optionally Schema Version) XOR Schema Version ID.
> The UUID does not fit well into any of the SchemaIdentifier fields: name
> (string), version ID (long), identifier (long), version (int), branch
> (string).
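> The header layout described above can be sketched in Java as follows. This
> is only an illustration, not the proposed NiFi implementation: the class
> name GlueHeader and its members are hypothetical, and the big-endian
> ordering of the UUID halves is an assumption that should be verified
> against the Glue serializer.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical holder for the fields of a Glue-encoded message header. */
final class GlueHeader {
    // 1 header version byte + 1 compression byte + 16 UUID bytes.
    static final int HEADER_LENGTH = 18;

    final byte headerVersion;
    final byte compression;
    final UUID schemaVersionId;

    private GlueHeader(byte headerVersion, byte compression, UUID schemaVersionId) {
        this.headerVersion = headerVersion;
        this.compression = compression;
        this.schemaVersionId = schemaVersionId;
    }

    /** Reads the first 18 bytes of the message; the payload is left untouched. */
    static GlueHeader parse(byte[] message) {
        if (message == null || message.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("Message is shorter than the 18-byte header");
        }
        // ByteBuffer is big-endian by default (assumed to match the Glue encoding).
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte headerVersion = buffer.get();
        byte compression = buffer.get();
        long mostSignificantBits = buffer.getLong();
        long leastSignificantBits = buffer.getLong();
        return new GlueHeader(headerVersion, compression,
                new UUID(mostSignificantBits, leastSignificantBits));
    }
}
```

> The resulting schemaVersionId is a 128-bit value, which is why it cannot be
> stored in the long or int fields of SchemaIdentifier without loss.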
> How to continue?
> # Pass the UUID in the Branch field.
> No other SchemaIdentifier field would be populated. Since this Reference
> Reader is expected to be used with AmazonGlueSchemaRegistry only, we can
> introduce specialized logic there to handle the case when the branch is
> present. The current implementation does not use the branch at all, but it
> expects the Name field to always be available.
> # Pass the UUID in the Name field.
> This feels less hacky than using the Branch field. However, it would need
> some kind of custom prefix/suffix in the passed value, so that
> AmazonGlueSchemaRegistry knows it should handle this particular name as a
> UUID. This approach may introduce schema name conflicts.
> # Introduce yet another SchemaIdentifier field.
> Inspiration for the naming is needed.
> -Personally I would lean to option 1. because it would not introduce any
> breaking changes to existing users and does not require potentially cascading
> changes due to no changes in SchemaIdentifier. Then option 2.-
> After consultation, option 2 seems like a good way to go, with the addition
> of a specific prefix.
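> A minimal sketch of option 2 with a prefix, assuming a made-up prefix
> value: the ticket does not specify the actual prefix, and the class and
> method names below are hypothetical. The reader side would wrap the UUID
> into the name, and the registry side would detect the prefix and fall back
> to a normal schema-name lookup when it is absent.

```java
import java.util.Optional;
import java.util.UUID;

/** Hypothetical helper for carrying a Schema Version ID inside the Name field. */
final class GlueSchemaNames {
    // Hypothetical prefix; it should be chosen so that it is unlikely to
    // collide with real schema names.
    static final String PREFIX = "glue-schema-version-id:";

    /** Reader side: encode the UUID as a schema name. */
    static String toSchemaName(UUID schemaVersionId) {
        return PREFIX + schemaVersionId;
    }

    /**
     * Registry side: returns the UUID if the name carries the prefix and a
     * valid UUID, or empty if the name should be treated as a plain schema name.
     */
    static Optional<UUID> toSchemaVersionId(String schemaName) {
        if (schemaName == null || !schemaName.startsWith(PREFIX)) {
            return Optional.empty();
        }
        try {
            return Optional.of(UUID.fromString(schemaName.substring(PREFIX.length())));
        } catch (IllegalArgumentException e) {
            // Prefix present but the remainder is not a UUID: treat as a plain name.
            return Optional.empty();
        }
    }
}
```

> Treating a malformed value as a plain name is one possible design choice;
> rejecting it with an error would make prefix misuse easier to spot.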
> h2. Support for compression
> In the header of the Glue message there is a byte describing whether the
> message is compressed. If so, the data after the header must be decompressed
> before further processing. I am not confident that this can be easily
> addressed with the current NiFi architecture.
> How to continue?
> # Divide into smaller problems
> For now, don't support compressed messages.
> # Other
> Open for suggestions.
> After consultation, option 1 seems feasible without major changes in NiFi,
> which should be extracted to a new ticket.
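> For reference, the decompression step described above could look roughly
> like the sketch below. It is not tied to any NiFi API. The class name is
> hypothetical, and the compression byte values (0 for none, 5 for zlib) are
> assumptions based on the AWS Glue Schema Registry serializer that should
> be verified before use.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical helper that returns the payload that follows the Glue header. */
final class GluePayload {
    static final int HEADER_LENGTH = 18;     // version + compression + 16-byte UUID
    static final byte COMPRESSION_NONE = 0;  // assumed value
    static final byte COMPRESSION_ZLIB = 5;  // assumed value

    static byte[] extract(byte[] message) {
        if (message == null || message.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("Message is shorter than the 18-byte header");
        }
        byte compression = message[1];
        if (compression == COMPRESSION_NONE) {
            return Arrays.copyOfRange(message, HEADER_LENGTH, message.length);
        }
        if (compression != COMPRESSION_ZLIB) {
            throw new IllegalArgumentException("Unsupported compression byte: " + compression);
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(message, HEADER_LENGTH, message.length - HEADER_LENGTH);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int count = inflater.inflate(chunk);
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported zlib stream");
                }
                out.write(chunk, 0, count);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Payload is not valid zlib data", e);
        } finally {
            inflater.end(); // release native resources
        }
    }
}
```

> This sketch buffers the whole payload in memory, which may not suit large
> records; a streaming variant would be a separate design decision.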