[ 
https://issues.apache.org/jira/browse/NIFI-14628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Handermann reassigned NIFI-14628:
---------------------------------------

    Assignee: Dariusz Seweryn

> Add Amazon Glue Schema Reference Reader
> ---------------------------------------
>
>                 Key: NIFI-14628
>                 URL: https://issues.apache.org/jira/browse/NIFI-14628
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>    Affects Versions: 2.4.0
>            Reporter: Dariusz Seweryn
>            Assignee: Dariusz Seweryn
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> h1. Context
> Using an (Avro)Reader with Schema Access Strategy = Schema Reference Reader, 
> which means the schema reference is retrieved from the message itself, 
> requires both the Schema Registry and the corresponding Schema Reference Reader.
> A ConsumeKinesisStream processor is available, and Kinesis messages 
> can use Amazon Glue Schema Registry. There is an AmazonGlueSchemaRegistry 
> service, but there is no corresponding AmazonGlueEncodedSchemaReferenceReader 
> service. As a result, there is no way to consume Glue-encoded 
> Kinesis messages "the NiFi way".
> By comparison, Confluent has both services: ConfluentSchemaRegistry and 
> ConfluentEncodedSchemaReferenceReader.
> h1. Solution
> Introduce AmazonGlueEncodedSchemaReferenceReader.
> h1. Discussion Points
> h2. Schema Version ID vs SchemaIdentifier
> The SchemaReferenceReader contract requires returning a SchemaIdentifier.
> This is somewhat problematic because the Glue-encoded message header contains 
> a Header Version byte, a Header Compression byte, and the UUID (16 bytes) of 
> the "Schema Version ID", which is a unique, opaque identifier of both the 
> schema name and its version. The Amazon API allows retrieving a schema by 
> using (Schema Name + optionally Schema Version) XOR Schema Version ID.
> The UUID does not fit well into any of the SchemaIdentifier fields: name 
> (string), version ID (long), identifier (long), version (int), branch 
> (string).
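
For illustration, the header layout described above can be read roughly as follows. This is a minimal sketch, not the proposed implementation: the class and method names are hypothetical, and it assumes the UUID is stored as two big-endian 64-bit values directly after the two single-byte fields.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, not part of NiFi or the AWS SDK.
final class GlueHeaderSketch {
    // 1 Header Version byte + 1 Header Compression byte + 16 UUID bytes.
    static final int HEADER_LENGTH = 18;

    // Extracts the Schema Version ID from the start of a Glue-encoded message.
    static UUID schemaVersionId(byte[] message) {
        if (message == null || message.length < HEADER_LENGTH) {
            throw new IllegalArgumentException(
                    "Message is shorter than the " + HEADER_LENGTH + "-byte header");
        }
        ByteBuffer buffer = ByteBuffer.wrap(message); // big-endian by default
        buffer.get(); // skip Header Version byte
        buffer.get(); // skip Header Compression byte
        long mostSignificantBits = buffer.getLong();
        long leastSignificantBits = buffer.getLong();
        return new UUID(mostSignificantBits, leastSignificantBits);
    }
}
```

The result is a 128-bit value, which is why it cannot be carried losslessly in any of the long or int fields and would have to travel as a string.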
> How to continue?
>  # Pass the UUID in the Branch field.
> No other SchemaIdentifier field would be populated. Since this Reference 
> Reader is expected to be used only with AmazonGlueSchemaRegistry, we can 
> introduce specialized logic there to handle the case when the branch is 
> set. The current implementation does not use the branch at all but expects 
> the Name field to always be available.
>  # Pass the UUID in the Name field.
> This feels less hacky than using the Branch field. However, it would need 
> some kind of custom prefix/suffix in the passed value, so that 
> AmazonGlueSchemaRegistry knows it should handle this particular 
> name as a UUID. This approach may introduce schema name conflicts.
>  # Introduce yet another SchemaIdentifier field.
> Inspiration for the naming is needed.
> -Personally, I would lean toward option 1 because it would not introduce any 
> breaking changes for existing users and, since SchemaIdentifier stays 
> unchanged, does not require potentially cascading changes. Then option 2.-
> After consulting, option 2 seems like a good way to go, with the addition of 
> a specific prefix.
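
A rough sketch of how option 2 with a prefix could look. The prefix value and all names below are hypothetical (the ticket does not specify them); in NiFi the encoded string would be placed in the SchemaIdentifier name field by the reader and decoded by AmazonGlueSchemaRegistry.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical helper illustrating the prefix idea; not the actual NiFi code.
final class GlueSchemaNameSketch {
    // Hypothetical prefix; the real value is not specified in this ticket.
    static final String PREFIX = "schema-version-id:";

    // Reader side: wrap the Schema Version ID so it can travel in the Name field.
    static String toSchemaName(UUID schemaVersionId) {
        return PREFIX + schemaVersionId;
    }

    // Registry side: returns the UUID if the name carries one, otherwise empty,
    // in which case the value is treated as a regular schema name.
    static Optional<UUID> toSchemaVersionId(String schemaName) {
        if (schemaName == null || !schemaName.startsWith(PREFIX)) {
            return Optional.empty();
        }
        try {
            return Optional.of(UUID.fromString(schemaName.substring(PREFIX.length())));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // prefix present, but the rest is not a UUID
        }
    }
}
```

A conflict is still possible if an existing schema name happens to start with the chosen prefix followed by a valid UUID, so the prefix should be something unlikely to appear in real schema names.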
> h2. Support for compression
> The header of the Glue message contains a byte describing whether the message 
> is compressed. If so, the data following the header must be decompressed 
> before further processing. My understanding of NiFi does not give me 
> confidence that this can be easily addressed with the current architecture.
> How to continue?
>  # Divide into smaller problems.
> For now, don't support compressed messages.
>  # Other.
> Open to suggestions.
> After consulting, option 1 seems feasible without major changes in NiFi. 
> Support for compressed messages should be extracted to a new ticket.
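
As a sketch of "don't support compressed messages for now", the reader could reject any message whose compression byte indicates compression. The names are hypothetical, and the code assumes that a compression byte of 0 marks an uncompressed payload; that value is not stated in this ticket and should be checked against the Glue Schema Registry wire format.

```java
// Hypothetical guard, not the actual NiFi implementation.
final class GlueCompressionSketch {
    // The compression byte follows the Header Version byte.
    static final int COMPRESSION_BYTE_INDEX = 1;
    // Assumption: 0 means "not compressed"; verify against the Glue wire format.
    static final byte UNCOMPRESSED = 0;

    // Throws if the message is too short or is flagged as compressed.
    static void requireUncompressed(byte[] message) {
        if (message == null || message.length <= COMPRESSION_BYTE_INDEX) {
            throw new IllegalArgumentException("Message has no compression byte");
        }
        if (message[COMPRESSION_BYTE_INDEX] != UNCOMPRESSED) {
            throw new UnsupportedOperationException(
                    "Compressed Glue messages are not supported");
        }
    }
}
```

Failing fast keeps the behaviour explicit until decompression is handled in the follow-up ticket.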



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
