cshuo commented on PR #18958:
URL: https://github.com/apache/hudi/pull/18958#issuecomment-4725262176

   Another possible direction is to expose BLOB fields as `BYTES` / `BINARY` in 
Flink DDL, and use table options to tell the connector which columns are BLOB 
fields and whether they should be materialized during reads. For example:
   
   ```sql
   CREATE TABLE media_assets (
     asset_id STRING,
     blob_content BYTES,
     thumbnail BYTES,
     ts BIGINT,
     PRIMARY KEY (asset_id) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3://bucket/media_assets',
     'table.type' = 'MERGE_ON_READ',
     'hoodie.blob.fields' = 'blob_content,thumbnail',
     'hoodie.blob.read.materialize' = 'true'
   );
   ```
   When `hoodie.blob.read.materialize` is `false`, the `BYTES` value would 
contain only the BLOB descriptor bytes. When it is `true`, the connector would 
materialize the BLOB and return the actual data bytes.
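   
   As a rough sketch of how a query could choose the mode per read, Flink's 
dynamic table options hint could override the table-level setting. This assumes 
the proposed `hoodie.blob.read.materialize` option exists (it is only a 
suggestion in this comment, not an existing Hudi option) and that dynamic table 
options are enabled in the Flink session:
   
   ```sql
   -- Hypothetical: 'hoodie.blob.read.materialize' is the option proposed above,
   -- not an existing Hudi config. The OPTIONS hint itself is standard Flink SQL.

   -- Cheap scan: thumbnail would hold only the descriptor bytes.
   SELECT asset_id, thumbnail
   FROM media_assets /*+ OPTIONS('hoodie.blob.read.materialize' = 'false') */;

   -- Full read: the connector would fetch and return the actual BLOB content.
   SELECT asset_id, blob_content
   FROM media_assets /*+ OPTIONS('hoodie.blob.read.materialize' = 'true') */
   WHERE asset_id = 'a-001';
   ```
   Because both modes return `BYTES`, the table schema would stay the same and 
only the meaning of the bytes would change with the option.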
   
   This would let the connector own the read path instead of relying on a 
scalar UDF. The source/operator could then identify the configured BLOB fields 
and batch BLOB reads for optimization, rather than being constrained by per-row 
scalar UDF evaluation.


