lksvenoy-r7 opened a new issue, #8889: URL: https://github.com/apache/pinot/issues/8889
The current Pinot Flink connector does not handle errors gracefully. Because of the way the connector works, if it fails in the middle of adding segments to a table, the table ends up with an inconsistent view. Additionally, the connector does not currently support refresh tables. Refresh tables require atomic segment replacement, but the connector naively uploads segments as they are built.

From testing the connector in production, I've also identified a few performance issues. These have a few different causes: the Avro serialization is not configurable, and neither is the file writing (for example, to use different block sizes).

I have written a Flink connector based on this one, but with some heavy amendments. First of all, it implements WithPostCommitTopology<GenericRecord, PinotSinkCommittable> from Flink, implementing a global committer. It works in four stages:

1. The operator is responsible for sending serialized Avro records directly to the sink.
2. The sink writer is responsible for building segments and flushing them to disk.
3. The sink committer (before global commit) is responsible for uploading the segments to a location that is reachable by all nodes in the Flink cluster (in my case, the S3 deep store).
4. The global sink committer executes the segment replacement protocol defined in the Pinot SDK.

This sink is currently only compatible with REFRESH type tables that replace all segments on every job execution. It takes care of atomically replacing the segments for the table, and it performs well because the expensive work (building and uploading segments) is done up front, before the commit.

I am open to sharing this code so that it can be merged into the Pinot repository, but it does have some limitations:
- No checkpointing
- Only BATCH execution mode is supported at the moment
- Only REFRESH tables are supported at the moment (full segment replacement)
- The connector currently bypasses certain Pinot conventions (such as using certain attributes defined in the batch config, and so on). This would need to be approached with scrutiny to ensure the code is in line with the rest of the repository.
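
To illustrate why the staged design leaves the table consistent on failure, here is a minimal sketch of the build, upload, and global-commit stages in plain Java. It is not the connector's code and uses neither the Flink nor the Pinot API: every class and method name is hypothetical, segments are modelled as lists of strings, and an in-memory `AtomicReference` swap stands in for Pinot's segment replacement protocol.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, dependency-free model of the staged commit; not the Flink or Pinot API.
final class StagedReplaceSketch {

    // Stage 2: the writer groups records into segments of at most maxRecords each.
    static List<List<String>> buildSegments(List<String> records, int maxRecords) {
        List<List<String>> segments = new ArrayList<>();
        for (int i = 0; i < records.size(); i += maxRecords) {
            int end = Math.min(i + maxRecords, records.size());
            segments.add(new ArrayList<>(records.subList(i, end)));
        }
        return segments;
    }

    // Stage 3: the committer copies each segment to shared staging storage.
    // failAt simulates an upload error at that segment index (-1 means no failure).
    static List<String> uploadToStaging(List<List<String>> segments, int failAt) {
        List<String> staged = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            if (i == failAt) {
                throw new IllegalStateException("upload failed for segment " + i);
            }
            staged.add("new_" + i);
        }
        return staged;
    }

    // Stage 4: the global committer replaces the whole live segment set in one step.
    // Assumption: a reference swap stands in for the real replacement protocol.
    static void globalCommit(AtomicReference<List<String>> table, List<String> staged) {
        table.set(List.copyOf(staged));
    }

    // Runs stages 2 to 4. Returns true if the table was replaced,
    // false if an upload failed and the table was left untouched.
    static boolean runJob(AtomicReference<List<String>> table,
                          List<String> records, int maxRecords, int failAt) {
        try {
            List<String> staged = uploadToStaging(buildSegments(records, maxRecords), failAt);
            globalCommit(table, staged);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicReference<List<String>> table =
                new AtomicReference<>(List.of("old_0", "old_1"));
        List<String> records = List.of("r1", "r2", "r3", "r4", "r5");

        // An upload fails on the second segment: the old segments stay live.
        System.out.println(runJob(table, records, 2, 1) + " " + table.get());
        // All uploads succeed: the table switches to the new set in one step.
        System.out.println(runJob(table, records, 2, -1) + " " + table.get());
    }
}
```

Because nothing touches the live segment list until every upload has succeeded, a failure in any earlier stage leaves the previous segments in place, which is the property the naive upload-as-you-build approach lacks.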
