MisterRaindrop commented on PR #1571:
URL: https://github.com/apache/cloudberry/pull/1571#issuecomment-3949722894

> > > Overall, FDW parallel scan is a direction worth exploring, but this approach is too rough. The core problems are:
> > >
> > > 1. locus transition semantics for Gather in an MPP context haven't been thought through, and the changes are too broad.
> > > 2. FDW is a black box from the database's perspective.
> > > For heap tables we have parallel scan (divide work by pages), for AO/AOCS we have parallel scan (divide work by files); the work partitioning is well-defined.
> > > But for FDWs, the parallel behavior depends entirely on the FDW's own implementation. If an FDW (say file_fdw) sets parallel_safe = true following the planner's parallel logic but doesn't actually implement the DSM parallel callbacks (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan), then multiple workers will each scan the full dataset, producing duplicate rows.
> > >
> > > I'm not very familiar with Cloudberry. Still learning.
> > FDW itself is a black box. Its specific implementation largely depends on how the user implements it. My understanding is that users need to take responsibility for their own implementations. Additionally, I should only enable gather for FDW. In other cases, it should remain false, this will parallel processing advantages of PostgreSQL?
> > Additionally, I've looked into other aspects of FDW parallelism. Currently, it seems there is no optimal solution.
> > So, should we aim to implement parallelism that is transparent to users? Or are there better approaches? Could you share some ideas?
>
> Neither PostgreSQL nor Cloudberry supports parallel FDW scans; that's a deliberate decision, not an oversight.
>
> On the implementation side: having the kernel generate partial paths for FDW will cause FDWs that don't implement parallel scan callbacks to silently produce wrong results (e.g. duplicate rows). That's a kernel bug, not a user error; we can't shift that responsibility to FDW authors.
> And mixing Gather with CBDB-style parallelism remains fundamentally broken: the locus handling is wrong, and none of the issues I raised (joins, locus transitions, the overly broad execMain.c change) have been addressed.
>
> More importantly, before discussing how, we need to answer why. What real-world problem does this solve in an MPP system where FDW is already used across segments? And given the risks I mentioned above (broken locus transitions, silent wrong results for existing FDWs, untested join/subquery interactions), even if it can be done, is it worth the complexity? If you want to push this forward, you need to make the case clearly: what's the motivation, and convince us that all the issues raised have sound solutions.

Parallel FDW primarily addresses slow data loading. This functionality was already implemented in earlier versions of PostgreSQL, and I am now attempting to integrate it into MPP systems. In simple tests, parallelization delivered a one- to two-fold performance improvement. Gains of that size matter for performance-sensitive business scenarios, which is why I am working to introduce this functionality. Alternatively, we could discuss the implementation plan in the issue tracker.
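
To make the duplicate-row concern quoted above concrete, here is a minimal sketch in plain Python. It is not PostgreSQL or Cloudberry code, and it only simulates workers sequentially. `SharedCursor`, `scan_without_shared_state`, and `scan_with_shared_state` are hypothetical names invented for illustration; the shared cursor is an assumed stand-in for whatever state an FDW would set up through the DSM callbacks named above (EstimateDSMForeignScan, InitializeDSMForeignScan, InitializeWorkerForeignScan).

```python
from threading import Lock

ROWS = list(range(10))   # stand-in for the rows behind a foreign table
N_WORKERS = 3            # number of simulated parallel workers


def scan_without_shared_state(rows, n_workers):
    """Every worker runs an independent full scan.

    A Gather-like step then concatenates the worker outputs, so each row
    is returned once per worker.
    """
    gathered = []
    for _ in range(n_workers):
        gathered.extend(rows)  # this worker sees the whole data set
    return gathered


class SharedCursor:
    """Hypothetical shared scan position (assumption, not a real FDW API)."""

    def __init__(self):
        self._next = 0
        self._lock = Lock()

    def claim(self, limit):
        """Hand out the next unclaimed row index, or None when exhausted."""
        with self._lock:
            if self._next >= limit:
                return None
            idx = self._next
            self._next += 1
            return idx


def scan_with_shared_state(rows, n_workers):
    """Workers coordinate through one cursor, so each row is read once."""
    cursor = SharedCursor()
    gathered = []
    progressed = True
    while progressed:
        progressed = False
        # Round-robin over the workers to imitate interleaved execution.
        for _ in range(n_workers):
            idx = cursor.claim(len(rows))
            if idx is not None:
                gathered.append(rows[idx])
                progressed = True
    return gathered
```

With these definitions, `scan_without_shared_state(ROWS, N_WORKERS)` returns every row three times, while `scan_with_shared_state(ROWS, N_WORKERS)` returns each row exactly once. The sketch says nothing about locus handling or how Gather should interact with CBDB-style parallelism; it only shows why work partitioning has to come from the FDW itself.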
