SaiShashank12 commented on PR #36309:
URL: https://github.com/apache/beam/pull/36309#issuecomment-3413027813
Great questions @jrmccluskey! Let me explain the architectural reasoning behind the separate implementation and address your concerns about naming and infrastructure.

**Note:** I actually raised these same architectural concerns on the [dev mailing list](https://lists.apache.org/[email protected]) before implementing this approach - specifically about whether to update the existing handler or create separate files. The feedback I received was to bring this discussion to the PR review, so I'm glad you're raising these points here!

## Why Separate Files? The Core Technical Difference

The separation isn't primarily about the API version change (v2 → v3), but rather a **fundamental workflow improvement** that solves a critical production issue with TensorRT.

### The Environment Alignment Problem

TensorRT engine files must be compiled on the **exact same hardware, CUDA version, and TensorRT version** where inference will run. This is [documented by NVIDIA](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#compatibility-serialized-engines) as a hard requirement. Pre-building engines on a dev machine and deploying to Dataflow workers almost always causes compatibility failures.

### How the Old Handler Works

```python
# tensorrt_inference.py - Manual workflow
handler = TensorRTEngineHandlerNumPy(...)
network, builder = handler.load_onnx()           # User must call this
engine = handler.build_engine(network, builder)  # Then call this
# Notably: load_onnx() is only used in test files, not production
```

### The New Handler's Key Innovation

```python
# trt_handler_numpy_compact.py - Automatic on-worker building
handler = TensorRTEngineHandlerNumPy(
    onnx_path='gs://bucket/model.onnx',  # Portable ONNX
    build_on_worker=True                 # Builds IN the Dataflow worker
)
engine = handler.load_model()  # Engine built automatically on correct hardware!
```

This eliminates an entire class of deployment failures:

- ✅ ONNX models are fully portable across environments
- ✅ Engine built with the exact GPU/CUDA/TensorRT version used for inference
- ✅ No "works on my machine" issues when deploying to Dataflow
- ✅ Production-ready workflow (vs test-only `load_onnx()`)

### Example Scenario This Solves

```
Developer machine:       Dataflow worker:
- GPU: RTX 4090          - GPU: Tesla T4
- CUDA: 12.8             - CUDA: 12.6
- TensorRT: 10.7         - TensorRT: 10.5

Pre-built engine       → FAILS with cryptic errors ❌
ONNX + on-worker build → Works perfectly ✅
```

## Addressing Your Specific Concerns

### 1. Naming Inconsistency (`trt_handler_numpy_compact.py`)

You're absolutely right - this doesn't follow the naming convention. I'm happy to rename to maintain consistency:

- Option A: `tensorrt_inference_v10.py` (clear version indicator)
- Option B: `tensorrt_inference_auto.py` (highlights auto-building)
- Preferred: **`tensorrt_inference_v10.py`** to match `tensorrt_inference.py`

### 2. Docker & README Duplication

Agreed - these can absolutely be consolidated into the existing `tensorrt_runinference/` directory. I can restructure as:

```
tensorrt_runinference/
├── README.md (updated with both versions)
├── legacy/
│   └── tensorrt_8x.dockerfile
└── v10/
    └── tensorrt_10x.dockerfile
```

### 3. Code Reuse & Future Scalability

There's definitely shared logic (~60-70%) that could be refactored. This was actually one of my key concerns when I reached out to the dev list - whether to prioritize clean separation or code reuse.
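To make the code-reuse point concrete, here is a rough sketch of the kind of helpers that are duplicated today and could live in a shared module. The module and function names are hypothetical (not part of this PR), and it assumes the standard `tensorrt` Python package is importable on the worker:

```python
# tensorrt_common.py (hypothetical shared module, not part of this PR)
from typing import Sequence

import numpy as np


def detect_tensorrt_major_version() -> int:
    """Return the major version of the installed TensorRT package (e.g. 8 or 10)."""
    import tensorrt as trt  # imported lazily so this module loads without TensorRT
    return int(trt.__version__.split('.')[0])


def stack_batch(examples: Sequence[np.ndarray]) -> np.ndarray:
    """Stack per-element arrays into one contiguous batch for engine execution."""
    return np.ascontiguousarray(np.stack(examples, axis=0))
```

Both handlers could import helpers like these instead of re-implementing them, which is what "extract truly shared code" refers to in Option C below.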
My concern with merging into the existing handler is:

- Risk of breaking existing users who rely on the manual `load_onnx()`/`build_engine()` pattern
- The API differences are significant enough that conditional logic might make the code harder to follow

## Proposed Refactoring Options

### Option A: Version-Aware Single Handler (Most maintainable)

```python
# tensorrt_inference.py
class TensorRTEngineHandlerNumPy:
    def __init__(self, ..., tensorrt_version=None, build_on_worker=False):
        if tensorrt_version is None:
            tensorrt_version = self._detect_tensorrt_version()
        if tensorrt_version >= 10:
            self._use_tensor_api = True
        else:
            self._use_tensor_api = False
        self._build_on_worker = build_on_worker

    def load_model(self):
        if self._build_on_worker and self.onnx_path:
            # Automatic ONNX → engine conversion
            return self._build_engine_from_onnx()
        else:
            # Legacy behavior
            return self._load_prebuilt_engine()
```

### Option B: Base Class with Version Subclasses

```python
# tensorrt_inference_base.py - Shared logic
# tensorrt_inference_v9.py   - Pre-10.x (legacy buffer API)
# tensorrt_inference_v10.py  - 10.x+ (Tensor API + auto-building)
```

### Option C: Keep Separate, Fix Naming (Current + improvements)

- Rename to `tensorrt_inference_v10.py`
- Add deprecation notice to old handler encouraging 10.x adoption
- Consolidate Docker infrastructure
- Extract truly shared code (e.g., metrics, batching) to a utility module

## My Recommendation

I'd lean toward **Option A** if the on-worker building feature is valuable enough to backport to the old handler, or **Option C** if we want to keep clean separation for stability. Since the dev list suggested bringing this architectural discussion here, I'm very open to your expertise on what fits best with Beam's patterns.

**Question for you:** Does Beam typically prefer version-aware handlers (like PyTorch's different model loaders) or separate handlers per major version? That would help me align with the project's patterns.

## Summary

The architectural decision for separate files was driven by:

1. **Solving a real production problem** (environment compatibility via on-worker building)
2. **Enabling a new workflow** that wasn't production-ready in the old handler
3. **Avoiding breaking changes** to existing users

However, I completely agree with your concerns about:

- ❌ Naming inconsistency → Will fix
- ❌ Infrastructure duplication → Will consolidate
- ❌ Code duplication → Open to refactoring

I'm happy to implement whichever approach aligns best with Beam's architecture patterns. Let me know your preference and I'll refactor accordingly!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
