SaiShashank12 commented on PR #36309:
URL: https://github.com/apache/beam/pull/36309#issuecomment-3413027813

   Great questions @jrmccluskey ! Let me explain the architectural reasoning 
behind the separate implementation and address your concerns about naming and 
infrastructure.
   
   **Note:** I actually raised these same architectural concerns on the [dev 
mailing list](https://lists.apache.org/[email protected]) before 
implementing this approach - specifically about whether to update the existing 
handler or create separate files. The feedback I received was to bring this 
discussion to the PR review, so I'm glad you're raising these points here!
   
   ## Why Separate Files? The Core Technical Difference
   
   The separation isn't primarily about the API version change (v2 → v3), but 
rather about a **fundamental workflow improvement** that solves a critical 
production issue with TensorRT.
   
   ### The Environment Alignment Problem
   
   TensorRT engine files must be compiled on the **exact same hardware, CUDA 
version, and TensorRT version** where inference will run. This is [documented 
by 
NVIDIA](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#compatibility-serialized-engines)
 as a hard requirement. Pre-building engines on a dev machine and deploying to 
Dataflow workers almost always causes compatibility failures.
   
   ### How the Old Handler Works
   
   ```python
   # tensorrt_inference.py - Manual workflow
   handler = TensorRTEngineHandlerNumPy(...)
   network, builder = handler.load_onnx()  # User must call this
   engine = handler.build_engine(network, builder)  # Then call this
   # Notably: load_onnx() is only used in test files, not production
   ```
   
   ### The New Handler's Key Innovation
   
   ```python
   # trt_handler_numpy_compact.py - Automatic on-worker building
   handler = TensorRTEngineHandlerNumPy(
       onnx_path='gs://bucket/model.onnx',  # Portable ONNX
       build_on_worker=True  # Builds IN the Dataflow worker
   )
   engine = handler.load_model()  # Engine built automatically on the correct hardware!
   ```
   
   This eliminates an entire class of deployment failures:
   - ✅ ONNX models are fully portable across environments
   - ✅ Engine built with the exact GPU/CUDA/TensorRT version used for inference
   - ✅ No "works on my machine" issues when deploying to Dataflow
   - ✅ Production-ready workflow (vs test-only `load_onnx()`)
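   
   As background on what `build_on_worker=True` has to do under the hood, here is 
a rough sketch of an on-worker ONNX → engine build using the TensorRT 10.x Python 
API. This is purely illustrative (not this PR's actual code), and it assumes the 
ONNX file has already been staged to a local path on the worker; the real handler 
would also need to handle downloading from GCS:
   
   ```python
   import tensorrt as trt
   
   def _build_engine_from_onnx(local_onnx_path: str):
       """Illustrative only: compile an ONNX model into a TensorRT engine on the
       machine this code runs on (i.e. the Dataflow worker)."""
       logger = trt.Logger(trt.Logger.WARNING)
       builder = trt.Builder(logger)
       # TensorRT 10.x networks are explicit-batch; a flag value of 0 is enough.
       network = builder.create_network(0)
       parser = trt.OnnxParser(network, logger)
       with open(local_onnx_path, 'rb') as f:
           if not parser.parse(f.read()):
               errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
               raise RuntimeError(f'Failed to parse ONNX model: {errors}')
       config = builder.create_builder_config()
       # The engine is compiled against this worker's GPU / CUDA / TensorRT stack,
       # which is exactly the environment-alignment guarantee described above.
       serialized_engine = builder.build_serialized_network(network, config)
       return trt.Runtime(logger).deserialize_cuda_engine(serialized_engine)
   ```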
   
   ### Example Scenario This Solves
   
   ```
   Developer machine:           Dataflow worker:
   - GPU: RTX 4090              - GPU: Tesla T4
   - CUDA: 12.8                 - CUDA: 12.6
   - TensorRT: 10.7             - TensorRT: 10.5
   
   Pre-built engine → FAILS with cryptic errors ❌
   ONNX + on-worker build → Works perfectly ✅
   ```
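   
   To make the end-to-end workflow concrete, a pipeline using the new handler 
would look roughly like the sketch below. The import path assumes the new module 
sits alongside the existing handler in `apache_beam.ml.inference`, and the 
`onnx_path`/`build_on_worker` arguments follow the code in this PR as it stands 
(both may change with the rename discussed below); the batching arguments are my 
assumption, mirroring the existing handler:
   
   ```python
   import numpy as np
   import apache_beam as beam
   from apache_beam.ml.inference.base import RunInference
   # Import path as the PR currently stands; it would change with the rename.
   from apache_beam.ml.inference.trt_handler_numpy_compact import (
       TensorRTEngineHandlerNumPy)
   
   handler = TensorRTEngineHandlerNumPy(
       onnx_path='gs://bucket/model.onnx',  # portable ONNX artifact
       build_on_worker=True,  # engine is compiled on the Dataflow worker itself
       min_batch_size=1,  # assumed: mirrors the existing handler's batching args
       max_batch_size=4,
   )
   
   examples = [np.zeros((3, 224, 224), dtype=np.float32)]  # placeholder inputs
   
   with beam.Pipeline() as p:
       _ = (
           p
           | 'CreateExamples' >> beam.Create(examples)
           | 'RunInference' >> RunInference(handler)
           | 'ExtractOutputs' >> beam.Map(lambda result: result.inference))
   ```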
   
   ## Addressing Your Specific Concerns
   
   ### 1. Naming Inconsistency (`trt_handler_numpy_compact.py`)
   
   You're absolutely right - this doesn't follow the naming convention. I'm 
happy to rename to maintain consistency:
   - Option A: `tensorrt_inference_v10.py` (clear version indicator)
   - Option B: `tensorrt_inference_auto.py` (highlights auto-building)
   - Preferred: **`tensorrt_inference_v10.py`** to match `tensorrt_inference.py`
   
   ### 2. Docker & README Duplication
   
   Agreed - these can absolutely be consolidated into the existing 
`tensorrt_runinference/` directory. I can restructure as:
   ```
   tensorrt_runinference/
   ├── README.md (updated with both versions)
   ├── legacy/
   │   └── tensorrt_8x.dockerfile
   └── v10/
       └── tensorrt_10x.dockerfile
   ```
   
   ### 3. Code Reuse & Future Scalability
   
   There's definitely shared logic (~60-70%) that could be refactored. This was 
actually one of my key concerns when I reached out to the dev list - whether to 
prioritize clean separation or code reuse. My concerns with merging into the 
existing handler are:
   - Risk of breaking existing users who rely on the manual 
`load_onnx()`/`build_engine()` pattern
   - The API differences are significant enough that conditional logic might 
make the code harder to follow
   
   ## Proposed Refactoring Options
   
   ### Option A: Version-Aware Single Handler (Most maintainable)
   
   ```python
   # tensorrt_inference.py
   class TensorRTEngineHandlerNumPy:
       def __init__(self, ..., tensorrt_version=None, build_on_worker=False):
           if tensorrt_version is None:
               tensorrt_version = self._detect_tensorrt_version()
   
           if tensorrt_version >= 10:
               self._use_tensor_api = True
           else:
               self._use_tensor_api = False
   
           self._build_on_worker = build_on_worker
   
       def load_model(self):
           if self._build_on_worker and self.onnx_path:
               # Automatic ONNX → engine conversion
               return self._build_engine_from_onnx()
           else:
               # Legacy behavior
               return self._load_prebuilt_engine()
   ```
   
   ### Option B: Base Class with Version Subclasses
   
   ```python
   # tensorrt_inference_base.py - Shared logic
   # tensorrt_inference_v9.py - Pre-10.x (legacy buffer API)
   # tensorrt_inference_v10.py - 10.x+ (Tensor API + auto-building)
   ```
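   
   A minimal sketch of what that split could look like; every name below is a 
placeholder rather than a concrete API proposal:
   
   ```python
   # Illustrative only: hypothetical class layout for Option B.
   
   class _TensorRTEngineHandlerBase:
       """Shared logic: batching, metrics, engine caching, inference plumbing."""
   
       def _load_prebuilt_engine(self):
           ...  # deserialize an engine file that was staged onto the worker
   
   class TensorRTEngineHandlerV9(_TensorRTEngineHandlerBase):
       """Pre-10.x: legacy buffer API, manual load_onnx()/build_engine()."""
   
       def load_model(self):
           return self._load_prebuilt_engine()
   
   class TensorRTEngineHandlerV10(_TensorRTEngineHandlerBase):
       """10.x+: Tensor API, plus optional on-worker ONNX-to-engine building."""
   
       def __init__(self, onnx_path=None, build_on_worker=False):
           self.onnx_path = onnx_path
           self.build_on_worker = build_on_worker
   
       def load_model(self):
           if self.build_on_worker and self.onnx_path:
               return self._build_engine_from_onnx()  # see the build sketch above
           return self._load_prebuilt_engine()
   
       def _build_engine_from_onnx(self):
           ...  # compile the ONNX model against this worker's GPU/CUDA/TRT stack
   ```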
   
   ### Option C: Keep Separate, Fix Naming (Current + improvements)
   
   - Rename to `tensorrt_inference_v10.py`
   - Add deprecation notice to old handler encouraging 10.x adoption
   - Consolidate Docker infrastructure
   - Extract truly shared code (e.g., metrics, batching) to a utility module
   
   ## My Recommendation
   
   I'd lean toward **Option A** if the on-worker building feature is valuable 
enough to backport to the old handler, or **Option C** if we want to keep clean 
separation for stability.
   
   Since the dev list suggested bringing this architectural discussion here, 
I'm very open to your expertise on what fits best with Beam's patterns. 
**Question for you:** Does Beam typically prefer version-aware handlers (like 
PyTorch's different model loaders) or separate handlers per major version? That 
would help me align with the project's patterns.
   
   ## Summary
   
   The architectural decision for separate files was driven by:
   1. **Solving a real production problem** (environment compatibility via 
on-worker building)
   2. **Enabling a new workflow** that wasn't production-ready in the old 
handler
   3. **Avoiding breaking changes** to existing users
   
   However, I completely agree with your concerns about:
   - ❌ Naming inconsistency → Will fix
   - ❌ Infrastructure duplication → Will consolidate
   - ❌ Code duplication → Open to refactoring
   
   I'm happy to implement whichever approach aligns best with Beam's 
architecture patterns. Let me know your preference and I'll refactor 
accordingly!
   

