ysh329 opened a new issue, #17178: URL: https://github.com/apache/tvm/issues/17178
# Introduction

The TVM community has worked since the v0.16.0 release to deliver the following new exciting improvements! This release version is v0.17.0.

The main tags are below (**bold text indicates areas with substantial progress**):

- Community, RFCs
- Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
- **Relax**, **Dlight**, **Disco**
- Arith, **TIR**, TVMScript
- Docs, CI, **Misc**, **BugFix**

Please visit the full listing of commits for a complete view: [v0.17.dev0...v0.17.0](https://github.com/apache/tvm/compare/v0.17.dev0...v0.17.0).

### Community

* [#17018](https://github.com/apache/tvm/pull/17018) - New committer: Balint Cristian

### RFCs

This RFC adds a frontend for NNEF, an open, standardized format for neural network exchange developed by the Khronos Group since 2018 (https://www.khronos.org/nnef). It is aimed at deploying trained neural networks from deep learning frameworks to the proprietary inference engines of neural network hardware vendors.

* [#108](https://github.com/apache/tvm-rfcs/pull/108) - [RFC] Add NNEF frontend

----

### AOT

* [#17077](https://github.com/apache/tvm/pull/17077) - Correctly calculate workspace for vector types

### Adreno

* [#16927](https://github.com/apache/tvm/pull/16927) - [SCRIPT] Fix in build config for adreno

### BYOC

* [#16895](https://github.com/apache/tvm/pull/16895) - Add layout check and update shape check for cublas FP8 BYOC

### BugFix

* [#17138](https://github.com/apache/tvm/pull/17138) - [Fix][TIR] Fix outdated call to create extern buffer in make_extern
* [#17132](https://github.com/apache/tvm/pull/17132) - Restrict CopyOnWrite to _type_final
* [#17096](https://github.com/apache/tvm/pull/17096) - Update FAttrsGetter to return Map<String, ObjectRef>
* [#17078](https://github.com/apache/tvm/pull/17078) - [NCCL] Release NCCL thread_local resources in destructor
* [#17044](https://github.com/apache/tvm/pull/17044) - [Support] Fix copy constructor for support::OrderedSet
* [#17000](https://github.com/apache/tvm/pull/17000) -
[MSC] split name_string with index by colon from the right
* [#16923](https://github.com/apache/tvm/pull/16923) - [Fix][Dlight] Fix GeneralReduction for log-sum-exp
* [#16924](https://github.com/apache/tvm/pull/16924) - [Fix] Fix SSA conversion for SizeVar retention
* [#16903](https://github.com/apache/tvm/pull/16903) - CudaDeviceAPI::GetAttr may check kExist when GPUs absent
* [#16901](https://github.com/apache/tvm/pull/16901) - rocm shared memory issue on MI250

### CI

* [#17055](https://github.com/apache/tvm/pull/17055) - [SME][Test] Add additional conv2d tests for asymmetric parameters
* [#17007](https://github.com/apache/tvm/pull/17007) - [TOPI][Testing] Enable conv2d NHWC fp16 topi testing for `arm_cpu`
* [#16930](https://github.com/apache/tvm/pull/16930) - [UnitTest] Use pytest's scope='session' for tvm.testing.parameter
* [#16948](https://github.com/apache/tvm/pull/16948) - Update image tag to 20240428-060115-0b09ed018
* [#16931](https://github.com/apache/tvm/pull/16931) - Use LLVM17 for tests on `ci_cpu`
* [#16942](https://github.com/apache/tvm/pull/16942) - Enable Conda setup v3
* [#16939](https://github.com/apache/tvm/pull/16939) - Upgrade CUDA to 12.4

### CRT

* [#17097](https://github.com/apache/tvm/pull/17097) - [Bugfix] Return error code on error from ModuleGetFunction

### Disco

* [#17035](https://github.com/apache/tvm/pull/17035) - [QoL] Implement broadcast/scatter methods for Session
* [#16992](https://github.com/apache/tvm/pull/16992) - [Bugfix] Handle NDArray larger than OS buffer for pipe
* [#16978](https://github.com/apache/tvm/pull/16978) - Implement `num_workers` property for `disco.Session`
* [#16989](https://github.com/apache/tvm/pull/16989) - Treat hangup of disco worker process as kShutdown
* [#16993](https://github.com/apache/tvm/pull/16993) - Allow allocation that only exists on worker0
* [#16979](https://github.com/apache/tvm/pull/16979) - Expose disco.Session.shutdown through the python API
* [#16919](https://github.com/apache/tvm/pull/16919) - Improve error message for CallPacked

### Dlight

* [#17082](https://github.com/apache/tvm/pull/17082) - Use 16x32 spatial x reduction thread extents in GEMV scheduling
* [#17052](https://github.com/apache/tvm/pull/17052) - Skip GEMV rules when more than one vector
* [#17026](https://github.com/apache/tvm/pull/17026) - Perf improvement for low_batch_gemv on Metal
* [#17016](https://github.com/apache/tvm/pull/17016) - Update Adreno GEMV Rules
* [#16972](https://github.com/apache/tvm/pull/16972) - [GPU] Enhance opencl thread limit for schedules
* [#16973](https://github.com/apache/tvm/pull/16973) - [GPU] Improved gemv outer fallback schedule
* [#16958](https://github.com/apache/tvm/pull/16958) - Check for target in function attributes
* [#16894](https://github.com/apache/tvm/pull/16894) - Enhance vectorization for gpu matmul
* [#16884](https://github.com/apache/tvm/pull/16884) - Add check for matmul dtype and fix reduction rule

### Docs

* [#17146](https://github.com/apache/tvm/pull/17146) - [DOC] Fix typo for the "We utilize the intermediate representation of nn.Graph to convert the OneFlow model to Reley."
* [#17015](https://github.com/apache/tvm/pull/17015) - [DOC] Update Model Links to Include Commit

### Frontend

* [#17014](https://github.com/apache/tvm/pull/17014) - [ArgParse] Pass default values to target compiler (#13264)
* [#16961](https://github.com/apache/tvm/pull/16961) - [Bugfix][ONNX] Improve broadcast and batch_matmul conversion
* [#16936](https://github.com/apache/tvm/pull/16936) - [TFLite] Add support for GELU conversion

### Hexagon

* [#17123](https://github.com/apache/tvm/pull/17123) - Add support for v75

### LLVM

* [#17046](https://github.com/apache/tvm/pull/17046) - [Arith][SVE] Add rewrite rules for indices split by scalable expressions
* [#16966](https://github.com/apache/tvm/pull/16966) - [SVE] Add support for representing and creating buffer-level predicates
* [#17001](https://github.com/apache/tvm/pull/17001) - [SVE] Use only powers of two as possible vscale values
* [#16962](https://github.com/apache/tvm/pull/16962) - [SVE] Add codegen support for `vscale_range()` function attribute
* [#16968](https://github.com/apache/tvm/pull/16968) - StringRef API deprecation fixes
* [#16965](https://github.com/apache/tvm/pull/16965) - [SVE] Add get_active_lane_mask builtin
* [#16899](https://github.com/apache/tvm/pull/16899) - [SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for `arm_cpu`
* [#16893](https://github.com/apache/tvm/pull/16893) - [SVE] Check for SVE target in VectorizeLoop
* [#16862](https://github.com/apache/tvm/pull/16862) - [SVE] Support splitting by vscale in `tir::split` and `te::split`

### MetaSchedule

* [#17012](https://github.com/apache/tvm/pull/17012) - [BugFix] MultiLevelTilingTensorCore generates inconsistent thread-binding sketch for batched matmul
* [#17066](https://github.com/apache/tvm/pull/17066) - [BugFix] Fix TensorIntrin 'dot_4x4_i8i8s32_sdot' is not registered

### Metal

* [#17059](https://github.com/apache/tvm/pull/17059) - Enable Debug Label
* [#17025](https://github.com/apache/tvm/pull/17025) - Support metal device profiling
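Several of the SVE entries above (buffer-level predicates, the `get_active_lane_mask` builtin, predicated schedules) revolve around one idea: a per-lane mask replaces the scalar tail loop when the trip count is not a multiple of the vector length. A minimal plain-Python model of the concept — illustrative only, not TVM's actual codegen; the helper mirrors the semantics of the LLVM `llvm.get.active.lane.mask` intrinsic:

```python
def get_active_lane_mask(base, limit, vl):
    """Lane i is active when base + i < limit (models llvm.get.active.lane.mask)."""
    return [base + i < limit for i in range(vl)]

def predicated_add(a, b, vl=4):
    """Elementwise add processed in vector-length chunks with no scalar epilogue."""
    n = len(a)
    out = [0] * n
    for base in range(0, n, vl):          # the "vectorized" loop, including the tail
        mask = get_active_lane_mask(base, n, vl)
        for lane in range(vl):            # one predicated SIMD op in hardware
            if mask[lane]:                # inactive lanes do nothing
                out[base + lane] = a[base + lane] + b[base + lane]
    return out

print(predicated_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

The final chunk (element 4 with `vl=4`) is handled by the mask `[True, False, False, False]` rather than a separate tail loop, which is what lets the SVE schedules vectorize loops whose extent involves `vscale`.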
### OpenCL & CLML

* [#16933](https://github.com/apache/tvm/pull/16933) - [CLML] Fix in clml pattern check condition
* [#16929](https://github.com/apache/tvm/pull/16929) - [VM][OPENCL] Take advantage of OpenCL host ptr for improved copy

### ROCm

* [#17141](https://github.com/apache/tvm/pull/17141) - [Backend] Fix error when building TVM with LLVM 19

### Relax

* [#17139](https://github.com/apache/tvm/pull/17139) - Fix cublas dispatch for corner cases
* [#17127](https://github.com/apache/tvm/pull/17127) - [KVCache] Support fork in sliding window sink part
* [#17115](https://github.com/apache/tvm/pull/17115) - Support `input_axis_separator` to allow 2D to 1D conversion
* [#17119](https://github.com/apache/tvm/pull/17119) - [Bugfix] Set purity=false for LazySetOutput
* [#17118](https://github.com/apache/tvm/pull/17118) - [VM] Improved error messages for mismatched parameter count
* [#17110](https://github.com/apache/tvm/pull/17110) - Alloc BYOC workspace with R.builtin.alloc_tensor
* [#17089](https://github.com/apache/tvm/pull/17089) - [ONNX] Add support for HardSigmoid
* [#17100](https://github.com/apache/tvm/pull/17100) - [KVCache] Unlimited depth blocks
* [#17075](https://github.com/apache/tvm/pull/17075) - [Transform] Modify FuseTIR pass to propagate buffer attributes
* [#17088](https://github.com/apache/tvm/pull/17088) - [ONNX] Add support for HardSwish
* [#17085](https://github.com/apache/tvm/pull/17085) - [PyTorch] Add support for torch.nn.Hardsigmoid
* [#17083](https://github.com/apache/tvm/pull/17083) - [TVMScript] Preserve tir.SizeVar through TVMScript round-trip
* [#17086](https://github.com/apache/tvm/pull/17086) - Ignore dynamic parameters in RewriteDataflowReshape
* [#17084](https://github.com/apache/tvm/pull/17084) - [PyTorch] Add support for torch.nn.Hardswish
* [#17074](https://github.com/apache/tvm/pull/17074) - [KVCache][Test] Fix TIR attn kernels for uncommon group size
* [#17067](https://github.com/apache/tvm/pull/17067) - Add missing white spaces in
error messages
* [#17061](https://github.com/apache/tvm/pull/17061) - [Frontend][Onnx] Cast Op special handling for ShapeExpr input
* [#17033](https://github.com/apache/tvm/pull/17033) - [Bugfix] Apply FuseOps to nested DataflowBlock
* [#17032](https://github.com/apache/tvm/pull/17032) - [Bugfix] Annotate ComputePrimValue output as host function
* [#17034](https://github.com/apache/tvm/pull/17034) - [Bugfix] Bind symbolic variables in R.match_cast
* [#16960](https://github.com/apache/tvm/pull/16960) - [UnitTest] Validate IRModule with multiple targets
* [#16995](https://github.com/apache/tvm/pull/16995) - [KVCache] Support KVCache decode from forked sequence and pop more tokens
* [#16959](https://github.com/apache/tvm/pull/16959) - [Transform] Handle identical PrimFunc with distinct VDevice
* [#16589](https://github.com/apache/tvm/pull/16589) - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
* [#16988](https://github.com/apache/tvm/pull/16988) - [KVCache] Fix the aux data syncing order of paged KV cache
* [#16922](https://github.com/apache/tvm/pull/16922) - [BugFix] Change FuseOpsByPattern strategy to pattern-match maximal subgraph
* [#16982](https://github.com/apache/tvm/pull/16982) - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
* [#16955](https://github.com/apache/tvm/pull/16955) - Implement relax.op.view
* [#16971](https://github.com/apache/tvm/pull/16971) - Support nested ModuleList in nn.Module
* [#16826](https://github.com/apache/tvm/pull/16826) - Express dynamic arguments of strided_slice as arguments
* [#16476](https://github.com/apache/tvm/pull/16476) - [Unity][Cutlass] Fix C source generation of dense operation
* [#16940](https://github.com/apache/tvm/pull/16940) - Allow PrimValue as index in relax.op.take
* [#16934](https://github.com/apache/tvm/pull/16934) - [TIR] Introduce new `cumsum` op for gpu
* [#16859](https://github.com/apache/tvm/pull/16859) - [QoL] Use SeqExpr in IR types when SeqExpr is required
* [#16904](https://github.com/apache/tvm/pull/16904) - Prevent generating duplicate funcs in dispatch_sort_scan
* [#16905](https://github.com/apache/tvm/pull/16905) - [Bugfix] Raise exception for OOM allocation
* [#16827](https://github.com/apache/tvm/pull/16827) - Handle binary operations between Tensor and PrimValue
* [#16902](https://github.com/apache/tvm/pull/16902) - Allow specifying entry_funcs for BYOC
* [#16860](https://github.com/apache/tvm/pull/16860) - [QoL] Infer StructInfo for relax::Tuple on construction
* [#16861](https://github.com/apache/tvm/pull/16861) - [QoL] Return well-formed IR from relax::Function::CreateEmpty
* [#16886](https://github.com/apache/tvm/pull/16886) - [Frontend] Fix sort, argsort and topk in nn module
* [#16883](https://github.com/apache/tvm/pull/16883) - Stabilize relax pass mutation order

### Relay

* [#16983](https://github.com/apache/tvm/pull/16983) - [BugFix] Skip leaf args when matching 'path' part for dominator pattern
* [#16996](https://github.com/apache/tvm/pull/16996) - Fix to make TupleGetItem inherit the previous span

### Runtime

* [#17057](https://github.com/apache/tvm/pull/17057) - Stateless interface of PagedKVCache leaf node commit
* [#17049](https://github.com/apache/tvm/pull/17049) - Support PagedKVCache with tree attention
* [#17045](https://github.com/apache/tvm/pull/17045) - Fix PagedKVCache for PopN and enhance tests
* [#16998](https://github.com/apache/tvm/pull/16998) - Compatibility with dmlc::Stream API changes
* [#17037](https://github.com/apache/tvm/pull/17037) - [ROCm] Enable ROCm host memory support
* [#17036](https://github.com/apache/tvm/pull/17036) - Use preferred host memory (pinned memory) in KV cache
* [#16994](https://github.com/apache/tvm/pull/16994) - Allow query of available device memory through DeviceAPI
* [#16997](https://github.com/apache/tvm/pull/16997) - [Disco] Restore checks for hangup of disco pipe
* [#16938](https://github.com/apache/tvm/pull/16938) - Allow offset to be specified in
NDArray::CreateView
* [#16890](https://github.com/apache/tvm/pull/16890) - [VULKAN] Support total_global_memory
* [#16880](https://github.com/apache/tvm/pull/16880) - Implement Datatype.itemsize()

### TIR

* [#17134](https://github.com/apache/tvm/pull/17134) - [Schedule] Remove `@type_check` for `set_axis_separator`
* [#17112](https://github.com/apache/tvm/pull/17112) - [DLight] Enable SimdGroup op for Metal
* [#17098](https://github.com/apache/tvm/pull/17098) - [RPC] Allow RPC calls to compiled PrimFuncs with no arguments
* [#17039](https://github.com/apache/tvm/pull/17039) - Fix bug in VectorizeLoop
* [#17030](https://github.com/apache/tvm/pull/17030) - Fix Shuffle rewrite
* [#16947](https://github.com/apache/tvm/pull/16947) - Support narrow dtype for let binding
* [#16952](https://github.com/apache/tvm/pull/16952) - Enhance CLZ intrinsic support
* [#16945](https://github.com/apache/tvm/pull/16945) - [Compute-at] Simplify the compute-at block when the predicate can be merged
* [#16879](https://github.com/apache/tvm/pull/16879) - Make T.reinterpret a no-op when dtype is the same

### TOPI

* [#17091](https://github.com/apache/tvm/pull/17091) - Add dense schedule for fp16 and fp32 using gemm
* [#17048](https://github.com/apache/tvm/pull/17048) - [SME] Add conv2d NHWC SME fp16->fp32 schedule
* [#17040](https://github.com/apache/tvm/pull/17040) - Fix SME conv2d schedule import and intrin argument
* [#17003](https://github.com/apache/tvm/pull/17003) - [SME] Add conv2d NHWC SME fp32 schedule
* [#16977](https://github.com/apache/tvm/pull/16977) - Remove `blockIdx.z` in topi sort
* [#16951](https://github.com/apache/tvm/pull/16951) - Revert unification of conv2d NHWC hybrid scheduling for `arm_cpu` targets

### TVMScript

* [#17107](https://github.com/apache/tvm/pull/17107) - Better Type Annotation for TIR OP
* [#16967](https://github.com/apache/tvm/pull/16967) - Fix error reporting inside Macro func
* [#16916](https://github.com/apache/tvm/pull/16916) - Support
`T.launch_thread` with i64 dtype
* [#16876](https://github.com/apache/tvm/pull/16876) - Optionally use `ruff format` instead of `black`
* [#16877](https://github.com/apache/tvm/pull/16877) - [Bug] Add test case for missing symbolic bounds

### cuda & cutlass & tensorrt

* [#16980](https://github.com/apache/tvm/pull/16980) - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

### web

* [#17031](https://github.com/apache/tvm/pull/17031) - Fix string to uint8 array for special characters
* [#17028](https://github.com/apache/tvm/pull/17028) - Add dtype and offset for CreateView in runtime
* [#16910](https://github.com/apache/tvm/pull/16910) - Support string[] in setPackedFunc() and exceptionally long arrays

### Misc

* [#17135](https://github.com/apache/tvm/pull/17135) - [QoL][IR] Provide default constructor for NameSupply/GlobalVarSupply
* [#17125](https://github.com/apache/tvm/pull/17125) - [Utils] Define line-length for "ruff format"
* [#17152](https://github.com/apache/tvm/pull/17152) - GraphExecutor: Fix wild pointer assign when input and output are reshape
* [#17150](https://github.com/apache/tvm/pull/17150) - [WebGPU] Fall back to 256MB for maxBufferSize if needed
* [#17128](https://github.com/apache/tvm/pull/17128) - [Compute-inline] Prefer T.where for reverse compute-inlined block with predicate
* [#16976](https://github.com/apache/tvm/pull/16976) - [WebGPU] Implement `tir.dp4a` with WGSL built-in function `dot4I8Packed`
* [#17124](https://github.com/apache/tvm/pull/17124) - [WebGPU] Add `tir.dp4a`
* [#17113](https://github.com/apache/tvm/pull/17113) - [CudaGraph] Handle exceptions thrown while capturing cuda graph
* [#17094](https://github.com/apache/tvm/pull/17094) - [Utility][Container] Support non-nullable types in Array::Map
* [#17101](https://github.com/apache/tvm/pull/17101) - [RPC] Raise error if server process terminated
* [#17092](https://github.com/apache/tvm/pull/17092) - [UnitTests] Use tvm.ir.assert_structural_equal whenever possible
* [#17054](https://github.com/apache/tvm/pull/17054) - [SME] Utilize predication in fp32 matmul and conv2d schedules
* [#17079](https://github.com/apache/tvm/pull/17079) - [CMake] Show NVCC include directories in compile_commands.json
* [#17076](https://github.com/apache/tvm/pull/17076) - [SME] Extract gemm block correctly when fused with bias
* [#17071](https://github.com/apache/tvm/pull/17071) - [WebGPU] Translate `int8x4` into `u32`
* [#17065](https://github.com/apache/tvm/pull/17065) - [FP8][Codegen] Add make_fp8 vector constructors
* [#17064](https://github.com/apache/tvm/pull/17064) - Add docs of v0.15.0 and v0.16.0
* [#16985](https://github.com/apache/tvm/pull/16985) - [CODEGEN] Vector-Codegen support for llvm-pure-intrin
* [#17058](https://github.com/apache/tvm/pull/17058) - Introduce outer reduction for metal
* [#17051](https://github.com/apache/tvm/pull/17051) - Use adapter.info when available instead of requestAdapterInfo
* [#16981](https://github.com/apache/tvm/pull/16981) - [SME] Add scalable fp16->fp32 dense schedule
* [#17029](https://github.com/apache/tvm/pull/17029) - [Contrib] Implement NDArray cache update
* [#17027](https://github.com/apache/tvm/pull/17027) - [picojson] Let objects be ordered when serializing
* [#17021](https://github.com/apache/tvm/pull/17021) - [WebGPU] Update error messages to be more user-friendly
* [#17010](https://github.com/apache/tvm/pull/17010) - Support multinomial_from_uniform dispatch
* [#16999](https://github.com/apache/tvm/pull/16999) - [USMP] Add missing const specifier for global_const_workspace
* [#17005](https://github.com/apache/tvm/pull/17005) - [WebGPU] Handle device OOM in createBuffer
* [#16921](https://github.com/apache/tvm/pull/16921) - [SME] Introduce scalable fp32 dense schedule
* [#16957](https://github.com/apache/tvm/pull/16957) - chore: remove repetitive words
* [#16909](https://github.com/apache/tvm/pull/16909) - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
* [#16987](https://github.com/apache/tvm/pull/16987) - [JVM] Automatic Compatibility of JVM AttachCurrentThread
* [#16974](https://github.com/apache/tvm/pull/16974) - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
* [#16896](https://github.com/apache/tvm/pull/16896) - [CUBLAS] Enable offloading of R.matmul + R.dequantize
* [#16956](https://github.com/apache/tvm/pull/16956) - Add script for testing release package
* [#16908](https://github.com/apache/tvm/pull/16908) - Override StructuralEqual() for easy usage
* [#16932](https://github.com/apache/tvm/pull/16932) - Enable gemv schedule for adreno
* [#16935](https://github.com/apache/tvm/pull/16935) - [3rdparty] Bump FlashInfer for sampling functions
* [#16937](https://github.com/apache/tvm/pull/16937) - [Thrust] Increase static workspace size
* [#16915](https://github.com/apache/tvm/pull/16915) - [Marvell BYOC] Marvell AI Accelerator Integration - Phase 2
* [#16741](https://github.com/apache/tvm/pull/16741) - Restore "pytest.mark.gpu" for RELAX tests
* [#16914](https://github.com/apache/tvm/pull/16914) - [CMAKE] Make LOG_BEFORE_THROW explicit
* [#16913](https://github.com/apache/tvm/pull/16913) - Enhance Release Note Script and Remove Useless File
* [#16907](https://github.com/apache/tvm/pull/16907) - [Upd] Fix lld search in rocm
* [#16900](https://github.com/apache/tvm/pull/16900) - [CMAKE] Misc improvement of Util
* [#16897](https://github.com/apache/tvm/pull/16897) - [Target] Don't register AArch64 target tags without LLVM compiler support
* [#16892](https://github.com/apache/tvm/pull/16892) - [CUBLAS] Set fp32 compute and scale dtypes in fp16 matmul
* [#16888](https://github.com/apache/tvm/pull/16888) - [CUBLAS][FP8] Support e4m3 gemm in cuBLAS BYOC
* [#16887](https://github.com/apache/tvm/pull/16887) - [Contrib] Enable fp16 for thrust sort
* [#16881](https://github.com/apache/tvm/pull/16881) - [release][Dont Squash] Update version to 0.16.0 and 0.17.0.dev on main branch
