junrushao created an issue (apache/tvm-ffi#459)

## Context

TVM-FFI currently provides two mechanisms for attaching behavior to types:

| Mechanism | Storage | Lookup | Inheritance | Designed For |
| --------- | ------- | ------ | ----------- | ------------ |
| **Member methods** | Row-major (per-type dict) | Name-based hash | Yes (base -> derived) | Per-type methods (`sum`, `append`, ...) |
| **TypeAttrColumn** | Column-major (dense array indexed by `type_index`) | O(1) pointer offset | No (per-type opt-in) | Builtin dunder dispatch (`__any_hash__`, `__ffi_repr__`, ...) |

TypeAttrColumn is currently a **closed set** — only a handful of builtin
attributes are registered as dense-array columns. The implicit rule is that type
attrs should be "rare and restrained by tvm-ffi package."

This document proposes making type attrs an **open set** backed by hash maps,
while preserving dense-array **fast paths** for the few builtins that genuinely
need them.

## Motivation: IR Trait Dispatch

Consider a compiler IR built on tvm-ffi. Every IR node can have **traits** —
static properties that optimization passes query:

- `SideEffectFree`: Can this node be eliminated if its result is unused?
- `Commutative`: Can operands be reordered?
- `HasCanonicalizer`: Does this node type provide a canonicalization rule?
- `Fusible`: Can this node be fused into a neighboring kernel?

These traits are queried across all node types during compilation:

```cpp
void DCE(const Object* node) {
  if (HasTrait(node, "side_effect_free")) {
    // safe to remove if unused
  }
}
```

Under the current closed-set policy, an IR framework cannot define new type
attrs — it must fall back to member method lookup or reinvent its own dispatch
table.

## Problem with Dense Arrays at Scale

The current TypeAttrColumn uses a dense array indexed by `type_index`. This
works well for a small fixed set of builtins where nearly every type *could*
have the attribute. But it does **not** scale to an open set:

- With 500 registered types and 100 user-defined trait columns, each dense
  array allocates 500 entries (8 bytes each), so the columns together hold
  **400 KB of mostly-null pointers**.
- As the type registry grows (downstream packages registering types), every
  existing column must be resized.
- The sparsity is wasteful: if `Commutative` only applies to 10 out of 500 op
  types, 98% of the column is null.

Dense arrays are the right choice for a small, fixed set of hot-path builtins.
They are the wrong choice for an open, extensible set of user-defined traits.

## Proposed Design: Two-Tier Type Attr Dispatch

### Tier 1: General type attrs via `unordered_map`

The default storage for type attrs is a hash map keyed by `(attr_name,
type_index)`. Any framework can register arbitrary type attrs without
pre-declaring columns:

```cpp
// IR framework registers traits (no EnsureTypeAttrColumn needed)
TypeAttrSet(AddOp::type_index, "ir_trait.commutative", true);
TypeAttrSet(AddOp::type_index, "ir_trait.side_effect_free", true);

// Query: hash map lookup (~20-50 cycles)
AnyView val = TypeAttrGet("ir_trait.commutative", node->type_index());
```

**Performance**: A hash-map lookup costs roughly 20-50 cycles. For IR trait
queries that run millions of times during compilation, this adds on the order
of tens of milliseconds per 1M queries, which is negligible compared to the IR
transformations themselves. This is fast enough for the vast majority of
cross-type dispatch use cases.

**Memory**: Only entries that are actually set are stored. No pre-allocation,
no sparsity waste.

**Open set**: Any downstream framework can define new attrs with any string
key. No allowlist, no pre-registration.

### Tier 2: Specialized fast paths for hot builtins

For the few builtins that sit on genuinely critical paths — where every
object must be dispatched and the overhead compounds — the system provides
specialized fast paths:

| Builtin | Why it's hot | Specialization |
| ------- | ------------ | -------------- |
| `__any_hash__` | Called on every key in every Map operation | Dense array column (existing) |
| `__structural_equal__` | Called on every node pair in IR comparison | Dense array column (existing) |
| `__ffi_repr__` | Called once per object in repr | General tier is fine |
| `__ffi_shallow_copy__` | Called once per object in copy | General tier is fine |

The criterion for tier-2 specialization is empirical: **does hash-map overhead
show up in profiles?** Most dunder methods (repr, copy, serialization) are
called infrequently enough that the general tier suffices. Only a handful
(hash, equality) need the dense-array fast path.

This is analogous to how CPython works: most dunder methods go through
`tp_dict` (hash map), but the most critical ones (`tp_hash`, `tp_richcompare`,
`tp_iter`) have dedicated function pointer slots in the type struct.

### Implementation Sketch

```cpp
class TypeAttrRegistry {
 public:
  // Process-wide singleton accessor (definition omitted in this sketch).
  static TypeAttrRegistry* Global();

  // Tier 1: general hash-map storage
  void Set(int32_t type_index, const std::string& attr_name, Any value) {
    general_[attr_name][type_index] = std::move(value);
  }

  AnyView Get(const std::string& attr_name, int32_t type_index) const {
    auto col_it = general_.find(attr_name);
    if (col_it == general_.end()) return AnyView();
    auto val_it = col_it->second.find(type_index);
    if (val_it == col_it->second.end()) return AnyView();
    return AnyView(val_it->second);
  }

  // Expose a per-attr inner map so callers can cache it for single-probe
  // lookups (see the per-column optimization).
  const std::unordered_map<int32_t, Any>& GetColumn(const std::string& attr_name) {
    return general_[attr_name];
  }

  // Tier 2: dense-array fast path (opt-in per attr)
  // Existing TypeAttrColumn mechanism, unchanged.
  // Used only for __any_hash__, __structural_equal__, etc.

 private:
  // attr_name -> (type_index -> value)
  std::unordered_map<std::string, std::unordered_map<int32_t, Any>> general_;
};
```

For tier-2, the existing `TVMFFITypeAttrColumn` dense-array mechanism is
retained as-is. The two tiers coexist:

- Hot-path builtins are registered in *both* the dense column (for fast C++
  dispatch) and the general map (for uniform Python-side access via
  `getattr`).
- User-defined attrs go only in the general map.

### Optimization: Per-Column Hash Maps

If even the two-level hash lookup (`attr_name` then `type_index`) is too
slow for certain user-defined attrs, a framework can cache the inner map:

```cpp
// Cache the inner (type_index -> value) map once at startup.
static const auto* commutative_map =
    &TypeAttrRegistry::Global()->GetColumn("ir_trait.commutative");

// Fast query: single hash lookup by type_index
bool is_commutative = commutative_map->count(node->type_index()) > 0;
```

This gives ~20-cycle lookup (single `unordered_map` probe on a small integer
key) without requiring a dense array allocation for every registered type.

## What This Achieves

### For tvm-ffi core (Tianqi's concerns)

- **Hot-path builtins stay fast**: `__any_hash__` and `__structural_equal__`
  keep their dense-array O(1) lookup. Zero regression.
- **No sparsity waste**: User-defined attrs use hash maps, not dense arrays.
- **No ABI bloat**: The general tier is a pure C++ implementation detail.
  Dense-array columns remain the ABI for builtins.
- **Clear boundary**: Tier-2 specialization is driven by profiling data, not
  an arbitrary allowlist.

### For downstream frameworks (Junru's concerns)

- **Open set**: Any framework can define new type attrs with any string key.
  No gatekeeping.
- **Uniform API**: `TypeAttrGet(name, type_index)` works for both builtins and
  user-defined attrs. Python-side `getattr` works for everything.
- **No confusion**: Users don't need to know about the two tiers. They define
  attrs, and the system routes them to the right storage.
- **IR traits work**: `SideEffectFree`, `Commutative`, etc. can be defined as
  type attrs and queried efficiently across all node types.

## Python API

```python
@ffi.py_class
class MyIRNode(ffi.Object):
    # Trait: registered as general type attr
    ir_trait = SideEffectFree

    # Repr: registered as type attr (tier-1 general map is fine for repr)
    def __ffi_repr__(self, fn_print):
        return f"MyIRNode(...)"

    # Normal method: registered as member method (row store)
    def simplify(self):
        ...
```

The `@ffi.py_class` decorator routes:

- `__ffi_*__` methods -> type attrs (general tier; tier-2 for known hot
  builtins)
- Class-level trait assignments -> type attrs (general tier)
- Everything else -> member methods

## Comparison with CPython

CPython uses the same two-tier approach:

| Aspect | CPython | Proposed TVM-FFI |
| ------ | ------- | ---------------- |
| General methods | `tp_dict` (hash map) | General type attr map |
| Hot slots | `tp_hash`, `tp_richcompare`, `tp_iter`, ... | Dense-array columns for `__any_hash__`, ... |
| Criterion | Fixed set of ~40 type slots | Empirically profiled hot builtins |
| Open/closed | Open (any method in `tp_dict`) | Open (any attr in general map) |

## Summary

The current TypeAttrColumn system conflates two concerns:

1. **Open extensibility**: Downstream frameworks need to define new type attrs.
2. **Hot-path performance**: A few builtins need sub-10-cycle dispatch.

Dense arrays solve (2) but block (1). Hash maps solve (1) but are too slow for
(2). The solution is both:

- **General tier** (`unordered_map`): Open set, no sparsity, ~20-50 cycle
  lookup. Good enough for IR traits, repr, copy, serialization, and the vast
  majority of cross-type dispatch.
- **Specialized tier** (dense array): Closed set of profiled hot builtins.
  ~4 cycle lookup. Reserved for `__any_hash__`, `__structural_equal__`, and
  any future builtin that shows up in profiles.

This gives Tianqi's performance guarantee where it matters, and Junru's open
extensibility everywhere else.

