jhrotko commented on PR #892:
URL: https://github.com/apache/arrow-java/pull/892#issuecomment-3501493422
# Design Analysis: Extension Type Writer Implementation
## Executive Summary
This PR represents a significant architectural improvement in how Arrow Java
handles extension type writers. The change moves from a **factory-based
pattern** to a **type-based pattern**, where the `ArrowType.ExtensionType`
itself provides the writer implementation. This simplifies the API, improves
extensibility, and makes it easier to implement custom extension types outside
of arrow-java.
---
## Problem Statement
### The Original Challenge
In Arrow's type system, each `MinorType` (INT, FLOAT, VARCHAR, etc.) has a
corresponding writer implementation. However, **extension types** are special:
- They all share the same `MinorType.EXTENSIONTYPE`
- But each extension type (UUID, Opaque, custom types) needs a **different
writer implementation**
- The system needed a way to map from a specific extension type to its writer
### Previous Solution: Factory Pattern
The original implementation (commits `34060eb4`, `7a7e4edd`, `7fe36d70`,
`8663ffc6`) used an `ExtensionTypeWriterFactory` pattern:
```java
// Old approach - Factory pattern
public interface ExtensionTypeWriterFactory {
FieldWriter createWriter(ValueVector vector);
}
// Usage in ComplexCopier
writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
writer.writeExtension(value);
```
**How it worked:**
1. Each extension type had a separate factory class (e.g.,
`UuidWriterFactory`)
2. The factory was passed around as a parameter
3. Writers stored factories and used them to create the appropriate writer
4. TransferPair implementations also needed to carry factories
---
## Problems with the Factory Pattern
### 1. **Scalability Issues**
- Every new extension type required creating a new factory class
- Factories had to be passed through multiple layers of the API
- TransferPair implementations became complex with factory parameters
### 2. **External Extension Types**
For developers implementing extension types **outside of arrow-java**:
```java
// Developer needs to create TWO classes:
class MyCustomType extends ExtensionType { ... }
class MyCustomWriterFactory implements ExtensionTypeWriterFactory { ... }
// And manage both separately
```
This created unnecessary complexity and coupling.
### 3. **API Complexity**
Methods proliferated with factory parameters:
```java
// Old API
ComplexCopier.copy(reader, writer, extensionTypeWriterFactory);
writer.addExtensionTypeWriterFactory(factory);
TransferPair.makeTransferPair(target, factory);
```
### 4. **Tight Coupling**
The factory pattern created tight coupling between:
- The type definition
- The writer implementation
- The factory that connects them
- All the code that needs to pass factories around
---
## New Solution: Type-Based Pattern
### Core Design Change
Add a single abstract method to `ArrowType.ExtensionType`:
```java
public abstract class ExtensionType extends ArrowType {
// NEW METHOD
public abstract FieldWriter getNewFieldWriter(ValueVector vector);
// Existing methods
public abstract ArrowType storageType();
public abstract String extensionName();
public abstract FieldVector getNewVector(...);
// ...
}
```
### Implementation Example
```java
public class UuidType extends ExtensionType {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new UuidWriterImpl((UuidVector) vector);
}
// Other methods...
}
```
---
## Benefits of the New Design
### 1. **Simplicity**
- **One class per extension type** instead of two (type + factory)
- The type itself knows how to create its writer
- Natural object-oriented design
### 2. **Cleaner API**
```java
// New API - no factory parameters needed
ComplexCopier.copy(reader, writer);
writer.writeExtension(value, type);
// The type provides the writer internally
FieldWriter writer = extensionType.getNewFieldWriter(vector);
```
### 3. **Better Extensibility**
External developers only need to implement one class:
```java
public class MyCustomType extends ExtensionType {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new MyCustomWriterImpl((MyCustomVector) vector);
}
// Implement other required methods...
}
```
### 4. **Consistency with Existing Patterns**
This mirrors how `MinorType` already works:
```java
// MinorType enum (existing pattern)
public enum MinorType {
INT(new Int(...)) {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new IntWriterImpl((IntVector) vector);
}
},
// ...
}
// ExtensionType (new pattern - same approach)
public class UuidType extends ExtensionType {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new UuidWriterImpl((UuidVector) vector);
}
}
```
### 5. **Decoupling**
- Writers no longer need to store or manage factories
- TransferPair implementations simplified
- Type information flows naturally through the `ArrowType` object
---
## Key Changes in the Codebase
### 1. **ArrowType.ExtensionType** (Core Change)
```java
// Added abstract method
public abstract FieldWriter getNewFieldWriter(ValueVector vector);
```
### 2. **Extension Type Implementations**
**UuidType:**
```java
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new UuidWriterImpl((UuidVector) vector);
}
```
**OpaqueType:**
```java
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
throw new UnsupportedOperationException("WriterImpl not yet
implemented.");
}
```
### 3. **UnionWriter** (Now Supports Extension Types)
```java
// OLD: Threw UnsupportedOperationException
private ExtensionWriter getExtensionWriter(ArrowType arrowType) {
throw new UnsupportedOperationException("ExtensionTypes are not
supported yet.");
}
// NEW: Uses type to get writer
private Map<ArrowType, ExtensionWriter> extensionWriters = new HashMap<>();
private ExtensionWriter getExtensionWriter(ArrowType arrowType) {
ExtensionWriter w = extensionWriters.get(arrowType);
if (w == null) {
w = ((ExtensionType)
arrowType).getNewFieldWriter(data.getExtension(arrowType));
w.setPosition(idx());
extensionWriters.put(arrowType, w);
}
return w;
}
```
### 4. **ComplexCopier** (Simplified)
```java
// OLD: Required factory parameter
case EXTENSIONTYPE:
if (extensionTypeWriterFactory == null) {
throw new IllegalArgumentException("Must provide
ExtensionTypeWriterFactory");
}
writer.addExtensionTypeWriterFactory(extensionTypeWriterFactory);
writer.writeExtension(value);
break;
// NEW: Type provides writer
case EXTENSIONTYPE:
if (reader.isSet()) {
Object value = reader.readObject();
if (value != null) {
writer.writeExtension(value);
}
}
break;
```
### 5. **PromotableWriter** (Simplified)
```java
// OLD: Used UnionExtensionWriter
case EXTENSIONTYPE:
writer = new UnionExtensionWriter((ExtensionTypeVector) vector);
break;
// NEW: Type provides writer
case EXTENSIONTYPE:
writer = ((ExtensionType)
vector.getField().getType()).getNewFieldWriter(vector);
break;
```
---
## Migration Impact
### For Arrow Java Developers
- **Removed:** `ExtensionTypeWriterFactory` interface and implementations
- **Removed:** Factory parameters from APIs
- **Added:** `getNewFieldWriter()` method to extension types
### For External Extension Type Developers
**Before:**
```java
// Two classes needed
class MyType extends ExtensionType { ... }
class MyWriterFactory implements ExtensionTypeWriterFactory { ... }
```
**After:**
```java
// One class needed
class MyType extends ExtensionType {
@Override
public FieldWriter getNewFieldWriter(ValueVector vector) {
return new MyWriterImpl((MyVector) vector);
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]