kaori-seasons commented on issue #341:
URL: https://github.com/apache/doris-spark-connector/issues/341#issuecomment-3482896201

   The following is my proposed solution for this issue; I will begin by implementing the mid-term plan.
   
   ## I. Current Status Assessment
   
   
   ### 1.1 Current Support
   
   **Write Path** supports three formats:
   
   1. **Arrow Format** (Recommended, fully supported)
      - `DorisArrayWriter` class implements Arrow serialization for arrays
      - Supports nested arrays through recursive `createFieldWriter` 
implementation
      - Schema conversion maps Spark `ArrayType` to the Arrow List type
   
   2. **CSV/JSON Format** (Limited support)
      - Arrays are converted to a string representation such as `[1,2,3]`
   
   3. **Test Verification**
      - Complete integration tests covering all basic array types
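
   For reference, the CSV/JSON write path's array handling amounts to flattening each array into a bracketed string. A minimal, stdlib-only sketch of that behavior follows; the class and method names are illustrative, not the connector's actual API, and `\N` as the CSV null marker is an assumption based on Doris CSV load conventions:

    ```java
    import java.util.List;
    import java.util.stream.Collectors;

    public class ArrayCsvSketch {
        // Renders a list as the bracketed string form used for ARRAY columns
        // in CSV loads, e.g. [1,2,3]. Illustrative only, not connector code.
        static String arrayToCsvField(List<?> values) {
            if (values == null) {
                return "\\N"; // assumed Doris CSV null marker
            }
            return values.stream()
                    .map(v -> v == null ? "null" : v.toString())
                    .collect(Collectors.joining(",", "[", "]"));
        }

        public static void main(String[] args) {
            System.out.println(arrayToCsvField(List.of(1, 2, 3))); // [1,2,3]
        }
    }
    ```

   Because the result is just a string, reading it back yields `StringType`, which is the read/write asymmetry described below.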
   
   ### 1.2 Main Issues
   
   **Read Path Limitations**:
   - When reading from Doris, the ARRAY type is mapped to `StringType`
   - Arrow list data is converted to a JSON string via `toString()`
   
   ## II. Detailed Development Plan
   
   
   ### 2.1 Mid-term Solution (Enhanced Read Support)
   
   #### Goal: Implement native ArrayType support in read path
   
   **Development Task List:**
   
   1. **Modify Schema Mapping Logic**
      - Update type mapping in `SchemaConvertors.scala`
      - Map ARRAY type to `ArrayType(elementType)` instead of `StringType`
      - Parse Doris array element type information
   
   2. **Enhance RowBatch Conversion Logic**
      - Modify array processing logic in `RowBatch.java`
      - Recursively parse Arrow ListVector to native Spark array objects
      - Handle nested array cases
   
   3. **Implement New Array Converter**
    ```java
    // Pseudo-code example; Arrow imports shown for context
    import org.apache.arrow.vector.complex.ListVector;
    import org.apache.arrow.vector.complex.impl.UnionListReader;
    import org.apache.spark.sql.types.DataType;

    import java.util.ArrayList;
    import java.util.List;

    private Object convertListVector(ListVector listVector, int rowIndex,
                                     DataType elementType) {
        // A null list cell maps to a Spark null
        if (listVector.isNull(rowIndex)) {
            return null;
        }

        List<Object> result = new ArrayList<>();
        UnionListReader reader = listVector.getReader();
        reader.setPosition(rowIndex);

        // Iterate the elements of the list at this row; convertElement
        // would dispatch on elementType and recurse for nested arrays
        while (reader.next()) {
            result.add(convertElement(reader.reader(), elementType));
        }

        return result.toArray();
    }
    ```
   
   
   4. **Add Unit and Integration Tests**
      - Test read/write for various array types
      - Test nested arrays
      - Test null value handling
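
   Task 1's schema-mapping change can be sketched with a stdlib-only model of recursively parsing a Doris type string into a Spark type name. The method name `dorisTypeToSpark` and the string-based return value are illustrative; the real change belongs in `SchemaConvertors.scala` and returns Spark `DataType` objects:

    ```java
    public class TypeMappingSketch {
        // Recursively maps a Doris column type string to a Spark SQL type
        // name. Only a few primitives are shown for brevity.
        static String dorisTypeToSpark(String dorisType) {
            String t = dorisType.trim();
            String upper = t.toUpperCase();
            if (upper.startsWith("ARRAY<") && upper.endsWith(">")) {
                // Peel one ARRAY<...> layer and recurse on the element type
                String element = t.substring("ARRAY<".length(), t.length() - 1);
                return "ArrayType(" + dorisTypeToSpark(element) + ")";
            }
            switch (upper) {
                case "INT":    return "IntegerType";
                case "BIGINT": return "LongType";
                case "STRING": return "StringType";
                default:       return "StringType"; // current fallback behavior
            }
        }

        public static void main(String[] args) {
            // Nested arrays resolve through the same recursion
            System.out.println(dorisTypeToSpark("ARRAY<ARRAY<INT>>"));
            // → ArrayType(ArrayType(IntegerType))
        }
    }
    ```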
   
   **Effort Estimation:**
   - Schema mapping refactoring: 2-3 days
   - RowBatch conversion logic: 3-5 days
   - Test case development: 2-3 days
   - Total: 7-11 days
   
   
   ### 2.2 Long-term Solution (Complete Complex Type Support)
   
   #### Goal: Unified complex type handling for read/write paths
   
   1. **Refactor Type System**
      - Establish unified Doris-Spark type mapping framework
      - Support arbitrarily nested complex types (Array, Map, Struct)
      - Support type inference and automatic conversion
   
   2. **Optimize Arrow Processing**
      - Implement zero-copy Arrow data conversion
      - Optimize memory usage
   
   3. **Enhance CSV/JSON Format Support**
      - Improve string serialization format for arrays
      - Add configuration options to control complex type serialization
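
   For item 3, one possible shape of a configurable serializer for nested complex types; the `quoteStrings` option and class name are hypothetical, not existing connector options:

    ```java
    import java.util.List;

    public class ComplexTypeSerializerSketch {
        private final boolean quoteStrings; // hypothetical config option

        ComplexTypeSerializerSketch(boolean quoteStrings) {
            this.quoteStrings = quoteStrings;
        }

        // Recursively serializes nested lists into bracketed string form,
        // optionally quoting string elements so embedded commas survive.
        String serialize(Object value) {
            if (value == null) {
                return "null";
            }
            if (value instanceof List) {
                List<?> list = (List<?>) value;
                StringBuilder sb = new StringBuilder("[");
                for (int i = 0; i < list.size(); i++) {
                    if (i > 0) sb.append(",");
                    sb.append(serialize(list.get(i)));
                }
                return sb.append("]").toString();
            }
            if (value instanceof String && quoteStrings) {
                String s = (String) value;
                return "\"" + s.replace("\"", "\\\"") + "\"";
            }
            return value.toString();
        }

        public static void main(String[] args) {
            ComplexTypeSerializerSketch s = new ComplexTypeSerializerSketch(true);
            System.out.println(s.serialize(List.of(List.of("a,b", "c"), List.of("d"))));
            // → [["a,b","c"],["d"]]
        }
    }
    ```

   Exposing the quoting behavior as a sink option would let users keep today's unquoted output for backward compatibility.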
   
   ## III. Implementation Recommendations
   
   ### 3.1 Priority Ranking
   
   1. **P0 - Immediate Action**:
      - Improve documentation to clarify Arrow format support for array writing
      - Add more code examples and best practices
   
   2. **P1 - Within 3 Months**:
      - Implement native ArrayType support in read path
      - Resolve read/write asymmetry
   
   3. **P2 - Within 6 Months**:
      - Complete complex type system refactoring
      - Performance optimization
   
   ### 3.2 Risk Assessment
   
   **Technical Risks:**
   - Arrow version compatibility issues
   - Differences in ARRAY type implementation across Doris versions
   - Performance regression risk
   
   
   **Mitigation Measures:**
   - Comprehensive compatibility testing
   - Performance benchmarking
   - Provide fallback options
   
   ### 3.3 Testing Strategy
   
   1. **Unit Tests**
      - Serialization/deserialization for various array types
      - Edge cases (empty arrays, null, large arrays)
      - Nested arrays
   
   
   2. **Integration Tests**
      - Compatibility with different Doris versions
      - Large data volume testing
      - Concurrent write testing
   
   3. **Performance Tests**
      - Compare performance of CSV/JSON/Arrow formats
      - Memory usage analysis
      - Throughput testing
   
   ## IV. Code Examples
   
   ### 4.1 Currently Available Solution
   
   ```scala
   // 1. Create DataFrame with arrays
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.Row
   
   val schema = StructType(Seq(
     StructField("id", IntegerType),
     StructField("arr_int", ArrayType(IntegerType)),
     StructField("arr_string", ArrayType(StringType)),
     StructField("arr_nested", ArrayType(ArrayType(IntegerType)))
   ))
   
   val data = Seq(
     Row(1, Array(1, 2, 3), Array("a", "b"), Array(Array(1, 2), Array(3, 4))),
     Row(2, Array(4, 5, 6), Array("c", "d"), Array(Array(5, 6), Array(7, 8)))
   )
   
   val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
   
   // 2. Write to Doris using Arrow format
   df.write
     .format("doris")
     .option("doris.fenodes", "localhost:8030")
     .option("doris.table.identifier", "test_db.test_array_table")
     .option("doris.user", "root")
     .option("doris.password", "")
     .option("doris.sink.properties.format", "arrow")
     .option("doris.sink.batch.size", "1024")
     .option("doris.sink.max-retries", "3")
     .save()
   ```
   
   ### 4.2 Doris Table Creation SQL
   
   ```sql
   CREATE TABLE test_db.test_array_table (
       id INT NOT NULL,
       arr_int ARRAY<INT>,
       arr_string ARRAY<STRING>,
       arr_nested ARRAY<ARRAY<INT>>
   ) ENGINE=OLAP
   DUPLICATE KEY(id)
   DISTRIBUTED BY HASH(id) BUCKETS 10
   PROPERTIES (
       "replication_num" = "1"
   );
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

