[PR] refactor(loader): support concurrent readers, short-id & Graphsrc [incubator-hugegraph-toolchain]

via GitHub Tue, 16 Sep 2025 04:33:57 -0700


sadwitdastreetz opened a new pull request, #683:
URL: https://github.com/apache/incubator-hugegraph-toolchain/pull/683


   ## Purpose of the PR
   
   This PR is part of **updating HugeGraphLoader to 2.0** and, most 
importantly, it is NOT ready yet. It introduces a major refactor and 
enhancement of the HugeGraph Loader, aiming to improve **parallelism, 
stability, and compatibility** during data loading.  
   It includes:
   - More flexible file/HDFS source support
   - Stream-based JDBC fetching (especially for Oracle/MySQL large tables)
   - Short-ID support with schema proxy injection
   - Graph-to-Graph schema synchronization
   - Improved concurrency model for input progress tracking
   
   These changes address issues with performance bottlenecks, Kerberos token 
expiration, Oracle missing rows, and lack of schema compatibility when 
importing from another graph.
   
   ## Main Changes
   
   
     1. Concurrency Testing: Verify thread safety under high load
     2. Graph Source Testing: Validate Graph-to-Graph migration scenarios
     3. Error Recovery: Test failure handling under various error conditions
     4. Performance Benchmarking: Compare throughput with previous 
implementation
     5. Resource Leak Testing: Ensure proper cleanup under error conditions
     
   ### Loader
   
     Refactored HugeGraphLoader with concurrent loading, Graph source support, 
and improved error handling.
   
     Major Changes
   
     Concurrency
   
     - Added multi-threaded loading with ExecutorService and CompletableFuture
     - Configurable parallelism via parallelCount option
     - Scatter mode for distributed loading
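
A minimal sketch of the pattern the Concurrency items describe: a fixed-size `ExecutorService` sized by a `parallelCount` value, with one `CompletableFuture` per input. The class `ParallelLoadSketch` and the methods `loadOne` and `loadAll` are hypothetical names for illustration, not the loader's actual code, and the "load" is a placeholder computation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class ParallelLoadSketch {

    /** Hypothetical stand-in for loading one input; returns a fake record count. */
    static long loadOne(String input) {
        return input.length();
    }

    /** Runs loadOne for every input on a pool of parallelCount threads and sums the counts. */
    static long loadAll(List<String> inputs, int parallelCount) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelCount);
        try {
            // Submit everything first so the inputs actually run concurrently
            List<CompletableFuture<Long>> futures = inputs.stream()
                    .map(in -> CompletableFuture.supplyAsync(() -> loadOne(in), pool))
                    .collect(Collectors.toList());
            // join() rethrows a task failure as CompletionException with the cause attached
            return futures.stream().mapToLong(CompletableFuture::join).sum();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(loadAll(Arrays.asList("vertex.csv", "edge.csv"), 2));
    }
}
```

Collecting the futures before joining matters here: joining inside the same stream pipeline that submits the tasks would serialize them.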
   
     Graph Source Support
   
     - Added Graph-to-Graph data migration
     - Schema replication from source graphs
     - Vertex/edge label migration with property filtering
     - Index replication with field validation
   
     Schema Management
   
     - Automatic graph creation with backend configuration
     - Short ID configuration support
     - Enhanced Groovy script execution
   
     Error Handling
   
     - Improved ServerException handling
     - Better exception chain preservation
     - Modified exception propagation behavior
   
     API Changes
   
     - load() always returns true instead of context.noError()
     - Added checkGraphExists() and setGraphMode() methods
     - Commented out reader.confirmOffset() in loadStruct()
     - Simplified main() error handling
   
     Breaking Changes
   
     - Exception handling behavior modified
     - Resource cleanup sequence changed
     - Return value semantics changed in load()
   
   ### Source Layer
   - **FileSource**: added `dir_filter`, `extra_date_formats`, 
`headerCaseSensitive`, and `splitCount` for flexible directory/file filtering 
and single-file parallel reading.
   - **HDFSSource**: 
     - Kerberos auto-renewal via scheduled task.
     - Replaced prefix matching with `FileFilter + DirFilter`, supports 
recursive directory traversal.
     - Unified error handling with `LoadException`.
   - **GraphSource (new)**: sync schema directly from another HugeGraph 
instance.
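
To illustrate the "Kerberos auto-renewal via scheduled task" item, here is a rough sketch using a `ScheduledExecutorService`. The class name `TicketRenewalSketch` and the renewal action passed in are assumptions; in a real HDFS client the action would presumably re-login through Hadoop's security API, which is not shown here so that the example stays self-contained.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class TicketRenewalSketch implements AutoCloseable {

    private final ScheduledExecutorService scheduler;

    /** Runs renewAction immediately and then once every period. */
    TicketRenewalSketch(Runnable renewAction, long period, TimeUnit unit) {
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "kerberos-renewal");
            t.setDaemon(true); // do not keep the JVM alive just for renewals
            return t;
        });
        this.scheduler.scheduleAtFixedRate(() -> {
            try {
                renewAction.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs
                System.err.println("renewal failed: " + e.getMessage());
            }
        }, 0, period, unit);
    }

    @Override
    public void close() {
        this.scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger renewals = new AtomicInteger();
        // The counter stands in for the real (assumed) re-login call
        try (TicketRenewalSketch ignored = new TicketRenewalSketch(
                renewals::incrementAndGet, 50, TimeUnit.MILLISECONDS)) {
            Thread.sleep(200);
        }
        System.out.println("renewals: " + renewals.get());
    }
}
```

Catching the exception inside the task is the important design point: `scheduleAtFixedRate` suppresses subsequent executions once a run throws.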
   
   ### Reader Layer
   - **FileReader**: refactored `init()` and moved the scan logic into 
`split()`, which yields multiple sub-readers per file.
   - **HDFSFileReader**: fixed a subdirectory traversal bug (the wrong 
variable was used) and added `DirFilter`.
   - **JDBCReader**: replaced `RowFetcher` with streaming `JDBCFetcher` to 
avoid Oracle data loss and improve performance.
   - **FileLineFetcher**: added null checks to avoid NPE.
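
One way a `split()` step could hand a single file to several sub-readers is by dividing it into byte ranges, sketched below. This is an assumption about the approach, not the PR's code: `FileSplitSketch` and `Range` are hypothetical names, and a real reader would also need to align each range to line boundaries, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

class FileSplitSketch {

    /** Half-open byte range [start, end) assigned to one sub-reader. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Splits fileLength bytes into at most splitCount contiguous, non-empty ranges. */
    static List<Range> split(long fileLength, int splitCount) {
        if (fileLength < 0 || splitCount < 1) {
            throw new IllegalArgumentException("fileLength >= 0 and splitCount >= 1 required");
        }
        List<Range> ranges = new ArrayList<>();
        long base = fileLength / splitCount;
        long remainder = fileLength % splitCount;
        long start = 0;
        for (int i = 0; i < splitCount && start < fileLength; i++) {
            // The first `remainder` ranges take one extra byte each
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(split(10, 3));
    }
}
```

Files shorter than `splitCount` bytes produce fewer ranges rather than empty ones, so no sub-reader is created with nothing to read.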
   
   ### Progress Layer
   - **InputProgress**: refactored from single `loadingItem` to `Map<String, 
InputItemProgress>` for multi-file concurrent tracking.  
     - Thread-safety improvements with synchronized maps.  
     - New `markLoaded(Readable, boolean)` API for fine-grained progress 
confirmation.
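
The move from a single loading item to a per-file map can be pictured roughly as follows. This is a simplified sketch under stated assumptions: `InputProgressSketch` and `ItemProgress` are hypothetical stand-ins keyed by file name, and the boolean parameter of the real `markLoaded(Readable, boolean)` is not modeled because the PR does not describe its meaning.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.IntStream;

class InputProgressSketch {

    /** Hypothetical per-file progress record: an offset plus a loaded flag. */
    static final class ItemProgress {
        private long offset;
        private boolean loaded;

        synchronized void advance(long lines) { this.offset += lines; }
        synchronized void markLoaded() { this.loaded = true; }
        synchronized long offset() { return this.offset; }
        synchronized boolean loaded() { return this.loaded; }
    }

    // One entry per file, so concurrent readers never share a single "current item"
    private final Map<String, ItemProgress> items =
            Collections.synchronizedMap(new HashMap<>());

    void advance(String file, long lines) {
        this.items.computeIfAbsent(file, k -> new ItemProgress()).advance(lines);
    }

    void markLoaded(String file) {
        this.items.computeIfAbsent(file, k -> new ItemProgress()).markLoaded();
    }

    long offset(String file) {
        ItemProgress p = this.items.get(file);
        return p == null ? 0L : p.offset();
    }

    boolean loaded(String file) {
        ItemProgress p = this.items.get(file);
        return p != null && p.loaded();
    }

    public static void main(String[] args) {
        InputProgressSketch progress = new InputProgressSketch();
        // Two files advanced from many threads at once
        IntStream.range(0, 1000).parallel()
                 .forEach(i -> progress.advance(i % 2 == 0 ? "a.csv" : "b.csv", 1));
        progress.markLoaded("a.csv");
        System.out.println(progress.offset("a.csv") + " " + progress.loaded("a.csv"));
    }
}
```

The synchronized map only protects the map structure; the per-item counters need their own synchronization, which is why `ItemProgress` guards its fields separately.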
   
   ### Filter Layer
   - **ShortId support**:
     - Added `ShortIdParser`, `ShortIdConfig`.  
     - Schema enhancement via `SchemaManagerProxy` and `VertexLabelProxy` using 
reflection, injecting short-id handling transparently into HugeClient.
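
As a rough illustration of injecting behavior through a reflective proxy, the sketch below wraps an interface with a JDK dynamic proxy that rewrites a string argument before delegating. Everything here is hypothetical: `LabelRegistry` is not HugeClient's API, the rewrite function is a placeholder for whatever short-id handling the PR applies, and the PR's `SchemaManagerProxy` may use a different reflection mechanism.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.function.UnaryOperator;

class ShortIdProxySketch {

    /** Hypothetical stand-in for a schema API; not HugeClient's real interface. */
    interface LabelRegistry {
        String register(String labelName);
    }

    /** Returns a proxy that applies rewrite to the argument of register(), then delegates. */
    static LabelRegistry wrap(LabelRegistry target, UnaryOperator<String> rewrite) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object[] actual = args;
            if (method.getName().equals("register") && args != null && args.length == 1) {
                // Placeholder for short-id handling: callers never see this step
                actual = new Object[]{rewrite.apply((String) args[0])};
            }
            try {
                return method.invoke(target, actual);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the target's own exception
            }
        };
        return (LabelRegistry) Proxy.newProxyInstance(
                LabelRegistry.class.getClassLoader(),
                new Class<?>[]{LabelRegistry.class},
                handler);
    }

    public static void main(String[] args) {
        LabelRegistry plain = name -> "created:" + name;
        LabelRegistry proxied = wrap(plain, name -> name.substring(0, 1));
        System.out.println(proxied.register("person"));
    }
}
```

Callers keep using the same interface, which is what makes the injection transparent; unwrapping `InvocationTargetException` keeps the original exception type visible to them.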
   
   ### Options
   - Extended `LoadOptions` with new cluster, graph, and loading optimization 
flags (`--scatter-sources`, `--short-id`, `--restore`, etc.).  
   - Added `dumpParams()` to log all runtime parameters.
   
   ### Others
   - Added `GlobalExecutorManager` for thread pool management.  
   - Updated `FileLoadTest` to adapt to `InputProgress` refactor.
   
   
   ## Does this PR potentially affect the following parts?
   
   - [x] Modify configurations (`LoadOptions` extended)
   - [x] The public API (Loader command line options & progress API)
   - [ ] Dependencies
   - [ ] Other impacts
   
   ## Documentation Status
   
   - [x] `Doc - TODO` (need to update loader usage doc)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@hugegraph.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

