sadwitdastreetz opened a new pull request, #683: URL: https://github.com/apache/incubator-hugegraph-toolchain/pull/683
## Purpose of the PR

This PR is part of **updating HugeGraphLoader to 2.0** and, most importantly, it is NOT ready yet.

It introduces a major refactor and enhancement to the HugeGraph Loader, aiming to improve **parallelism, stability, and compatibility** during data loading. It includes:

- More flexible file/HDFS source support
- Stream-based JDBC fetching (especially for Oracle/MySQL large tables)
- Short-ID support with schema proxy injection
- Graph-to-Graph schema synchronization
- Improved concurrency model for input progress tracking

These changes address issues with performance bottlenecks, Kerberos token expiration, Oracle missing rows, and lack of schema compatibility when importing from another graph.

## Main Changes

1. Concurrency Testing: Verify thread safety under high load
2. Graph Source Testing: Validate Graph-to-Graph migration scenarios
3. Error Recovery: Test failure handling under various error conditions
4. Performance Benchmarking: Compare throughput with previous implementation
5. Resource Leak Testing: Ensure proper cleanup under error conditions

### Loader

Refactored HugeGraphLoader with concurrent loading, Graph source support, and improved error handling.
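
As a rough illustration of the concurrent loading mentioned above, the sketch below runs one task per input on a fixed-size pool and waits for all of them. It is not the PR's code: `ParallelLoadSketch`, `loadAll`, and the `loadOne` callback are hypothetical names, and `parallelCount` is a plain method parameter here, not a claim about how the loader wires its option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToLongFunction;

// Hypothetical sketch, not the actual HugeGraphLoader implementation.
final class ParallelLoadSketch {

    // Runs `loadOne` for every input on a pool of `parallelCount` threads
    // and returns the sum of the record counts it reports.
    static <T> long loadAll(List<T> inputs, int parallelCount,
                            ToLongFunction<T> loadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelCount);
        try {
            List<CompletableFuture<Long>> futures = new ArrayList<>();
            for (T input : inputs) {
                futures.add(CompletableFuture.supplyAsync(
                        () -> loadOne.applyAsLong(input), pool));
            }
            long total = 0L;
            for (CompletableFuture<Long> future : futures) {
                // join() rethrows a task failure as CompletionException,
                // keeping the original exception as its cause.
                total += future.join();
            }
            return total;
        } finally {
            // Release the worker threads on both success and failure.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in inputs; the "load" here only counts characters.
        List<String> structs = Arrays.asList("person.csv", "knows.csv");
        long loaded = loadAll(structs, 2, name -> name.length());
        System.out.println("records loaded: " + loaded);
    }
}
```

Waiting with `join()` means a failure in any task reaches the caller with its cause intact, and the `finally` block shuts the pool down whether or not a task failed.
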
#### Major Changes

**Concurrency**

- Added multi-threaded loading with `ExecutorService` and `CompletableFuture`
- Configurable parallelism via the `parallelCount` option
- Scatter mode for distributed loading

**Graph Source Support**

- Added Graph-to-Graph data migration
- Schema replication from source graphs
- Vertex/edge label migration with property filtering
- Index replication with field validation

**Schema Management**

- Automatic graph creation with backend configuration
- Short ID configuration support
- Enhanced Groovy script execution

**Error Handling**

- Improved `ServerException` handling
- Better exception chain preservation
- Modified exception propagation behavior

**API Changes**

- `load()` always returns `true` instead of `context.noError()`
- Added `checkGraphExists()` and `setGraphMode()` methods
- Commented out `reader.confirmOffset()` in `loadStruct()`
- Simplified `main()` error handling

**Breaking Changes**

- Exception handling behavior modified
- Resource cleanup sequence changed
- Return value semantics changed in `load()`

### Source Layer

- **FileSource**: added `dir_filter`, `extra_date_formats`, `headerCaseSensitive`, and `splitCount` for flexible directory/file filtering and single-file parallel reading.
- **HDFSSource**:
  - Kerberos auto-renewal via scheduled task.
  - Replaced prefix matching with `FileFilter + DirFilter`; supports recursive directory traversal.
  - Unified error handling with `LoadException`.
- **GraphSource (new)**: sync schema directly from another HugeGraph instance.

### Reader Layer

- **FileReader**: refactored `init()` and moved scan logic into `split()` → multiple sub-readers per file.
- **HDFSFileReader**: fixed subdirectory traversal bug (using wrong variable), added `DirFilter`.
- **JDBCReader**: replaced `RowFetcher` with streaming `JDBCFetcher` to avoid Oracle data loss and improve performance.
- **FileLineFetcher**: added null checks to avoid NPE.
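
To make the "multiple sub-readers per file" idea concrete, here is a minimal sketch of dividing one file into contiguous byte ranges, one per sub-reader. The PR text does not say how `split()` chooses its boundaries, so this is an assumption for illustration only; `FileSplitSketch`, `Range`, and `plan` are hypothetical names, and the `splitCount` parameter merely echoes the option name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the PR does not show how split() picks boundaries.
final class FileSplitSketch {

    // A half-open byte range [start, end) handled by one sub-reader.
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }
    }

    // Divides `fileLength` bytes into at most `splitCount` contiguous,
    // non-empty ranges whose sizes differ by no more than one byte.
    static List<Range> plan(long fileLength, int splitCount) {
        if (fileLength < 0 || splitCount < 1) {
            throw new IllegalArgumentException(
                    "fileLength must be >= 0 and splitCount >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        long base = fileLength / splitCount;
        long remainder = fileLength % splitCount;
        long start = 0L;
        // Stop early when the file has fewer bytes than requested splits.
        for (int i = 0; i < splitCount && start < fileLength; i++) {
            // The first `remainder` ranges each take one extra byte.
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : plan(10, 3)) {
            System.out.println("[" + r.start + ", " + r.end + ")");
        }
    }
}
```

A line-oriented reader would additionally have to move each boundary to the next line break so that no record is cut in half; that step is left out of the sketch.
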
### Progress Layer

- **InputProgress**: refactored from single `loadingItem` to `Map<String, InputItemProgress>` for multi-file concurrent tracking.
- Thread-safety improvements with synchronized maps.
- New `markLoaded(Readable, boolean)` API for fine-grained progress confirmation.

### Filter Layer

- **ShortId support**:
  - Added `ShortIdParser`, `ShortIdConfig`.
  - Schema enhancement via `SchemaManagerProxy` and `VertexLabelProxy` using reflection, injecting short-id handling transparently into HugeClient.

### Options

- Extended `LoadOptions` with new cluster, graph, and loading optimization flags (`--scatter-sources`, `--short-id`, `--restore`, etc.).
- Added `dumpParams()` to log all runtime parameters.

### Others

- Added `GlobalExecutorManager` for thread pool management.
- Updated `FileLoadTest` to adapt to `InputProgress` refactor.

## Does this PR potentially affect the following parts?

- [x] Modify configurations (`LoadOptions` extended)
- [x] The public API (Loader command line options & progress API)
- [ ] Dependencies
- [ ] Other impacts

## Documentation Status

- [x] `Doc - TODO` (need to update loader usage doc)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@hugegraph.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org