This is an automated email from the ASF dual-hosted git repository.
binjieyang pushed a change to branch CELEBORN-1768
in repository https://gitbox.apache.org/repos/asf/celeborn.git
discard 52da4bd35 add apache license
discard 1671c86c2 fix doc style
discard becc6b065 [CELEBORN-1768][WRITER] Refactoring Shuffle Writer to
extract common methods
add 19fecadcd [CELEBORN-1413][FOLLOWUP] Bump zstd-jni version to 1.5.6-5
for 4.0.0-preview2
add 2962c1149 [CELEBORN-1823] Remove unused
remote-shuffle.job.min.memory-per-partition and
remote-shuffle.job.min.memory-per-gate
add b74e05b60 [CELEBORN-1821][CIP-14] Add controlMessages to cppClient
add df2512994 [CELEBORN-1482][CIP-8] Add partition meta handler
add 893a7449e [CELEBORN-1830] Chart statefulset resources key duplicate
add eb950c82e [CELEBORN-1827][CIP-14] Add messageDecoder to cppClient
add 45450e793 [CELEBORN-1832] MapPartitionData should create fixed thread
pool with registration of ThreadPoolSource
add ad933815b [CELEBORN-1720] Prevent stage re-run if another task attempt
is running or successful
add ac0d335f4 [CELEBORN-1831] Add ratis commitIndex metrics
add 35a14d246 [CELEBORN-1836][CIP-14] Add Message to cppClient
add f2751c280 [CELEBORN-1829] Replace waitThreadPoll's thread pool with
ScheduledExecutorService in Controller
add 6bd0cfe2f [CELEBORN-1720][FOLLOWUP] Fix compilation error of
CelebornTezReader for ShuffleClient#readPartition
add 30e46eee2 [CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
add 9131c1e07 [CELEBORN-1792] MemoryManager resume should use
pinnedDirectMemory instead of usedDirectMemory
add 39a40dd2a [CELEBORN-1845][CIP-14] Add MessageDispatcher to cppClient
add f28ba6e72 [CELEBORN-1810] Using Operation description instead of
ApiResponse description for RESTful APIs
add a77a64b89 [CELEBORN-1835][CIP-8] Add tier writer base and memory tier
writer
add 75b697d81 [CELEBORN-1838] Interrupt spark task should not report fetch
failure
add fdf1883f2 [CELEBORN-1850] Setup worker endpoint after initializing
controller
add 2b0a75587 [CELEBORN-1851] Disable spark ui when run celeborn spark-it
add e78c9b8ab [CELEBORN-1721][FOLLOWUP] Fix the problem of getting
partition location in ShuffleClientImpl during soft split
add 2e4f36f9d [CELEBORN-1792][FOLLOWUP] Suppress noisy logs when there is
no memory pressure
add b9e4bbb5a [MINOR] Change some config version
add 6f7647e4b [CELEBORN-1847][CIP-8] Introduce local and DFS tier writer
add 9f8a89e61 [CELEBORN-1841] Support custom implementation of
EventExecutorChooser to avoid deadlock when calling await in EventLoop thread
add 749a9798e [CELEBORN-1854] Change receive revive request log level to
debug
add 1455b6e2f [CELEBORN-1860] Remove unused
celeborn.<module>.io.enableVerboseMetrics option
add f9526021c [CELEBORN-1846] Fix the StreamHandler usage in fetching
chunk when task attempt is odd
add c45197c0c [CELEBORN-1843] Optimize roundrobin for more balanced disk
slot allocation
add b5c00ea64 [CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
add 113c7eadb [CELEBORN-1863][CIP-14] Add TransportClient to cppClient
add 2dd26936e [CELEBORN-1864] Bump Netty version from 4.1.115.Final to
4.1.118.Final
add 6a836f952 [CELEBORN-1859] DfsPartitionReader and LocalPartitionReader
should reuse pbStreamHandlers get from BatchOpenStream request
add fc459c0f7 [CELEBORN-1757] Add retry when sending RPC to
LifecycleManager
add 5b507aed7 [CELEBORN-1872] Bump Flink from 1.19.1, 1.20.0 to 1.19.2,
1.20.1
add f1eec656f [CELEBORN-1866][FLINK] Fix CelebornChannelBufferReader
request more buffers than needed
add 2097fcdfe [CELEBORN-1870] Fix typos in 'Developer' documents
add 7ca69e200 [CELEBORN-1861] Support
celeborn.worker.storage.baseDir.diskType option to specify disk type of base
directory for worker
add d659e06d4 [CELEBORN-1319] Optimize skew partition logic for Reduce
Mode to avoid sorting shuffle files
add 27c6605c4 [CELEBORN-1865] Update master endpointRef when master leader
is abnormal
add f09482108 [CELEBORN-1867][FLINK] Fix flink client memory leak of
TransportResponseHandler#outstandingRpcs for handling addCredit and
notifyRequiredSegment response
add e244d01af [CELEBORN-1876] Log remote address on RPC exception for
TransportRequestHandler
add 8c04e5e8a [CELEBORN-1871][CIP-14] Add NettyRpcEndpointRef to cppClient
add 79b49805e [CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
add fc056a3c3 [CELEBORN-1875] Support to get workers topology information
with RESTful api
add e10aefd04 [CELEBORN-1857] Support LocalPartitionReader read partition
by chunkOffsets when enable optimize skew partition read
add 18655e286 [CELEBORN-1879] Ignore invalid chunk range generated by
splitSkewedPartitionLocations
add ef30c2391 [CELEBORN-1858] Support DfsPartitionReader read partition by
chunkOffsets when enable optimize skew partition read
add a4ce369ee [CELEBORN-1883] Replace HashSet with
ConcurrentHashMap.newKeySet for ShuffleFileGroups
add cc501928c [CELEBORN-1875][FOLLOWUP] Support master
--show-workers-topology command to show registered workers topology
add 44d772df7 [CELEBORN-1882] Support configuring the SSL handshake
timeout for SSLHandler
add d90cf0d42 [CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
add 660cf24de [CELEBORN-1319][FOLLOWUP] Fix IndexOutOfBoundsException when
using old celeborn client
add 3a83ac769 [CELEBORN-1889] Fix scala 2.13 compile error
add 15e34eca6 [CELEBORN-1890] Bump Spark from 3.5.4 to 3.5.5
add e85207e2c [CELEBORN-1413][FOLLOWUP] Rename `celeborn-client-spark-3-4`
back to `celeborn-client-spark-3`
add fa4327e09 [CELEBORN-1885] Fix nullptr exceptions in FetchChunk after
worker restart
add 196ad607c [CELEBORN-1792][FOLLOWUP] Keep resume for a while after
resumeByPinnedMemory
add 6f5ad2dde [MINOR] Refine the log for fetch failure and rpc metrics dump
add 3d05c8998 [CELEBORN-1895] Bump log4j2 version to 2.24.3
add 18b268d08 [CELEBORN-1897] Avoid calling toString for too long messages
add df809159d [CELEBORN-1898] SparkOutOfMemoryError compatible with Spark
4.0 and 4.1
add 595ab41f5 [CELEBORN-1881][CIP-14] Add WorkerPartitionReader to
cppClient
add 05b6ad4a7 [MINOR] Change config versions
add 8d6c49aed [CELEBORN-1901] updated base docker image tag
add 464a3842e [CELEBORN-1899] Fix configuration bug in shuffle s3
add c1fb94d6e [CELEBORN-1910] Remove redundant synchronized of
isTerminated in ThreadUtils#sameThreadExecutorService
add b5fab4260 [CELEBORN-1822] Respond to RegisterShuffle with max epoch
PartitionLocation to avoid revive
add a5214e253 [CELEBORN-1906][CIP-14] Add CelebornInputStream to cppClient
add d96457909 [CELEBORN-1911] Move multipart-uploader to
multipart-uploader/multipart-uploader-s3 for extensibility
add 38f3bdd37 [CELEBORN-1909] Support pre-run static code blocks of
TransportMessages to improve performance of protobuf serialization
add 7571e10ad [CELEBORN-1894] Allow skipping already read chunks during
unreplicated shuffle read retried
add 717427553 [CELEBORN-1914] incWriteTime when ShuffleWriter invoke
pushGiantRecord
add 9e8f9f6b1 [CELEBORN-1900] docker images
add 7b73f5917 [CELEBORN-1678][FOLLOWUP] Update master and worker commands
in celeborn_cli.md
add 151fd3567 [CELEBORN-1923] Correct Celeborn available slots calculation
logic
add 192213daf [CELEBORN-1900][FOLLOWUP] fixed wrong CI parameter
add 4bacd1f21 [CELEBORN-1856] Support stage-rerun when read partition by
chunkOffsets when enable optimize skew partition read
add 0a97ca0aa [CELEBORN-1577][PHASE2] QuotaManager should support
interrupt shuffle
add 9bae3fbd5 [CELEBORN-1915][CIP-14] Add reader's ShuffleClient to
cppClient
add d5645a98d [CELEBORN-1900][FOLLOWUP] use github actions to login to
docker hub
add 5f298f5ce [CELEBORN-1190][FOLLOWUP] Use
-XepDisableWarningsInGeneratedCode to disable warnings for openapi-client module
add 15ea5d366 [CELEBORN-1930][CIP-12] Support HARD_SPLIT in PushMergedData
should handle congestion control NPE issue
add 3cf1802e7 [CELEBORN-1928][CIP-12] Support HARD_SPLIT in PushMergedData
should support handle older worker success response
add f1c963d0b [CELEBORN-1947] Reduce log for CelebornShuffleReader
sleeping before inputStream ready
add 56bf87d3c [CELEBORN-1543][FOLLOWUP] celeborn-flink-it project should
set FLINK_VERSION environment variable for HybridShuffleWordCountTest
add d8495e5b6 [CELEBORN-1532][HELM] Read log4j2 and metrics configurations
from file
add 303894223 [CELEBORN-1900][FOLLOWUP] push celeborn docker image
add 99ca4dffe [CELEBORN-1918] Add batchOpenStream time to fetch wait time
add 193dc6cf8 [CELEBORN-1929] Avoid unnecessary buffer loss to get better
buffer reusability
add 5adce2b40 [CELEBORN-1949] Add a labeler github action to triage PRs
add 1e30f159b [CELEBORN-1577][FOLLOWUP] Add UpdateResourceConsumptionTime
timer and prevent NPE if metrics not found
add 5e12b7d60 [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse
to prevent Spark driver network exhausted
add 6e5bd2403 [CELEBORN-1952][HELM] Define template helpers for
master/worker respectively
add 8dbbebc64 [CELEBORN-1954][HELM] Add a new value image.registry
add 951b626a9 [CELEBORN-1844][CIP-8] introduce tier writer proxy and
simplify partition data writer
add 621afaa5d [CELEBORN-1949][FOLLOWUP] Fix typo for `kind:deploy` label
add 0d923c37b [CELEBORN-1956] Forward GitHub discussion to ASF mailing list
add dfeaef135 [MINOR] Add spec link to JavaSerializer
add 30850a658 [CELEBORN-1932][CIP-14] Adapt java's serialization to
support cpp serialization for GetReducerFileGroup/Response
add 2b8f3520f [CELEBORN-1925] Support Flink 2.0
add 7d0ba7f9b [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension
Interface
add e5ccc9b62 [CELEBORN-81][FOLLOWUP] Correct scala test plugin args
add 91814602f [CELEBORN-1646][FOLLOWUP] DeviceMonitor should
notifyObserversOnError with CRITICAL_ERROR disk status for input/output error
add 529fd6e01 [MINOR] Avoid use `_$eq` in Scala file
add f92f9b84a [CELEBORN-1856][FOLLOWUP] Check `isCelebornSkewedShuffle `
before `registerCelebornSkewedShuffle` for stage rollback
add c2015a24e [CELEBORN-1953][HELM] Split podAnnotations into
master.annotations and worker.annotations
add a56867253 [CELEBORN-1874][CIP-14] Add CICD procedure in github action
for cppClient
add d93b927d2 [CELEBORN-1955][HELM] Split nodeSelector into
master.nodeSelector and worker.nodeSelector
add 44e2c793e [CELEBORN-1958][CIP-14] Add testsuite to test writing with
javaClient and reading with cppClient
add 99af30117 [CELEBORN-1970] Use StatusCode.fromValue instead of
Utils.toStatusCode
add df568945c [CELEBORN-1962][HELM] Split tolerations into
master.tolerations and worker.tolerations
add b8bb6c1e6 [CELEBORN-1931] use gather API for local flusher to optimize
write io pattern
add f1b71e3eb [CELEBORN-1436][FOLLOWUP] Add swagger editor links for
RESTful spec
add 95f0acfbd [CELEBORN-1961] Convert Resource.proto from Protocol Buffers
version 2 to version 3
add 553b3abc3 [CELEBORN-1969] Remove
celeborn.client.shuffle.mapPartition.split.enabled to enable shuffle partition
split at default for MapPartition
add fdc65fa3a [CELEBORN-1966] Added fix to get userinfo in celeborn
add 7be4d6b0a [CELEBORN-1951][HELM] Rename resources.{master,worker} to
{master,worker}.resoruces
add c9267ce9a [CELEBORN-1976] CommitHandler should use
celeborn.client.rpc.commitFiles.askTimeout for timeout of doParallelCommitFiles
add 5dc79a635 [CELEBORN-1983] Fix fetch fail not throw due to reach spark
maxTaskFailures
add 714722b5d [CELEBORN-1982] Slot Selection Perf Improvements
add 8b2f50ab1 [CELEBORN-1972][ ][HELM] Rename affinity.{master,worker} to
{master,worker}.affinity
add 725e3ce9f [CELEBORN-1980][HELM] Split environments into master.env and
worker.env
add ec4ea9773 [CELEBORN-1979] Change partition manager should respect the
`celeborn.storage.availableTypes`
add 937561f3c [CELEBORN-1919] Hardsplit batch tracking should be disabled
when pushing only a single replica
add e89ebe1e8 [CELEBORN-1977] Add help/type on prometheus exposed metrics
add 3db997496 [CELEBORN-1986][HELM] Rename priorityClass.{master,worker}
to {master,worker}.priorityClass
add f22e8ee7d [CELEBORN-1985][HELM] Add new values master.envFrom and
worker.envFrom
add a06362259 [MINOR][INFRA] Do not cancel GHA jobs on committing to
main/branch-* branches
add 76d861799 [CELEBORN-1981][HELM] Rename masterReplicas and
workerReplicas to master.replicas and worker.replicas
add a2110568f [CELEBORN-1501][FOLLOWUP] Add bytes written threshold for
top app consumption metrics
add 88661c2c6 [CELEBORN-1992] Ensure hadoop FS are not closed by hadoop
ShutdownHookManager
add 9dd6587d1 [CELEBORN-1912] Client should send heartbeat to worker for
processing heartbeat to avoid reading idleness of worker which enables heartbeat
add 54732c7b3 Update celeborn conf to add S3 in default and doc for policy
add fff97252a [CELEBORN-1760][FOLLOWUP] Remove redundant release on data
added in flushBuffer
add 74b41bb39 [CELEBORN-1319][CELEBORN-474][FOLLOWUP] PushState uses
JavaUtils#newConcurrentHashMap to speed up ConcurrentHashMap#computeIfAbsent
add b2c62d406 [CELEBORN-1987][HELM] Split dnsPolicy into master.dnsPolicy
and worker.dnsPolicy
add 7542adf70 [CELEBORN-1948] Fix the issue where replica may lose data
when HARD_SPLIT occurs during handlePushMergeData
add 06bcc207d [CELEBORN-1988][HELM] Split hostNetwork into
master.hostNetwork and worker.hostNetwork
add c9ca90c5e [CELEBORN-1965] Rely on all default hadoop providers for S3
auth
add 6ceadd3b2 [CELEBORN-1487][FOLLOWUP] Fix updateProduceBytes
add 3896249b9 [CELEBORN-1978][CIP-14] Add code style checking for cppClient
add 9ba54b39e [CELEBORN-1968] Publish metric for unreleased partition
location count when worker was gracefully shutdown
add a547cdaef [CELEBORN-1974] ApplicationId as metrics label should be
behind a config flag
add 045411ac3 [CELEBORN-1855] LifecycleManager return appshuffleId for non
barrier stage when fetch fail has been reported
add eb2449ced [CELEBORN-1989][HELM] Split securityContext into
master.podSecurityContext and worker.podSecurityContext
add 8e66ac833 [CELEBORN-1994] Introduce disruptor dependency to support
asynchronous logging of log4j2
add d03efcbdb [CELEBORN-1999] OpenStreamTime should use requestId to
record cost time
add a9ce4113a [CELEBORN-1998] RemoteShuffleEnvironment should not register
InputChannelMetrics repeatedly
add 88124d763 [CELEBORN-1691][FOLLOWUP] Fix the issue that upstream tasks
don't rerun and the current task still retry when failed to deserialize in flink
add e8ae23bc7 [CELEBORN-1960] Fix PauseSpentTime only append the interval
check time
add 062db5b38 [CELEBORN-1921][FOLLOWUP] Log the
GetReducerFileGroupResponse size to provide insights
add 4205f83da [CELEBORN-1995] Optimize memory usage for push failed batches
add d9984c9e0 [CELEBORN-1800] Introduce ApplicationTotalCount and
ApplicationFallbackCount metric to record the total and fallback count of
application
add ec62d924c [CELEBORN-2000] Ignore the getReducerFileGroup timeout
before shuffle stage end
add fd715b41a [CELEBORN-1993] CelebornConf introduces
celeborn.<module>.io.threads to specify number of threads used in the client
thread pool
add 90ece9665 [CELEBORN-2002][MASTER] Audit shuffle lifecycle in separate
log file
add a7e638706 [CELEBORN-2004] Filter empty partition before
createIntputStream
add 0b5a09a9f [CELEBORN-1896] delete data from failed to fetch shuffles
add 46d9d63e1 [CELEBORN-1916][FOLLOWUP] Improve Aliyun OSS support
add 45b94bf05 [CELEBORN-1996][HELM] Rename volumes.{master,worker} to
{master,worker}.volumes and {master.worker}.volumeMounts
add 082f0dd8c [CELEBORN-1775][FOLLOWUP] Improve logging around commit files
add 2a847ba90 [MINOR] Change some config version
add f7be34194 [CELEBORN-1902] Read client throws
PartitionConnectionException
add cbf4a145c Bump 0.7.0-SNAPSHOT
add d2befe033 [CELEBORN-2008] SlotsAllocator should select disks randomly
in RoundRobin mode
add 634343e20 [CELEBORN-2007] Reduce PartitionLocation memory usage
add a554261f3 [CELEBORN-2006] LifecycleManager should avoid parsing
shufflePartitionType every time
add 11ca1a785 [CELEBORN-2005] Introduce numBytesIn, numBytesOut,
numBytesInPerSecond, numBytesOutPerSecond metrics for
RemoteShuffleServiceFactory
add 81c3d91f7 [CELEBORN-2010][INFRA] Add release guide
add 637c42338 [CELEBORN-2010][FOLLOWUP] Fix svn staging dir
add 48fb71ee7 [CELEBORN-2011][INFRA] Add a script to simplify the process
of creating release notes
add d65ff5649 [CELEBORN-2012] Add license for http5
add 0dffcf6c9 [CELEBORN-2013] Upgrade scala binary version of spark-3.3,
spark-3.4, spark-3.5 profile to 2.13.8
add 14d721212 [MINOR][DOC] Correct configuration values in
slotsallocation
add 612464c69 [CELEBORN-2015] Retry IOException failures for RPC requests
add c83d498d9 [CELEBORN-1528][HELM] Use volume claim template to support
various storage backend
add aeac31f6f [CELEBORN-2009] Commit files request failure should exclude
worker in LifecycleManager
add b44730771 [CELEBORN-1413][FOLLOWUP] Bump spark 4.0 version to 4.0.0
add 3fb6d5b82 [CELEBORN-1413][FOLLOWUP] Support dependencies of spark-4.0
profile
add 0227a1ab2 [CELEBORN-1627][FOLLOWUP] Fix the issue where the case of
name affects the metrics dashboard
add 68a1db1e3 [CELEBORN-2005][FOLLOWUP] Introduce ShuffleMetricGroup for
numBytesIn, numBytesOut, numRecordsOut, numBytesInPerSecond,
numBytesOutPerSecond, numRecordsOutPerSecond metrics
add 5f58fb1e3 [CELEBORN-2020] Support http authentication for Celeborn CLI
add aceee64c7 [CELEBORN-2018] Support min number of workers selected for
shuffle
This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:
 * -- * -- B -- O -- O -- O   (52da4bd35)
            \
             N -- N -- N   refs/heads/CELEBORN-1768 (aceee64c7)
You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.
Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.
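The classification described above can be illustrated with a minimal sketch. It assumes a toy commit graph stored as a child-to-parents dictionary; the commit names (`B`, `O1`, `N1`, ...) and the helper functions are hypothetical and are not part of git or of the notification tooling. Revisions reachable only from the old tip correspond to the O revisions, and those reachable only from the new tip correspond to the N revisions.

```python
def ancestors(parents, tip):
    """Return every commit reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def classify_force_push(parents, old_tip, new_tip, other_tips=()):
    """Split revisions into add / omit / discard for a forced branch update."""
    old_set = ancestors(parents, old_tip)
    new_set = ancestors(parents, new_tip)
    # Commits still reachable from some other reference are not lost.
    kept_elsewhere = set()
    for tip in other_tips:
        kept_elsewhere |= ancestors(parents, tip)
    removed = old_set - new_set          # the O revisions
    return {
        "add": new_set - old_set,        # the N revisions
        "omit": removed & kept_elsewhere,
        "discard": removed - kept_elsewhere,
    }


# Hypothetical graph mirroring the diagram: two shared commits, base B,
# three old revisions (O1..O3) and three new revisions (N1..N3).
parents = {
    "A1": [], "A2": ["A1"], "B": ["A2"],
    "O1": ["B"], "O2": ["O1"], "O3": ["O2"],
    "N1": ["B"], "N2": ["N1"], "N3": ["N2"],
}
result = classify_force_push(parents, "O3", "N3")
```

With no other references, all three O revisions land in `discard`. Passing `other_tips=["O1"]` would move `O1` to `omit`, because another reference still reaches it.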
No new revisions were added by this update.
Summary of changes:
.asf.yaml | 1 +
.github/labeler.yml | 164 +++++
.github/workflows/cpp_integration.yml | 94 +++
.github/workflows/deps.yml | 42 +-
.github/workflows/docker-build.yml | 46 ++
.github/workflows/labeler.yml | 33 +
.github/workflows/license.yml | 3 +-
.github/workflows/maven.yml | 79 ++-
.github/workflows/sbt.yml | 74 ++-
.github/workflows/style.yml | 3 +-
.github/workflows/web_lint.yml | 2 +-
.gitignore | 1 +
CONTRIBUTING.md | 3 +
LICENSE | 1 +
LICENSE-binary | 8 +
README.md | 44 +-
assets/grafana/celeborn-dashboard.json | 465 ++++++++++++-
...eleborn-Optimize-Skew-Partitions-spark3_2.patch | 379 +++++++++++
...eleborn-Optimize-Skew-Partitions-spark3_3.patch | 376 +++++++++++
...eleborn-Optimize-Skew-Partitions-spark3_4.patch | 376 +++++++++++
...eleborn-Optimize-Skew-Partitions-spark3_5.patch | 388 +++++++++++
build/make-distribution.sh | 17 +
build/release/known_translations | 64 ++
build/release/pre_gen_release_notes.py | 236 +++++++
build/release/release.sh | 19 +-
build/release/release_utils.py | 159 +++++
charts/celeborn/ci/values.yaml | 180 +++---
charts/celeborn/files/conf/log4j2.xml | 85 +++
.../celeborn/files/conf/metrics.properties | 1 -
charts/celeborn/templates/_helpers.tpl | 144 +----
charts/celeborn/templates/configmap.yaml | 119 +---
charts/celeborn/templates/master/_helpers.tpl | 91 +++
charts/celeborn/templates/master/podmonitor.yaml | 7 +-
.../celeborn/templates/master/priorityclass.yaml | 8 +-
charts/celeborn/templates/master/service.yaml | 15 +-
charts/celeborn/templates/master/statefulset.yaml | 140 ++--
charts/celeborn/templates/worker/_helpers.tpl | 113 ++++
charts/celeborn/templates/worker/podmonitor.yaml | 7 +-
.../celeborn/templates/worker/priorityclass.yaml | 8 +-
charts/celeborn/templates/worker/service.yaml | 11 +-
charts/celeborn/templates/worker/statefulset.yaml | 145 +++--
charts/celeborn/tests/configmap_test.yaml | 2 -
.../celeborn/tests/master/priorityclass_test.yaml | 10 +-
charts/celeborn/tests/master/service_test.yaml | 6 +-
charts/celeborn/tests/master/statefulset_test.yaml | 312 +++++++--
.../celeborn/tests/worker/priorityclass_test.yaml | 10 +-
charts/celeborn/tests/worker/statefulset_test.yaml | 311 +++++++--
charts/celeborn/values.yaml | 486 +++++++++-----
cli/pom.xml | 7 +
.../apache/celeborn/cli/common/CommonOptions.scala | 17 +
.../apache/celeborn/cli/master/MasterOptions.scala | 5 +
.../celeborn/cli/master/MasterSubcommandImpl.scala | 42 +-
.../celeborn/cli/worker/WorkerSubcommandImpl.scala | 26 +-
.../celeborn/cli/TestCelebornCliCommands.scala | 34 +-
.../flink/AbstractRemoteShuffleInputGate.java | 24 +-
.../AbstractRemoteShuffleInputGateFactory.java | 18 +-
...bstractRemoteShuffleResultPartitionFactory.java | 21 +-
.../flink/AbstractRemoteShuffleServiceFactory.java | 9 +-
.../plugin/flink/RemoteShuffleEnvironment.java | 47 +-
.../flink/RemoteShuffleInputGateDelegation.java | 43 +-
.../celeborn/plugin/flink/RemoteShuffleMaster.java | 11 +-
.../plugin/flink/RemoteShuffleOutputGate.java | 45 +-
.../RemoteShuffleResultPartitionDelegation.java | 8 +
.../flink/client/FlinkShuffleClientImpl.java | 18 +-
.../celeborn/plugin/flink/utils/FlinkUtils.java | 2 -
.../metrics/dump/ShuffleQueryScopeInfo.java | 64 ++
.../metrics/groups/ShuffleIOMetricGroup.java | 92 +++
.../runtime/metrics/groups/ShuffleMetricGroup.java | 114 ++++
.../runtime/metrics/scope/ShuffleScopeFormat.java | 78 +++
.../flink/RemoteShuffleOutputGateSuiteJ.java | 57 +-
.../plugin/flink/RemoteShuffleInputGate.java | 21 +-
.../flink/RemoteShuffleInputGateFactory.java | 15 +-
.../flink/RemoteShuffleResultPartitionFactory.java | 7 +-
.../plugin/flink/RemoteShuffleServiceFactory.java | 11 +
.../flink/RemoteShuffleResultPartitionSuiteJ.java | 58 +-
.../plugin/flink/RemoteShuffleInputGate.java | 21 +-
.../flink/RemoteShuffleInputGateFactory.java | 15 +-
.../flink/RemoteShuffleResultPartitionFactory.java | 7 +-
.../plugin/flink/RemoteShuffleServiceFactory.java | 11 +
.../flink/RemoteShuffleResultPartitionSuiteJ.java | 22 +-
.../plugin/flink/RemoteShuffleInputGate.java | 21 +-
.../flink/RemoteShuffleInputGateFactory.java | 15 +-
.../flink/RemoteShuffleResultPartitionFactory.java | 7 +-
.../plugin/flink/RemoteShuffleServiceFactory.java | 11 +
.../flink/RemoteShuffleResultPartitionSuiteJ.java | 22 +-
.../plugin/flink/RemoteShuffleInputGate.java | 21 +-
.../flink/RemoteShuffleInputGateFactory.java | 15 +-
.../flink/RemoteShuffleResultPartitionFactory.java | 7 +-
.../plugin/flink/RemoteShuffleServiceFactory.java | 11 +
.../flink/RemoteShuffleResultPartitionSuiteJ.java | 22 +-
.../plugin/flink/RemoteShuffleInputGate.java | 21 +-
.../flink/RemoteShuffleInputGateFactory.java | 15 +-
.../flink/RemoteShuffleResultPartitionFactory.java | 7 +-
.../plugin/flink/RemoteShuffleServiceFactory.java | 9 +
.../flink/tiered/CelebornChannelBufferManager.java | 1 +
.../flink/tiered/CelebornChannelBufferReader.java | 13 +-
.../flink/RemoteShuffleResultPartitionSuiteJ.java | 22 +-
.../flink-2.0-shaded}/pom.xml | 6 +-
.../src/main/resources/META-INF/LICENSE | 0
.../src/main/resources/META-INF/NOTICE | 0
.../META-INF/licenses/LICENSE-protobuf.txt | 0
client-flink/{flink-1.17 => flink-2.0}/pom.xml | 4 +-
.../plugin/flink/RemoteShuffleInputGate.java | 21 +-
.../flink/RemoteShuffleInputGateFactory.java | 15 +-
.../plugin/flink/RemoteShuffleResultPartition.java | 0
.../flink/RemoteShuffleResultPartitionFactory.java | 7 +-
.../plugin/flink/RemoteShuffleServiceFactory.java | 9 +
.../flink/tiered/CelebornChannelBufferManager.java | 1 +
.../flink/tiered/CelebornChannelBufferReader.java | 13 +-
.../flink/tiered/CelebornTierConsumerAgent.java | 0
.../plugin/flink/tiered/CelebornTierFactory.java | 11 +-
.../flink/tiered/CelebornTierMasterAgent.java | 2 +-
.../flink/tiered/CelebornTierProducerAgent.java | 0
.../flink/tiered/TierShuffleDescriptorImpl.java | 0
.../plugin/flink/RemoteShuffleMasterSuiteJ.java | 2 +-
.../RemoteShuffleResultPartitionFactorySuiteJ.java | 0
.../flink/RemoteShuffleResultPartitionSuiteJ.java | 22 +-
.../flink/RemoteShuffleServiceFactorySuiteJ.java | 0
.../plugin/flink/ShuffleResourceTrackerSuiteJ.java | 0
.../tiered/CelebornTierMasterAgentSuiteJ.java | 11 +-
client-mr/mr/pom.xml | 4 +
.../task/reduce/CelebornShuffleConsumer.java | 1 +
.../shuffle/celeborn/ShuffleInMemorySorter.java | 3 +-
.../spark/shuffle/celeborn/SparkCommonUtils.java | 48 ++
.../celeborn/spark/FailedShuffleCleaner.scala | 93 +++
.../shuffle/celeborn/SparkCommonUtilsSuiteJ.java} | 15 +-
client-spark/spark-2-shaded/pom.xml | 1 +
.../shuffle/celeborn/SparkShuffleManager.java | 15 +
.../apache/spark/shuffle/celeborn/SparkUtils.java | 301 ++++++++-
.../spark/celeborn/ExceptionMakerHelper.scala | 2 +-
.../CelebornShuffleFallbackPolicyRunner.scala | 31 +-
.../shuffle/celeborn/CelebornShuffleReader.scala | 44 +-
...uffleFetchFailureReportTaskCleanListener.scala} | 18 +-
.../spark/shuffle/celeborn/BasedShuffleWriter.java | 239 -------
.../shuffle/celeborn/SortBasedShuffleWriter.java | 210 ------
.../apache/spark/shuffle/celeborn/SparkUtils.java | 322 ---------
client-spark/spark-3-columnar-common/pom.xml | 2 +-
client-spark/spark-3-columnar-shuffle/pom.xml | 12 +-
.../celeborn/ColumnarHashBasedShuffleWriter.java | 2 +-
client-spark/spark-3-shaded/pom.xml | 3 +-
client-spark/{spark-3-4 => spark-3}/pom.xml | 17 +-
.../shuffle/celeborn/CelebornPartitionUtil.java | 100 +++
.../shuffle/celeborn/CelebornShuffleDataIO.java | 0
.../shuffle/celeborn/HashBasedShuffleWriter.java | 214 +++++-
.../shuffle/celeborn/SortBasedShuffleWriter.java | 218 ++++---
.../shuffle/celeborn/SparkShuffleManager.java | 50 +-
.../apache/spark/shuffle/celeborn/SparkUtils.java | 636 ++++++++++++++++++
.../scala/org/apache/spark/SparkVersionUtil.scala | 0
.../spark/celeborn/ExceptionMakerHelper.scala | 0
.../CelebornShuffleFallbackPolicyRunner.scala | 28 +-
.../shuffle/celeborn/CelebornShuffleHandle.scala | 0
.../shuffle/celeborn/CelebornShuffleReader.scala | 268 +++++---
...uffleFetchFailureReportTaskCleanListener.scala} | 18 +-
.../celeborn/CelebornPartitionUtilSuiteJ.java | 189 ++++++
.../celeborn/CelebornShuffleWriterSuiteBase.java | 1 +
.../celeborn/HashBasedShuffleWriterSuiteJ.java | 0
.../spark/shuffle/celeborn/ShuffleManagerHook.java | 0
.../celeborn/SortBasedShuffleWriterSuiteJ.java | 17 +-
.../celeborn/TestCelebornShuffleManager.java | 0
.../src/test/resources/log4j.properties | 0
.../src/test/resources/log4j2-test.xml | 0
.../celeborn/CelebornShuffleManagerSuite.scala | 0
.../celeborn/CelebornShuffleReaderSuite.scala | 95 +++
.../execution/columnar/CelebornColumnType.scala | 2 +-
client-spark/spark-4-shaded/pom.xml | 2 +-
client-tez/tez/pom.xml | 4 +
.../apache/celeborn/client/CelebornTezReader.java | 2 +-
.../apache/celeborn/client/DummyShuffleClient.java | 22 +-
.../org/apache/celeborn/client/ShuffleClient.java | 41 +-
.../apache/celeborn/client/ShuffleClientImpl.java | 254 +++++---
.../celeborn/client/read/CelebornInputStream.java | 259 +++++++-
.../celeborn/client/read/DfsPartitionReader.java | 126 +++-
.../celeborn/client/read/LocalPartitionReader.java | 68 +-
.../celeborn/client/read/MetricsCallback.java | 2 +
.../celeborn/client/read/PartitionReader.java | 4 +
.../client/read/WorkerPartitionReader.java | 89 ++-
.../PartitionReaderCheckpointMetadata.java | 27 +-
.../celeborn/client/ApplicationHeartbeater.scala | 17 +-
.../org/apache/celeborn/client/ClientUtils.scala | 18 +
.../org/apache/celeborn/client/CommitManager.scala | 22 +-
.../apache/celeborn/client/LifecycleManager.scala | 262 ++++++--
.../celeborn/client/WorkerStatusTracker.scala | 2 +
.../celeborn/client/commit/CommitHandler.scala | 53 +-
.../client/commit/MapPartitionCommitHandler.scala | 11 +-
.../commit/ReducePartitionCommitHandler.scala | 135 +++-
.../celeborn/client/ShuffleClientSuiteJ.java | 226 ++++++-
.../celeborn/client/WithShuffleClientSuite.scala | 12 +-
.../celeborn/common/client/MasterClient.java | 1 +
.../apache/celeborn/common/meta/DiskFileInfo.java | 6 +-
.../celeborn/common/network/TransportContext.java | 3 +-
.../network/client/TransportClientFactory.java | 9 +-
.../protocol/{Heartbeat.java => SerdeVersion.java} | 34 +-
.../common/network/protocol/TransportMessage.java | 17 +-
.../network/protocol/TransportMessagesHelper.java | 48 ++
.../network/server/TransportChannelHandler.java | 17 +-
.../network/server/TransportRequestHandler.java | 29 +-
.../ConflictAvoidEventExecutorChooserFactory.java | 53 ++
.../celeborn/common/network/util/NettyUtils.java | 33 +-
.../common/network/util/TransportConf.java | 22 +-
.../common/protocol/PartitionLocation.java | 10 +-
.../celeborn/common/protocol/StorageInfo.java | 59 +-
.../common/protocol/message/StatusCode.java | 11 +-
.../common/write/LocationPushFailedBatches.java | 94 +++
.../apache/celeborn/common/write/PushState.java | 14 +
common/src/main/proto/TransportMessages.proto | 37 ++
.../org/apache/celeborn/common/CelebornConf.scala | 597 ++++++++++++++---
.../common/identity/DefaultIdentityProvider.scala | 3 +-
.../identity/HadoopBasedIdentityProvider.scala | 4 +-
.../common/identity/IdentityProvider.scala | 6 +-
.../apache/celeborn/common/meta/DeviceInfo.scala | 6 +-
.../common/meta/ShufflePartitionLocationInfo.scala | 12 +-
.../apache/celeborn/common/meta/WorkerInfo.scala | 39 +-
.../common/metrics/source/AbstractSource.scala | 244 +++++--
.../celeborn/common/metrics/source/Role.scala | 6 +-
.../common/protocol/message/ControlMessages.scala | 108 +++-
.../common/quota/ResourceConsumption.scala | 21 +
.../quota/{Quota.scala => StorageQuota.scala} | 6 +-
.../celeborn/common/rpc/RpcEndpointRef.scala | 43 ++
.../org/apache/celeborn/common/rpc/RpcEnv.scala | 16 +
.../celeborn/common/rpc/RpcMetricsTracker.scala | 4 +-
.../celeborn/common/rpc/netty/NettyRpcEnv.scala | 18 +-
.../common/serializer/JavaSerializer.scala | 34 +
.../celeborn/common/util/CelebornHadoopUtils.scala | 34 +-
.../org/apache/celeborn/common/util/KeyLock.scala | 70 ++
.../apache/celeborn/common/util/PbSerDeUtils.scala | 81 ++-
.../apache/celeborn/common/util/ThreadUtils.scala | 21 +-
.../org/apache/celeborn/common/util/Utils.scala | 174 ++---
.../common/network/util/TransportConfSuiteJ.java | 26 +
.../common/protocol/PartitionLocationSuiteJ.java | 7 +-
.../write/LocationPushFailedBatchesSuiteJ.java | 114 ++++
.../apache/celeborn/common/CelebornConfSuite.scala | 22 +-
.../identity/DefaultIdentityProviderSuite.scala | 52 ++
.../meta/ShufflePartitionLocationInfoSuite.scala | 6 +-
.../celeborn/common/meta/WorkerInfoSuite.scala | 8 +-
.../celeborn/common/util/PbSerDeUtilsTest.scala | 109 +++-
.../apache/celeborn/common/util/UtilsSuite.scala | 27 +-
conf/log4j2.xml.template | 20 +
cpp/README.md | 2 +-
cpp/celeborn/CMakeLists.txt | 6 +
cpp/celeborn/{protocol => client}/CMakeLists.txt | 22 +-
cpp/celeborn/client/ShuffleClient.cpp | 140 ++++
cpp/celeborn/client/ShuffleClient.h | 85 +++
cpp/celeborn/client/reader/CelebornInputStream.cpp | 214 ++++++
cpp/celeborn/client/reader/CelebornInputStream.h | 80 +++
.../client/reader/WorkerPartitionReader.cpp | 148 +++++
cpp/celeborn/client/reader/WorkerPartitionReader.h | 93 +++
.../{protocol => client}/tests/CMakeLists.txt | 17 +-
.../client/tests/WorkerPartitionReaderTest.cpp | 229 +++++++
cpp/celeborn/conf/CelebornConf.cpp | 17 +-
cpp/celeborn/conf/tests/BaseConfTest.cpp | 6 +-
cpp/celeborn/memory/ByteBuffer.cpp | 2 +-
cpp/celeborn/{protocol => network}/CMakeLists.txt | 18 +-
cpp/celeborn/network/FrameDecoder.h | 84 +++
cpp/celeborn/network/Message.cpp | 153 +++++
cpp/celeborn/network/Message.h | 240 +++++++
cpp/celeborn/network/MessageDispatcher.cpp | 257 ++++++++
cpp/celeborn/network/MessageDispatcher.h | 113 ++++
cpp/celeborn/network/NettyRpcEndpointRef.cpp | 76 +++
cpp/celeborn/network/NettyRpcEndpointRef.h | 61 ++
cpp/celeborn/network/TransportClient.cpp | 192 ++++++
cpp/celeborn/network/TransportClient.h | 138 ++++
.../{utils => network}/tests/CMakeLists.txt | 19 +-
cpp/celeborn/network/tests/FrameDecoderTest.cpp | 123 ++++
.../network/tests/MessageDispatcherTest.cpp | 200 ++++++
cpp/celeborn/network/tests/MessageTest.cpp | 155 +++++
.../network/tests/NettyRpcEndpointRefTest.cpp | 118 ++++
cpp/celeborn/network/tests/TransportClientTest.cpp | 265 ++++++++
cpp/celeborn/protocol/CMakeLists.txt | 4 +-
cpp/celeborn/protocol/ControlMessages.cpp | 182 ++++++
cpp/celeborn/protocol/ControlMessages.h | 107 +++
cpp/celeborn/protocol/PartitionLocation.cpp | 30 +
cpp/celeborn/protocol/PartitionLocation.h | 4 +
cpp/celeborn/protocol/tests/CMakeLists.txt | 3 +-
.../protocol/tests/ControlMessagesTest.cpp | 244 +++++++
cpp/celeborn/{utils => }/tests/CMakeLists.txt | 24 +-
cpp/celeborn/tests/DataSumWithReaderClient.cpp | 85 +++
cpp/celeborn/utils/CMakeLists.txt | 4 +-
cpp/celeborn/utils/CelebornUtils.cpp | 78 +++
cpp/celeborn/utils/CelebornUtils.h | 75 +++
cpp/celeborn/utils/tests/ExceptionTest.cpp | 24 +-
cpp/scripts/setup-ubuntu.sh | 3 +
dev/dependencies.sh | 10 +-
dev/deps/dependencies-client-flink-1.16 | 70 +-
dev/deps/dependencies-client-flink-1.17 | 70 +-
dev/deps/dependencies-client-flink-1.18 | 70 +-
dev/deps/dependencies-client-flink-1.19 | 70 +-
dev/deps/dependencies-client-flink-1.20 | 70 +-
dev/deps/dependencies-client-flink-2.0 | 80 +++
dev/deps/dependencies-client-mr | 72 +--
dev/deps/dependencies-client-spark-2.4 | 68 +-
dev/deps/dependencies-client-spark-3.0 | 68 +-
dev/deps/dependencies-client-spark-3.1 | 68 +-
dev/deps/dependencies-client-spark-3.2 | 68 +-
dev/deps/dependencies-client-spark-3.3 | 68 +-
dev/deps/dependencies-client-spark-3.4 | 68 +-
dev/deps/dependencies-client-spark-3.5 | 68 +-
dev/deps/dependencies-client-spark-4.0 | 80 +++
dev/deps/dependencies-client-tez | 72 +--
dev/deps/dependencies-server | 108 ++--
dev/reformat | 2 +
docker/Dockerfile | 2 +-
docs/README.md | 4 +-
docs/celeborn_cli.md | 29 +-
docs/configuration/client.md | 27 +-
docs/configuration/master.md | 19 +-
docs/configuration/metrics.md | 4 +-
docs/configuration/network.md | 9 +-
docs/configuration/quota.md | 19 +-
docs/configuration/worker.md | 23 +-
docs/deploy.md | 2 +-
docs/developers/integrate.md | 2 +-
docs/developers/jvmprofiler.md | 2 +-
docs/developers/release.md | 303 +++++++++
docs/developers/sbt.md | 7 +
docs/developers/shuffleclient.md | 2 +-
docs/developers/slotsallocation.md | 4 +-
docs/developers/workerexclusion.md | 2 +-
docs/migration.md | 16 +-
docs/monitoring.md | 43 +-
docs/restapi.md | 8 +-
master/pom.xml | 29 +
.../service/deploy/master/SlotsAllocator.java | 71 +-
.../master/clustermeta/AbstractMetaManager.java | 66 +-
.../master/clustermeta/IMetadataHandler.java | 2 +
.../clustermeta/SingleMasterMetaManager.java | 36 +-
.../master/clustermeta/ha/HAMasterMetaManager.java | 12 +-
.../deploy/master/clustermeta/ha/MetaHandler.java | 38 +-
master/src/main/proto/Resource.proto | 139 ++--
.../celeborn/service/deploy/master/Master.scala | 178 +++--
.../service/deploy/master/MasterSource.scala | 10 +-
.../deploy/master/audit/ShuffleAuditLogger.scala | 29 +-
.../deploy/master/http/api/ApiMasterResource.scala | 44 +-
.../master/http/api/v1/ApplicationResource.scala | 22 +-
.../deploy/master/http/api/v1/MasterResource.scala | 7 +-
.../deploy/master/http/api/v1/RatisResource.scala | 45 +-
.../master/http/api/v1/ShuffleResource.scala | 8 +-
.../deploy/master/http/api/v1/WorkerResource.scala | 46 +-
.../service/deploy/master/quota/QuotaManager.scala | 429 ++++++++++--
.../service/deploy/master/quota/QuotaStatus.scala | 18 +-
.../deploy/master/SlotsAllocatorJmhBenchmark.java | 87 +++
.../deploy/master/SlotsAllocatorSuiteJ.java | 262 ++++----
.../clustermeta/DefaultMetaSystemSuiteJ.java | 65 +-
.../ha/RatisMasterStatusSystemSuiteJ.java | 64 +-
...onfig-quota.yaml => dynamicConfig-quota-2.yaml} | 20 +-
...Config-tags.yaml => dynamicConfig-quota-3.yaml} | 23 +-
master/src/test/resources/dynamicConfig-quota.yaml | 17 +-
.../http/api/v1/ApiV1MasterResourceSuite.scala | 7 +-
.../deploy/master/quota/QuotaManagerSuite.scala | 661 ++++++++++++++++++-
.../{ => multipart-uploader-oss}/pom.xml | 25 +-
.../apache/celeborn/OssMultipartUploadHandler.java | 142 ++++
.../{ => multipart-uploader-s3}/pom.xml | 6 +-
.../apache/celeborn/S3MultipartUploadHandler.java | 43 +-
.../celeborn/rest/v1/master/ApplicationApi.java | 4 +-
.../apache/celeborn/rest/v1/master/WorkerApi.java | 68 ++
.../celeborn/rest/v1/master/invoker/ApiClient.java | 2 +-
.../rest/v1/master/invoker/Configuration.java | 2 +-
...availableInfoRequest.java => TopologyInfo.java} | 60 +-
.../{LoggerInfos.java => TopologyResponse.java} | 52 +-
.../apache/celeborn/rest/v1/model/WorkerData.java | 37 +-
.../celeborn/rest/v1/worker/invoker/ApiClient.java | 2 +-
.../rest/v1/worker/invoker/Configuration.java | 2 +-
.../src/main/openapi3/master_rest_v1.yaml | 42 +-
.../src/main/openapi3/worker_rest_v1.yaml | 5 +-
pom.xml | 132 +++-
project/CelebornBuild.scala | 167 +++--
service/pom.xml | 24 +-
.../common/service/config/DynamicConfig.java | 115 +++-
.../config/DynamicConfigServiceFactory.java | 2 +-
.../server/common/http/api/ApiBaseResource.scala | 45 +-
.../server/common/http/api/v1/ApiUtils.scala | 1 +
.../common/http/api/v1/ApiV1BaseResource.scala | 9 +-
.../server/common/http/api/v1/ConfResource.scala | 17 +-
.../server/common/http/api/v1/LoggerResource.scala | 10 +-
tests/flink-it/pom.xml | 6 -
.../apache/celeborn/tests/flink/FlinkVersion.java | 5 +-
.../apache/celeborn/tests/flink/SplitHelper.java | 38 +-
.../celeborn/tests/flink/WordCountHelper.java | 39 +-
.../celeborn/tests/flink/HeartbeatTest.scala | 4 +-
.../tests/flink/HybridShuffleWordCountTest.scala | 25 +-
.../apache/celeborn/tests/flink/SplitTest.scala | 8 +-
.../celeborn/tests/flink/WordCountTest.scala | 7 +-
tests/spark-it/pom.xml | 122 +++-
.../ChangePartitionManagerUpdateWorkersSuite.scala | 74 ++-
.../client/LifecycleManagerCommitFilesSuite.scala | 67 +-
.../client/LifecycleManagerReserveSlotsSuite.scala | 174 +++++
.../celeborn/tests/client/ShuffleClientSuite.scala | 18 +
.../spark/CelebornFetchFailureDiskCleanSuite.scala | 154 +++++
.../tests/spark/CelebornFetchFailureSuite.scala | 67 +-
.../celeborn/tests/spark/CelebornHashSuite.scala | 41 ++
.../celeborn/tests/spark/CelebornSortSuite.scala | 42 ++
...stSuite.scala => CelebornStageRerunSuite.scala} | 4 +-
.../celeborn/tests/spark/HeartbeatTest.scala | 4 +-
.../spark/LocationPushFailedBatchesSuite.scala | 91 +++
.../celeborn/tests/spark/SkewJoinSuite.scala | 114 ++--
.../celeborn/tests/spark/SparkTestBase.scala | 15 +-
.../fetch/failure/ShuffleReaderGetHooks.scala | 93 +++
.../tests/spark/memory/MemorySparkTestBase.scala | 1 +
.../spark/shuffle/celeborn/SparkUtilsSuite.scala | 220 +++++++
version.sbt | 2 +-
worker/pom.xml | 16 +-
.../UserCongestionControlContext.java | 2 +-
.../deploy/worker/memory/MemoryManager.java | 222 +++++--
.../deploy/worker/storage/ChunkStreamManager.java | 7 -
.../deploy/worker/storage/MapPartitionData.java | 14 +-
.../worker/storage/MapPartitionDataWriter.java | 355 ----------
.../deploy/worker/storage/PartitionDataWriter.java | 718 ++++-----------------
.../worker/storage/PartitionDataWriterContext.java | 71 +-
.../worker/storage/PartitionFilesSorter.java | 34 +-
.../worker/storage/ReducePartitionDataWriter.java | 118 ----
.../deploy/worker/storage/TierWriterHelper.java | 54 ++
.../segment/SegmentMapPartitionFileWriter.java | 169 -----
.../service/deploy/worker/Controller.scala | 157 +++--
.../service/deploy/worker/FetchHandler.scala | 70 +-
.../service/deploy/worker/PushDataHandler.scala | 168 ++---
.../celeborn/service/deploy/worker/Worker.scala | 57 +-
.../service/deploy/worker/WorkerSource.scala | 46 +-
.../deploy/worker/http/api/ApiWorkerResource.scala | 29 +-
.../worker/http/api/v1/ApplicationResource.scala | 7 +-
.../worker/http/api/v1/ShuffleResource.scala | 12 +-
.../deploy/worker/http/api/v1/WorkerResource.scala | 18 +-
.../deploy/worker/storage/CelebornFile.scala | 93 ---
.../deploy/worker/storage/CelebornFileProxy.scala | 59 --
.../deploy/worker/storage/DeviceMonitor.scala | 53 +-
.../service/deploy/worker/storage/FlushTask.scala | 34 +-
.../service/deploy/worker/storage/Flusher.scala | 19 +
.../worker/storage/PartitionMetaHandler.scala | 479 ++++++++++++++
.../deploy/worker/storage/StorageManager.scala | 197 +++---
.../deploy/worker/storage/StoragePolicy.scala | 185 ++++--
.../service/deploy/worker/storage/TierWriter.scala | 705 ++++++++++++++++++++
.../congestcontrol/TestCongestionController.java | 22 +
.../storage/PartitionDataWriterSuiteUtils.java | 107 ++-
.../local/DiskMapPartitionDataWriterSuiteJ.java | 48 +-
.../local/DiskPartitionFilesSorterSuiteJ.java | 1 -
.../local/DiskReducePartitionDataWriterSuiteJ.java | 417 ++++++------
.../memory/MemoryPartitionFilesSorterSuiteJ.java | 3 +-
.../MemoryReducePartitionDataWriterSuiteJ.java | 525 +++++++++------
.../celeborn/service/deploy/HeartbeatFeature.scala | 33 +-
.../deploy/cluster/JavaReadCppWriteTestBase.scala | 143 ++++
...NE.scala => JavaReadCppWriteTestWithNONE.scala} | 7 +-
...ase.scala => LocalReadByChunkOffsetsTest.scala} | 112 ++--
.../service/deploy/cluster/ReadWriteTestBase.scala | 3 +
...tBase.scala => ReadWriteTestWithFailures.scala} | 118 ++--
.../service/deploy/memory/MemoryManagerSuite.scala | 174 ++++-
.../service/deploy/worker/WorkerSuite.scala | 306 +++++++++
.../deploy/worker/storage/DeviceMonitorSuite.scala | 25 +-
.../worker/storage/PartitionMetaHandlerSuite.scala | 220 +++++++
.../deploy/worker/storage/StoragePolicySuite.scala | 83 ---
.../deploy/worker/storage/TierWriterSuite.scala | 312 +++++++++
.../deploy/worker/storage/WorkerSuite.scala | 164 -----
.../deploy/worker/storage/WriterUtils.scala | 58 ++
.../storage/storagePolicy/StoragePolicyCase1.scala | 116 ++++
.../storage/storagePolicy/StoragePolicyCase2.scala | 116 ++++
.../storage/storagePolicy/StoragePolicyCase3.scala | 119 ++++
.../storage/storagePolicy/StoragePolicyCase4.scala | 117 ++++
454 files changed, 25154 insertions(+), 7098 deletions(-)
create mode 100644 .github/labeler.yml
create mode 100644 .github/workflows/cpp_integration.yml
create mode 100644 .github/workflows/docker-build.yml
create mode 100644 .github/workflows/labeler.yml
create mode 100644 assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_2.patch
create mode 100644 assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_3.patch
create mode 100644 assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_4.patch
create mode 100644 assets/spark-patch/Celeborn-Optimize-Skew-Partitions-spark3_5.patch
create mode 100644 build/release/known_translations
create mode 100755 build/release/pre_gen_release_notes.py
create mode 100755 build/release/release_utils.py
create mode 100644 charts/celeborn/files/conf/log4j2.xml
copy conf/metrics.properties.template => charts/celeborn/files/conf/metrics.properties (91%)
create mode 100644 charts/celeborn/templates/master/_helpers.tpl
create mode 100644 charts/celeborn/templates/worker/_helpers.tpl
create mode 100644 client-flink/common/src/main/java/org/apache/flink/runtime/metrics/dump/ShuffleQueryScopeInfo.java
create mode 100644 client-flink/common/src/main/java/org/apache/flink/runtime/metrics/groups/ShuffleIOMetricGroup.java
create mode 100644 client-flink/common/src/main/java/org/apache/flink/runtime/metrics/groups/ShuffleMetricGroup.java
create mode 100644 client-flink/common/src/main/java/org/apache/flink/runtime/metrics/scope/ShuffleScopeFormat.java
copy {client-spark/spark-2-shaded => client-flink/flink-2.0-shaded}/pom.xml (96%)
copy client-flink/{flink-1.16-shaded => flink-2.0-shaded}/src/main/resources/META-INF/LICENSE (100%)
copy client-flink/{flink-1.16-shaded => flink-2.0-shaded}/src/main/resources/META-INF/NOTICE (100%)
copy client-flink/{flink-1.16-shaded => flink-2.0-shaded}/src/main/resources/META-INF/licenses/LICENSE-protobuf.txt (100%)
copy client-flink/{flink-1.17 => flink-2.0}/pom.xml (95%)
copy client-flink/{flink-1.19 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGate.java (90%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateFactory.java (84%)
copy client-flink/{flink-1.17 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartition.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionFactory.java (93%)
copy client-flink/{flink-1.19 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleServiceFactory.java (87%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/CelebornChannelBufferManager.java (99%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/CelebornChannelBufferReader.java (96%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/CelebornTierConsumerAgent.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/CelebornTierFactory.java (95%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/CelebornTierMasterAgent.java (99%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/CelebornTierProducerAgent.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/main/java/org/apache/celeborn/plugin/flink/tiered/TierShuffleDescriptorImpl.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleMasterSuiteJ.java (99%)
copy client-flink/{flink-1.16 => flink-2.0}/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionFactorySuiteJ.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionSuiteJ.java (93%)
copy client-flink/{flink-1.17 => flink-2.0}/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleServiceFactorySuiteJ.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/test/java/org/apache/celeborn/plugin/flink/ShuffleResourceTrackerSuiteJ.java (100%)
copy client-flink/{flink-1.20 => flink-2.0}/src/test/java/org/apache/celeborn/plugin/flink/tiered/CelebornTierMasterAgentSuiteJ.java (96%)
create mode 100644 client-spark/common/src/main/scala/org/apache/celeborn/spark/FailedShuffleCleaner.scala
copy client-spark/common/src/{main/java/org/apache/spark/shuffle/celeborn/OpenByteArrayOutputStream.java => test/java/org/apache/spark/shuffle/celeborn/SparkCommonUtilsSuiteJ.java} (74%)
copy client-spark/{common/src/main/java/org/apache/spark/shuffle/celeborn/OpenByteArrayOutputStream.java => spark-2/src/main/scala/org/apache/spark/shuffle/celeborn/ShuffleFetchFailureReportTaskCleanListener.scala} (65%)
delete mode 100644 client-spark/spark-3-4/src/main/java/org/apache/spark/shuffle/celeborn/BasedShuffleWriter.java
delete mode 100644 client-spark/spark-3-4/src/main/java/org/apache/spark/shuffle/celeborn/SortBasedShuffleWriter.java
delete mode 100644 client-spark/spark-3-4/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java
rename client-spark/{spark-3-4 => spark-3}/pom.xml (87%)
create mode 100644 client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/CelebornPartitionUtil.java
rename client-spark/{spark-3-4 => spark-3}/src/main/java/org/apache/spark/shuffle/celeborn/CelebornShuffleDataIO.java (100%)
rename client-spark/{spark-3-4 => spark-3}/src/main/java/org/apache/spark/shuffle/celeborn/HashBasedShuffleWriter.java (56%)
copy client-spark/{spark-2 => spark-3}/src/main/java/org/apache/spark/shuffle/celeborn/SortBasedShuffleWriter.java (71%)
rename client-spark/{spark-3-4 => spark-3}/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java (87%)
create mode 100644 client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java
rename client-spark/{spark-3-4 => spark-3}/src/main/scala/org/apache/spark/SparkVersionUtil.scala (100%)
rename client-spark/{spark-3-4 => spark-3}/src/main/scala/org/apache/spark/celeborn/ExceptionMakerHelper.scala (100%)
rename client-spark/{spark-3-4 => spark-3}/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleFallbackPolicyRunner.scala (76%)
rename client-spark/{spark-3-4 => spark-3}/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleHandle.scala (100%)
rename client-spark/{spark-3-4 => spark-3}/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala (60%)
copy client-spark/{common/src/main/java/org/apache/spark/shuffle/celeborn/OpenByteArrayOutputStream.java => spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/ShuffleFetchFailureReportTaskCleanListener.scala} (65%)
create mode 100644 client-spark/spark-3/src/test/java/org/apache/spark/shuffle/celeborn/CelebornPartitionUtilSuiteJ.java
rename client-spark/{spark-3-4 => spark-3}/src/test/java/org/apache/spark/shuffle/celeborn/CelebornShuffleWriterSuiteBase.java (99%)
rename client-spark/{spark-3-4 => spark-3}/src/test/java/org/apache/spark/shuffle/celeborn/HashBasedShuffleWriterSuiteJ.java (100%)
rename client-spark/{spark-3-4 => spark-3}/src/test/java/org/apache/spark/shuffle/celeborn/ShuffleManagerHook.java (100%)
rename client-spark/{spark-3-4 => spark-3}/src/test/java/org/apache/spark/shuffle/celeborn/SortBasedShuffleWriterSuiteJ.java (95%)
rename client-spark/{spark-3-4 => spark-3}/src/test/java/org/apache/spark/shuffle/celeborn/TestCelebornShuffleManager.java (100%)
rename client-spark/{spark-3-4 => spark-3}/src/test/resources/log4j.properties (100%)
rename client-spark/{spark-3-4 => spark-3}/src/test/resources/log4j2-test.xml (100%)
rename client-spark/{spark-3-4 => spark-3}/src/test/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleManagerSuite.scala (100%)
create mode 100644 client-spark/spark-3/src/test/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReaderSuite.scala
rename client/src/{test => main}/java/org/apache/celeborn/client/DummyShuffleClient.java (90%)
copy cli/src/main/scala/org/apache/celeborn/cli/common/CliLogging.scala => client/src/main/java/org/apache/celeborn/client/read/checkpoint/PartitionReaderCheckpointMetadata.java (56%)
copy common/src/main/java/org/apache/celeborn/common/network/protocol/{Heartbeat.java => SerdeVersion.java} (54%)
create mode 100644 common/src/main/java/org/apache/celeborn/common/network/protocol/TransportMessagesHelper.java
create mode 100644 common/src/main/java/org/apache/celeborn/common/network/util/ConflictAvoidEventExecutorChooserFactory.java
create mode 100644 common/src/main/java/org/apache/celeborn/common/write/LocationPushFailedBatches.java
rename common/src/main/scala/org/apache/celeborn/common/quota/{Quota.scala => StorageQuota.scala} (90%)
create mode 100644 common/src/main/scala/org/apache/celeborn/common/util/KeyLock.scala
create mode 100644 common/src/test/java/org/apache/celeborn/common/write/LocationPushFailedBatchesSuiteJ.java
create mode 100644 common/src/test/scala/org/apache/celeborn/common/identity/DefaultIdentityProviderSuite.scala
copy cpp/celeborn/{protocol => client}/CMakeLists.txt (76%)
create mode 100644 cpp/celeborn/client/ShuffleClient.cpp
create mode 100644 cpp/celeborn/client/ShuffleClient.h
create mode 100644 cpp/celeborn/client/reader/CelebornInputStream.cpp
create mode 100644 cpp/celeborn/client/reader/CelebornInputStream.h
create mode 100644 cpp/celeborn/client/reader/WorkerPartitionReader.cpp
create mode 100644 cpp/celeborn/client/reader/WorkerPartitionReader.h
copy cpp/celeborn/{protocol => client}/tests/CMakeLists.txt (74%)
create mode 100644 cpp/celeborn/client/tests/WorkerPartitionReaderTest.cpp
copy cpp/celeborn/{protocol => network}/CMakeLists.txt (77%)
create mode 100644 cpp/celeborn/network/FrameDecoder.h
create mode 100644 cpp/celeborn/network/Message.cpp
create mode 100644 cpp/celeborn/network/Message.h
create mode 100644 cpp/celeborn/network/MessageDispatcher.cpp
create mode 100644 cpp/celeborn/network/MessageDispatcher.h
create mode 100644 cpp/celeborn/network/NettyRpcEndpointRef.cpp
create mode 100644 cpp/celeborn/network/NettyRpcEndpointRef.h
create mode 100644 cpp/celeborn/network/TransportClient.cpp
create mode 100644 cpp/celeborn/network/TransportClient.h
copy cpp/celeborn/{utils => network}/tests/CMakeLists.txt (71%)
create mode 100644 cpp/celeborn/network/tests/FrameDecoderTest.cpp
create mode 100644 cpp/celeborn/network/tests/MessageDispatcherTest.cpp
create mode 100644 cpp/celeborn/network/tests/MessageTest.cpp
create mode 100644 cpp/celeborn/network/tests/NettyRpcEndpointRefTest.cpp
create mode 100644 cpp/celeborn/network/tests/TransportClientTest.cpp
create mode 100644 cpp/celeborn/protocol/ControlMessages.cpp
create mode 100644 cpp/celeborn/protocol/ControlMessages.h
create mode 100644 cpp/celeborn/protocol/tests/ControlMessagesTest.cpp
copy cpp/celeborn/{utils => }/tests/CMakeLists.txt (73%)
create mode 100644 cpp/celeborn/tests/DataSumWithReaderClient.cpp
create mode 100644 cpp/celeborn/utils/CelebornUtils.cpp
create mode 100644 dev/deps/dependencies-client-flink-2.0
create mode 100644 dev/deps/dependencies-client-spark-4.0
create mode 100644 docs/developers/release.md
copy service/src/main/scala/org/apache/celeborn/server/common/http/RestAuditLogger.scala => master/src/main/scala/org/apache/celeborn/service/deploy/master/audit/ShuffleAuditLogger.scala (55%)
copy client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/buffer/BufferRecycler.java => master/src/main/scala/org/apache/celeborn/service/deploy/master/quota/QuotaStatus.scala (59%)
create mode 100644 master/src/test/java/org/apache/celeborn/service/deploy/master/SlotsAllocatorJmhBenchmark.java
copy master/src/test/resources/{dynamicConfig-quota.yaml => dynamicConfig-quota-2.yaml} (65%)
copy master/src/test/resources/{dynamicConfig-tags.yaml => dynamicConfig-quota-3.yaml} (68%)
copy multipart-uploader/{ => multipart-uploader-oss}/pom.xml (78%)
create mode 100644 multipart-uploader/multipart-uploader-oss/src/main/java/org/apache/celeborn/OssMultipartUploadHandler.java
rename multipart-uploader/{ => multipart-uploader-s3}/pom.xml (92%)
rename multipart-uploader/{ => multipart-uploader-s3}/src/main/java/org/apache/celeborn/S3MultipartUploadHandler.java (84%)
copy openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/{RemoveWorkersUnavailableInfoRequest.java => TopologyInfo.java} (64%)
copy openapi/openapi-client/src/main/java/org/apache/celeborn/rest/v1/model/{LoggerInfos.java => TopologyResponse.java} (65%)
create mode 100644 tests/spark-it/src/test/scala/org/apache/celeborn/tests/client/LifecycleManagerReserveSlotsSuite.scala
create mode 100644 tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/CelebornFetchFailureDiskCleanSuite.scala
rename tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/{CelebornShuffleLostSuite.scala => CelebornStageRerunSuite.scala} (96%)
create mode 100644 tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/LocationPushFailedBatchesSuite.scala
create mode 100644 tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/fetch/failure/ShuffleReaderGetHooks.scala
create mode 100644 tests/spark-it/src/test/scala/org/apache/spark/shuffle/celeborn/SparkUtilsSuite.scala
delete mode 100644 worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/MapPartitionDataWriter.java
delete mode 100644 worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ReducePartitionDataWriter.java
create mode 100644 worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/TierWriterHelper.java
delete mode 100644 worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/segment/SegmentMapPartitionFileWriter.java
delete mode 100644 worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/CelebornFile.scala
delete mode 100644 worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/CelebornFileProxy.scala
create mode 100644 worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/PartitionMetaHandler.scala
create mode 100644 worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/TierWriter.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/cluster/JavaReadCppWriteTestBase.scala
copy worker/src/test/scala/org/apache/celeborn/service/deploy/cluster/{ClusterReadWriteTestWithNONE.scala => JavaReadCppWriteTestWithNONE.scala} (85%)
copy worker/src/test/scala/org/apache/celeborn/service/deploy/cluster/{ReadWriteTestBase.scala => LocalReadByChunkOffsetsTest.scala} (54%)
copy worker/src/test/scala/org/apache/celeborn/service/deploy/cluster/{ReadWriteTestBase.scala => ReadWriteTestWithFailures.scala} (50%)
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/WorkerSuite.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/PartitionMetaHandlerSuite.scala
delete mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/StoragePolicySuite.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/TierWriterSuite.scala
delete mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/WorkerSuite.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/WriterUtils.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/storagePolicy/StoragePolicyCase1.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/storagePolicy/StoragePolicyCase2.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/storagePolicy/StoragePolicyCase3.scala
create mode 100644 worker/src/test/scala/org/apache/celeborn/service/deploy/worker/storage/storagePolicy/StoragePolicyCase4.scala