This is an automated email from the ASF dual-hosted git repository.
enricomi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git
The following commit(s) were added to refs/heads/master by this push:
new 6f9e493 Add release notes for v0.9.0 (#75)
6f9e493 is described below
commit 6f9e4930909400e5814d113b389bb8d3390f574c
Author: Enrico Minack <[email protected]>
AuthorDate: Tue Jun 18 12:10:28 2024 +0200
Add release notes for v0.9.0 (#75)
---
download/release-notes-0.9.0.md | 255 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 255 insertions(+)
diff --git a/download/release-notes-0.9.0.md b/download/release-notes-0.9.0.md
new file mode 100644
index 0000000..fe44b34
--- /dev/null
+++ b/download/release-notes-0.9.0.md
@@ -0,0 +1,255 @@
+---
+title: Release Notes 0.9.0
+sidebar_position: 995
+---
+
+# Uniffle Release 0.9.0
+
+## Highlight
+
+- Introduce dashboard.
+- Introduce rust-based shuffle server.
+- Add support for Spark 3.5.
+- The Netty mode for data transport is now production-ready.
+- Reduce block id layout limitations and simplify layout configuration for
Spark.
+
+## ChangeLog
+
+* [#1751][0.9] improvement: support gluten (#1753)
+* [#1764] fix(client): Fix timeout time unit for unregister requests (#1766)
+* [#1149] fix: GC logs in JDK 11 do not include date and time stamps. (#1240)
+* [#1675][FOLLOWUP] fix(test): Fix various flaky tests (#1730)
+* [MINOR] fix: Update outdated config: rss.writer.send.check.timeout ->
rss.client.send.check.timeout.ms (#1734)
+* [#1721] fix(coordinator): ClassCastException of boolean->String with yaml
style remote client conf (#1722)
+* [#1673] fix(K8S): Fix the deployment of stable version K8S cluster (#1694)
+* [#1675][FOLLOWUP] fix(test): Fix flaky tests which may cause port conflicts
(#1696)
+* [MINOR] fix(typo): Correct the removeShuffle method name (#1697)
+* [MINOR] docs: modify the default value of
`rss.coordinator.select.partition.strategy` in docs (#1692)
+* [#1680] improvement(server): Remove partial HDFS files written by the
server itself for expired apps (#1681)
+* [#1675] fix(test): Fix tests which may be flaky on different machines (#1676)
+* [#1684] fix(server): use the diskSize obtained from periodic check to
determine whether is writable (#1685)
+* [#1678] fix(server): disk size leak on removing resources by AppPurgeEvent
(#1679) (#1689)
+* [#1657] build: Add license information after version 0.9.0 (#1671)
+* [MINOR] chore(rust): disable flaky test of local_store_test (#1674)
+* [#1459][FOLLOWUP] fix(server): Fix the issue of log variable printing (#1672)
+* [#1459][FOLLOWUP] improvement(server): Print an error log when an event is
dropped (#1643)
+* [#1341] fix(mr): Fix MR Combiner ArrayIndexOutOfBoundsException Bug. (#1666)
+* [#378][FOLLOWUP] fix(server): Fix huge_partition_num metric (#1669)
+* [#1662] fix(test): Fix Netty related flaky tests (#1663)
+* [#1629] fix(operator): Support parsing NaN float value in metrics (#1630)
+* [#1634] fix(server): remove app folder if app is expired (#1635)
+* [MINOR] chore(rust): disable flaky test of test_ticket_manager (#1637)
+* [#1596][FOLLOWUP] fix(netty): Send failed responses only when the channel is
writable (#1641)
+* [#1626] fix(server): Remove the meaningless eventOfUnderStorageManagers
cache (#1627)
+* [#1631] fix(server): ShuffleTaskInfo may leak when app is removed. (#1632)
+* [#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after
reassign (#1612)
+* [#1608][part-2] fix(spark): avoid releasing block in advance when enable
block resend (#1610)
+* [#1606] feat(client): Add client retry mechanism for NO_BUFFER when reading
data(memory/local/index) (#1616)
+* [#1608][part-1] fix(spark): Only share the replacement servers for faulty
servers in one stage (#1609)
+* [#1373][FOLLOWUP] fix(spark): shuffle manager rpc service invalid when
partition data reassign is enabled (#1583)
+* [#1596] fix(netty): Use a ChannelFutureListener callback mechanism to
release readMemory (#1605)
+* [#1598] fix(server) Fix inaccurate used_direct_memory_size metric (#1599)
+* [#1472][FOLLOWUP] improvement(server): Release memory more accurately when
failing to cache shuffle data (#1597)
+* [MINOR] refactor: Calling lock() method outside try block to avoid
unnecessary errors (#1590)
+* [#1591] feat(spark): Support Spark 3.5.1 (#1592)
+* [#1586] improvement(netty): Allow Netty Worker thread pool size to
dynamically adapt to the number of processor cores (#1587)
+* [#1588] improvement(server): Add exception handling for the thread pool when
flushing events (#1589)
+* [#1576] feat(doc): server deploy guide without hadoop-home env (#1577)
+* [#1571] fix(server): Memory may leak when `EventInvalidException` occurs
(#1574)
+* [#1373][FOLLOWUP] fix(spark): incorrect partition id type (#1582)
+* [#1373][FOLLOWUP] fix(spark3):Add client type when request shuffle
assignment (#1580)
+* build(deps): bump google.golang.org/protobuf from 1.28.0 to 1.33.0 (#1575)
+* [#1554] feat(spark): Fetch dynamic client conf as early as possible (#1557)
+* [#1572] fix(spark): Exceptions might be discarded when spilling buffers
(#1573)
+* [#1564] fix(server): disk health check invalid when hang (#1568)
+* [#731][FOLLOWUP] feat(Spark): Configure blockIdLayout for Spark based on max
partitions (#1566)
+* [#1567] fix(spark): Let Spark use its own NettyUtils (#1565)
+* [#1569] fix(rust): flaky test for test_ticket_manager (#1570)
+* [MINOR] improvement(test): A better computation logic for
WriteAndReadMetricsTest without using reflection (#1563)
+* [#731] feat(spark): Make blockid layout configurable for Spark clients
(#1528)
+* [#808] improvement(spark): Verify the number of written records to ensure
data correctness (#1558)
+* [MINOR] improvement(client): Override getClientInfo method in
ShuffleServerGrpcNettyClient and remove unused getDesc method (#1559)
+* [#1552] improvement: Migrate from log4j1 to log4j2 (#1553)
+* [#1472][part-6] FOLLOWUP: Fix Netty transport time when sending shuffle data
requests (#1551)
+* [#134][FOLLOWUP] improvement(spark2): Use taskId and attemptNo as
taskAttemptId (#1544)
+* [#1549] fix(common): Uniformly throw RssException for external callers
(#1550)
+* [MINOR] test: Use sensible partition ids in ShuffleReadClientImplTest (#1545)
+* [#1546] fix(spark): NPE could happen before uncompressing after #1360 (#1547)
+* feat(docker): Add example docker compose Uniffle/Spark cluster (#1532)
+* [#1472][part-6] fix(netty): Make UTs truly test Netty mode (#1540)
+* [MINOR] improvement(tez): Only invoking LOG.debug when LOG.isDebugEnabled is
true (#1541)
+* [#1459] fix(server): Memory leak for exceptional scenarios when flushing
events (#1537)
+* [#1472] fix(client): IllegalReferenceCountException for
clientReadHandler.readShuffleData (#1536)
+* [#1472][part-5] Use UnpooledByteBufAllocator to fix inaccurate usedMemory
issue causing OOM (#1534)
+* [MINOR] refactor(common): Move blockId bit logic into common class (#1527)
+* [#1373][part-1] feat(spark): partition write to multi servers leveraging
from reassignment mechanism (#1445)
+* [MINOR] Update dashboard pom.xml to take arguments for node and npm download
locations (#1530)
+* [#1316] improvement(spark): detect OutputTracker API version via Spark
version (#1317)
+* [#134] improvement(spark3): Use taskId and attemptNo as taskAttemptId (#1529)
+* [MINOR] feat(build): Allow to build distribution without some modules (#1525)
+* [#1407] fix(rust): use grpc runtime worker threads and adjust default
runtime config (#1517)
+* [#1407] feat(rust): fix + add total grpc request metrics (#1516)
+* [#1407] chore(rust): add cpu profile doc (#1515)
+* [#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks
instead of reallocating it (#1521)
+* [MINOR] fix(CI): Improve dashboard across the CI (#1526)
+* [#1472][part-3] fix(client): Fix occasional IllegalReferenceCountException
issues in extremely rare scenarios (#1522)
+* [MINOR] fix(pom): Add missing shuffle-server dependencies to work with -Ptez
+* [#1472][part-4] feature(server): Add metrics for Netty's pinnedDirectMemory
and usedDirectMemory (#1524)
+* [#1472][part-1] fix(server): Upgrade Netty and GRPC (#1520)
+* [MINOR] fix(deploy): Fix invocation of kubernetes bash scripts (#1513)
+* [#1476] feat(rust): Provide dedicated unregister app rpc interface (#1511)
+* [#1476] feat(spark): Provide dedicated unregister app rpc interface (#1510)
+* [MINOR] improvement(CI): Rework build and rust workflow events (#1508)
+* [#1407] fix(rust): drop events and release memory when errors happened
(#1509)
+* [#1267][FOLLOWUP] improvement(client): INFO log level should be used in
RetryUtils (#1500)
+* [MINOR] feat(CI): Report test results in github comments (#1506)
+* [#1407] fix(rust): return error when getting data from hdfs by client (#1507)
+* [#1501] fix(server): storage selection cache accidentally deleted when
clearing stage level data. (#1505)
+* [#1407] fix(rust): dont panic when no available local disks (#1504)
+* fix(rust): avoid checking storage type in runtime (#1503)
+* [MINOR] build: Move dashboard module into profile and disable it by default
(#1498)
+* [#1497] improvement(spark): flushing buffer if the memoryUsed of the first
record of `WriterBuffer` larger than bufferSize (#1485)
+* [MINOR] improvement(test): Identify duplicate blocks in
TestUtils.validateResult (#1495)
+* [MINOR] fix: Get and increment ATOMIC_LONG in that order everywhere (#1496)
+* [MINOR] docs: Improve comment on blockId structure (#1492)
+* [MINOR] fix(server): Assert actual number of bitmaps matches bitNum (#1493)
+* [#1490] improvement(spark3): Disable dynamic allocation shuffle tracking by
default (#1491)
+* [#1407] feat(rust): support more metrics about disk and topN data size
(#1488)
+* [#1407] feat(rust): support multiple spill policies and simplify hdfs config
(#1487)
+* [#1356] feat(server): improve expired buffers metric and log (#1469)
+* [#1464][FOLLOWUP] improvement(spark): print abnormal shuffle servers that
blocks fail to send (#1473)
+* [#1467] feat(server): introduce total hdfs write data size for huge
partition (#1468)
+* [#1355] fix(client): Netty client will leak when decoding responses (#1455)
+* [#1462] fix(server): Memory may leak when flushQueue is full (#1463)
+* [#1466] feat(server): introduce the JvmPauseMonitor to detect the gc pause
(#1470)
+* [#1459] improvement(server): refactor DefaultFlushEventHandler and support
event retry into pending queue (#1461)
+* [#1464] improvement(spark): print abnormal shuffle servers that blocks fail
to send (#1465)
+* [#1456] improvement(client): Better exception handling when calling
requireBuffer using GRPC (#1457)
+* [#1428] fix(server): fallback invalid when local storage can't write (#1429)
+* [#1453] improvement: Force to use the UNIX line ending when using
spotless-maven-plugin (#1454)
+* [#1447] feat(client): Introduce configurations to control default behavior
of RPC client (#1448)
+* [#1267] improvement(client): throw the detailed stacktrace when exceptions
happened (#1411)
+* [#1189][FOLLOWUP] fix(server): Start NettyDirectMemoryTracker. (#1432)
+* [#333] feat(server): expose metrics of TopN app bytes in one shuffle server
(#1400)
+* [#1433] fix(server): Race conditions with ShuffleServer state (#1434)
+* [MINOR] refactor: avoid unnecessary bitmap clone and AND (#1442)
+* [#532] fix: spotBugs of SC_START_IN_CTOR (#1440)
+* [#1435] improvement: Improve log4j settings to avoid annoying messages
(#1436)
+* [MINOR] refactor: Avoid unnecessary recursion (#1441)
+* [#1407] feat(rust): refactor localfile store to speed up writing (#1422)
+* [#1416] feat(spark): support custom hadoop config in client side (#1417)
+* [#1119] improvement(client): Explicitly throw
`BUFFER_LIMIT_OF_HUGE_PARTITION` (#1425)
+* [#974] fix(coordinator): Dynamic remote storage conf invalid for
`LegacyClientConfParser` (#1424)
+* [#1420] fix(client): reportShuffleWriteFailure failed because of
IndexOutOfBoundsException (#1421)
+* [#1356] improvement: add metric of total expired pre-allocated buffers
(#1412)
+* [#1414] feat(rust): introduce native hdfs client (#1415)
+* [#1024] improvement(tez): Optimize user switch to shuffle mode local/remote.
(#1397)
+* [#1403] fix(client): RSS client configurations are not working. (#1404)
+* [#1409] fix(client): Netty Epoll is unavailable for the RSS Client. (#1410)
+* [#1407] improvement(rust): Critical bug fix of getting blockIds and some
optimization (#1408)
+* [#825][FOLLOWUP] fix(spark): Fix without returning an exception. (#1402)
+* [#1385] improvement: Improve log4j appender layout pattern (#1386)
+* [#851] improvement: Add a similar util method like ThreadUtils.parmap in the
Spark (#1396)
+* [#363] improvement(server): Make the coordinator client managed by
CoordinatorClientFactory singleton (#1377)
+* [#1391] fix(server): Direct memory may leak in exceptional scenarios in
shuffle server. (#1392)
+* [#1157] fix(tez): Container not exit because shuffle client is not closed
+* [#460] improvement: Exit on OutOfMemoryError (#1390)
+* [#1387] improvement: compatibility with jdk8 when call
JavaUtils.newConcurrentMap (#1389)
+* [#1369] feat: Provide distribution with Hadoop dependencies (#1379)
+* [#1383] [DOCS] Improve Netty's documentation (#1384)
+* [#1358] fix(spark): pre-check bytebuffer whether is direct before uncompress
(#1360)
+* [#1364] feat(client): introduce option to control whether to use local
hadoop conf (#1370)
+* [MINOR] chore(client): fix the incorrect partitionId (#1376)
+* [#1189] feat(server): Add netty used direct memory size metric (#1363)
+* [#960] fix(dashboard): simplify dependency and correct the startup script
(#1347)
+* [#1348] improvement(metrics): Unify tags generation for shuffle-server
metrics reporter (#1349)
+* [MINOR] chore: fix kubernetes ci pipeline (#1368)
+* [MINOR] fix(spark): Fix NPE for ShuffleWriteClientImpl.unregisterShuffle
(#1367)
+* [#960][part-4] feat(dashboard): Fix some display bugs and optimize the
display format. (#1326)
+* [#1267] fix(client): fast fail without retry when oom occurs (#1344)
+* [#1361] feat(netty): add netty metrics into reporter (#1362)
+* [#1335] fix(server)(netty): release bytebuf explicitly when requiredId is
expired or cache failed (#1357)
+* [MINOR] chore(client): Specify name for data transfer thread pool (#1353)
+* [#1319] fix(server): Add shaded com.google.guava:failureaccess dependency to
prevent NoClassDefFoundError (#1352)
+* [MINOR] improvement: use mvn wrapper in CI builds. (#1351)
+* [#1191][FOLLOWUP] improvement(conf): use the unified name for hybrid storage
in conf (#1350)
+* [#960][FOLLOWUP] fix(dashboard): Fix get_pid_file_name function for the
dashboard. (#1346)
+* [MINOR] improvement: use mvn wrapper for builds (#1345)
+* [#901] feat(server): respect disk capacity watermark rather than uniffle
capacity (#1337)
+* [#1342] improvement(server): dump appId when clearing resource fails (#1343)
+* [#1110] improvement(coordinator): introduce pluggable remote storage config
format (#1329)
+* [#1330] improvement: optimize tips for checking replica settings (#1334)
+* [#1187] feat(netty): Netty Encoder Support zero-copy. (#1313)
+* [#960][part-3] feat(dashboard): Provides a start-stop script for the
dashboard. (#1056)
+* [#1308] improvement(rust): detect whether data has been purged in UT (#1323)
+* [#1213] feat(rust): Support block filter by taskId when getting memory data
(#1311)
+* [#1290] improvement(operator): Avoid accidentally deleting data of other
services when misconfiguring the mounting directory (#1291)
+* [MINOR] fix: flaky test
ShuffleTaskManagerTest#checkAndClearLeakShuffleDataTest (#1320)
+* [MINOR] test: flaky test GrpcServerTest.testGrpcExecutorPool (#1321)
+* [#960][part-2] feat(dashboard): Add a dashboard front-end module. (#1055)
+* [#825][part-7] feat(spark): Write Stage resubmit and dynamic shuffle server
assign integration tests. (#1148)
+* [#1300] feat(mr): Support combine operation in map stage for mr engine.
(#1301)
+* [#1309] fix(spark): WriteBufferManager in Spark2 does not use a reassigned
shuffle server. (#1310)
+* [#1307] feat(rust): make each thread listen the socket to improve throughput
in tonic (#1306)
+* [#960][part-1] feat(dashboard): Add some dashboard interfaces. (#1053)
+* [#825][part-6] feat(spark): Added logic that failed to send ShuffleServer.
(#1147)
+* [#1293] feat(rust): Add total_read_data metric (#1298)
+* [#1094] docs: split client_guide.md (#1299)
+* [#1221] feat(rust): Support grpc server graceful shutdown (#1292)
+* [#1294] feat(rust): introduce the unified grpc latency metrics for all
requests (#1295)
+* [#1296] improvement(rust): use std.sync.lock to replace tokio lock for
better performance (#1216)
+* [#825][part-5] feat(spark): Adds the RPC interface to reassign the
ShuffleServer list. (#1146)
+* [MINOR] docs: update jar name for spark client (#1289)
+* [MINOR] chore: add scripts for publishing tarballs to svn (#1284)
+* [#1286] improvement(server): Add RemoveResourceTime Metric (#1288)
+* [#1271] improvement(server): change transportTime and processTime summary to
Thread Pool Instead of block (#1272)
+* [#1269] fix(tez): uniqueMapId may be not unique when more than one fetcher
are working. (#1270)
+* [#1246] feat(tez): Support remote spill for unordered input. (#1250)
+* [#825][part-4] feat(spark): Report write failures to ShuffleManager. (#1258)
+* [MINOR] fix: missing to build spark shaded modules (#1282)
+* [#1275] chore: add scripts for publishing maven releases (#1281)
+* [#1274] feat: add shaded module for spark2 client (#1280)
+* [#1273] feat: add shaded module for spark3 client (#1279)
+* [#825][part-3] feat(spark): Get the ShuffleServer corresponding to the
partition from ShuffleManager. (#1141)
+* [#1277] chore: add flatten maven plugin (#1278)
+* [#1252] fix(server): Incorrect storage write fail metric (#1253)
+* [#825][FOLLOWUP] fix(spark): Apply a thread safety way to track the blocks
sending result (#1260)
+* [#1254][FOLLOWUP] fix(test): Fix the flaky test RssShuffleTest. (#1259)
+* [#1261] fix(spark): Throw out InterruptedException for sleep in
requestExecutorMemory #1262
+* [#1256] refactor: optimize collections construction (#1257)
+* [#1254] fix(test): Fix the flaky test RssShuffleTest. (#1255)
+* [#825][part-2] feat(spark): Report failed blocks and a list of
ShuffleServer. (#1138)
+* [#244][FOLLOWUP] test: CoordinatorGrpcTest.rpcMetricsTest. (#1251)
+* [#1231] feat(tez): Support remote spill in merge stage. (#1245)
+* [#1243] fix(test): Fix the flaky test `SparkSQLTest` and `RepartitionTest`
(#1244)
+* [#1089] feat(spark): Add dynamic allocation patch for Spark 2.3 (#1242)
+* [#1237] feat(rust): support populating args by clap (#1236)
+* [#1088] feat(spark): Add dynamic allocation patch for Spark 3.0 (#1241)
+* [#1234] improvement(rust): separate runtimes for different overload (#1233)
+* [#1090] refactor: Refactor the reader code with builder pattern (#1232)
+* [#1219] fix(test): Fix the flaky test `WriteAndReadMetricsTest` (#1235)
+* [#1206] chore(rust): ignore generated proto code in git (#1229)
+* [#1091] refactor: Refactor the writer code with builder pattern (#1228)
+* [MINOR] Fix Kubernetes CI pipeline (#1227)
+* [#802] feat(spark): Implement ShuffleDataIo (#1226)
+* [#825][part-1] feat(spark): Add the RPC interface for reassigning
ShuffleServer (#1137)
+* [#1085] feat(spark): Add dynamic allocation patch for Spark 3.4 (#1225)
+* [#1201] improvement: only invoking LOG.debug when LOG.isDebugEnabled() is
true (#1217)
+* [#1084] feat: Add dynamic allocation patch for Spark 3.3 (#1224)
+* [#1083] feat(spark): Support Spark 3.5 (#1223)
+* [#1211] fix(server): unexpectedly removing resources when app has
re-registered shuffle later (#1212)
+* [#1206] chore(rust): remove the auto-generated proto code (#1218)
+* [#1209] improvement(server): Speed up cleanupStorageSelectionCache method in
LocalStorageManager. (#1210)
+* [#1206][part-2] feat(rust): introduce rust based shuffle-server (#1208)
+* [#1206][part-1] feat(rust): create folder for rust-based shuffle server
(#1207)
+* [#1204] chores(ci): Fix the ci pipeline of Kubernetes #1205
+* [#1202] improvement: Add HealthScriptChecker for execute special health
check shell script (#1203)
+* [#1198] improvement: zerocopy from Protobuf's ByteString to Netty's ByteBuf
(#1199)
+* [#1192] improvement(hdfs): Add
`RSS_SECURITY_HADOOP_KERBEROS_PROXY_USER_ENABLE` conf for storing shuffle data
(#1194)
+* [MINOR] refactor: Rename MultiStorage to HybridStorage (#1191)
+* [MINOR] Remove extra directory (#1190)
+* [#1178] improvement: set `rss.coordinator.quota.default.app.num` default -1
to indicate no quota check (#1186)
+* [#1182] fix(operator): The LeaderElectionNamespace of the rss-controller is
hard-coded to kube-system. (#1183)
+* [#1175] fix(netty): Retry failed with StacklessClosedChannelException after
channel closed (#1181)
+* [#1177] improvement: Reduce the write time of tasks (#1179)
+* [MINOR] docs: Fix spark.serializer in README and client_guide (#1180)