This is an automated email from the ASF dual-hosted git repository. enricomi pushed a commit to branch release-notes-0.9.0 in repository https://gitbox.apache.org/repos/asf/incubator-uniffle-website.git
commit 440ff350217562b6155346263c46fc267fcef44e
Author: Enrico Minack <[email protected]>
AuthorDate: Tue May 28 09:55:44 2024 +0200

    Add release notes vor v0.9.0
---
 download/release-notes-0.9.0.md | 249 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 249 insertions(+)

diff --git a/download/release-notes-0.9.0.md b/download/release-notes-0.9.0.md
new file mode 100644
index 0000000..2f1cf06
--- /dev/null
+++ b/download/release-notes-0.9.0.md
@@ -0,0 +1,249 @@
+---
+title: Release Notes 0.9.0
+sidebar_position: 995
+---
+
+# Uniffle Release 0.9.0
+
+## Highlight
+
+- Introduce dashboard.
+
+## ChangeLog
+
+* [#1149] fix: GC logs in JDK 11 do not include date and time stamps. (#1240)
+* [#1675][FOLLOWUP] fix(test): Fix various flaky tests (#1730)
+* [MINOR] fix: Update outdated config: rss.writer.send.check.timeout -> rss.client.send.check.timeout.ms (#1734)
+* [#1721] fix(coordinator): classCastExpection of boolean->String with yaml style remote client conf (#1722)
+* [#1673] fix(K8S): Fix the deployment of stable version K8S cluster (#1694)
+* [#1675][FOLLOWUP] fix(test): Fix flaky tests which may cause port conflicts (#1696)
+* [MINOR] fix(typo): Correct the removeShuffle method name (#1697)
+* [MINOR] docs: modify the default value of `rss.coordinator.select.partition.strategy` in docs (#1692)
+* [#1680] improvement(server): Remove partial HDFS files that written by server self for expired apps (#1681)
+* [#1675] fix(test): Fix tests which may be flaky on different machines (#1676)
+* [#1684] fix(server): use the diskSize obtained from periodic check to determine whether is writable (#1685)
+* [#1678] fix(server): disk size leak on removing resources by AppPurgeEvent (#1679) (#1689)
+* [#1657] build: Add license information after version 0.9.0 (#1671)
+* [MINOR] chore(rust): disable flaky test of local_store_test (#1674)
+* [#1459][FOLLOWUP] fix(server): Fix the issue of log variable printing (#1672)
+* [#1459][FOLLOWUP] improvement(server): Print an error log when an event is dropped (#1643)
+* [#1341] fix(mr): Fix MR Combiner ArrayIndexOutOfBoundsException Bug. (#1666)
+* [#378][FOLLOWUP] fix(server): Fix huge_partition_num metric (#1669)
+* [#1662] fix(test): Fix Netty related flaky tests (#1663)
+* [#1629] fix(operator): Support parsing NaN float value in metrics (#1630)
+* [#1634] fix(server): remove app folder if app is expired (#1635)
+* [MINOR] chore(rust): disable flaky test of test_ticket_manager (#1637)
+* [#1596][FOLLOWUP] fix(netty): Send failed responses only when the channel is writable (#1641)
+* [#1626] fix(server): Remove the meaningless eventOfUnderStorageManagers cache (#1627)
+* [#1631] fix(server): ShuffleTaskInfo may leak when app is removed. (#1632)
+* [#1373][FOLLOWUP] fix(spark): register with incorrect partitionRanges after reassign (#1612)
+* [#1608][part-2] fix(spark): avoid releasing block in advance when enable block resend (#1610)
+* [#1606] feat(client): Add client retry mechanism for NO_BUFFER when reading data(memory/local/index) (#1616)
+* [#1608][part-1] fix(spark): Only share the replacement servers for faulty servers in one stage (#1609)
+* [#1373][FOLLOWUP] fix(spark): shuffle manager rpc service invalid when partition data reassign is enabled (#1583)
+* [#1596] fix(netty): Use a ChannelFutureListener callback mechanism to release readMemory (#1605)
+* [#1598] fix(server) Fix inaccurate used_direct_memory_size metric (#1599)
+* [#1472][FOLLOWUP] improvement(server): Release memory more accurately when failing to cache shuffle data (#1597)
+* [MINOR] refactor: Calling lock() method outside try block to avoid unnecessary errors (#1590)
+* [#1591] feat(spark): Support Spark 3.5.1 (#1592)
+* [#1586] improvement(netty): Allow Netty Worker thread pool size to dynamically adapt to the number of processor cores (#1587)
+* [#1588] improvement(server): Add exception handling for the thread pool when flushing events (#1589)
+* [#1576] feat(doc): server deploy guide without hadoop-home env (#1577)
+* [#1571] fix(server): Memory may leak when `EventInvalidException` occurs (#1574)
+* [#1373][FOLLOWUP] fix(spark): incorrect partition id type (#1582)
+* [#1373][FOLLOWUP] fix(spark3):Add client type when request shuffle assignment (#1580)
+* build(deps): bump google.golang.org/protobuf from 1.28.0 to 1.33.0 (#1575)
+* [#1554] feat(spark): Fetch dynamic client conf as early as possible (#1557)
+* [#1572] fix(spark): Exceptions might be discarded when spilling buffers (#1573)
+* [#1564] fix(server): disk health check invalid when hang (#1568)
+* [#731][FOLLOWUP] feat(Spark): Configure blockIdLayout for Spark based on max partitions (#1566)
+* [#1567] fix(spark): Let Spark use its own NettyUtils (#1565)
+* [#1569] fix(rust): flaky test for test_ticket_manager (#1570)
+* [MINOR] improvement(test): A better computation logic for WriteAndReadMetricsTest without using reflection (#1563)
+* [#731] feat(spark): Make blockid layout configurable for Spark clients (#1528)
+* [#808] improvement(spark): Verify the number of written records to ensure data correctness (#1558)
+* [MINOR] improvement(client): Override getClientInfo method in ShuffleServerGrpcNettyClient and remove unused getDesc method (#1559)
+* [#1552] improvement: Migrate from log4j1 to log4j2 (#1553)
+* [#1472][part-6] followup: Fix Netty transport time when sending shuffle data requests (#1551)
+* [#134][FOLLOWUP] improvement(spark2): Use taskId and attemptNo as taskAttemptId (#1544)
+* [#1549] fix(common): Uniformly throw RssException for external callers (#1550)
+* [MINOR] test: Use sensible partition ids in ShuffleReadClientImplTest (#1545)
+* [#1546] fix(spark): NPE could happen before uncompressing after #1360 (#1547)
+* feat(docker): Add example docker compose Uniffle/Spark cluster (#1532)
+* [#1472][part-6] fix(netty): Make UTs truly test Netty mode (#1540)
+* [MINOR] improvement(tez): Only invoking LOG.debug when LOG.isDebugEnabled is true (#1541)
+* [#1459] fix(server): Memory leak for exceptional scenarios when flushing events (#1537)
+* [#1472] fix(client): IlegalReferenceCountException for clientReadHandler.readShuffleData (#1536)
+* [#1472][part-5] Use UnpooledByteBufAllocator to fix inaccurate usedMemory issue causing OOM (#1534)
+* [MINOR] refactor(common): Move blockId bit logic into common class (#1527)
+* [#1373][part-1] feat(spark): partition write to multi servers leveraging from reassignment mechanism (#1445)
+* [MINOR] Update dashboard pom.xml to take arguments for node and npm download locations (#1530)
+* [#1316] improvement(spark): detect OutputTracker API version via Spark version (#1317)
+* [#134] improvement(spark3): Use taskId and attemptNo as taskAttemptId (#1529)
+* [MINOR] feat(build): Allow to build distribution without some modules (#1525)
+* [#1407] fix(rust): use grpc runtime worker threads and adjust default runtime config (#1517)
+* [#1407] feat(rust): fix + add total grpc request metrics (#1516)
+* [#1407] chore(rust): add cpu profile doc (#1515)
+* [#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it (#1521)
+* [MINOR] fix(CI): Improve dashboard across the CI (#1526)
+* [#1472][part-3] fix(client): Fix occasional IllegalReferenceCountException issues in extremely rare scenarios (#1522)
+* [MINOR] fix(pom): Add missing shuffle-server dependencies to work with -Ptez
+* [#1472][part-4] feature(server): Add metrics for Netty's pinnedDirectMemory and usedDirectMemory (#1524)
+* [#1472][part-1] fix(server): Upgrade Netty and GRPC (#1520)
+* [MINOR] fix(deploy): Fix invocation of kubernetes bash scripts (#1513)
+* [#1476] feat(rust): Provide dedicated unregister app rpc interface (#1511)
+* [#1476] feat(spark): Provide dedicated unregister app rpc interface (#1510)
+* [MINOR] improvement(CI): Rework build and rust workflow events (#1508)
+* [#1407] fix(rust): drop events and release memory when errors happened (#1509)
+* [#1267][FOLLOWUP] improvement(client): INFO log level should be used in RetryUtils (#1500)
+* [MINOR] feat(CI): Report test results in github comments (#1506)
+* [#1407] fix(rust): return error when getting data from hdfs by client (#1507)
+* [#1501] fix(server): storage selection cache accidentally deleted when clearing stage level data. (#1505)
+* [#1407] fix(rust): dont panic when no available local disks (#1504)
+* fix(rust): avoid checking storage type in runtime (#1503)
+* [MINOR] build: Move dashboard module into profile and disable it by default (#1498)
+* [#1497] improvement(spark): flushing buffer if the memoryUsed of the first record of `WriterBuffer` larger than bufferSize (#1485)
+* [MINOR] improvement(test): Identify duplicate blocks in TestUtils.validateResult (#1495)
+* [MINOR] fix: Get and increment ATOMIC_LONG in that order everywhere (#1496)
+* [MINOR] docs: Improve comment on blockId structure (#1492)
+* [MINOR] fix(server): Assert actual number of bitmaps matches bitNum (#1493)
+* [#1490] improvement(spark3): Disable dynamic allocation shuffle tracking by default (#1491)
+* [#1407] feat(rust): support more metrics about disk and topN data size (#1488)
+* [#1407] feat(rust): support multiple spill policies and simplify hdfs config (#1487)
+* [#1356] feat(server): improve expired buffers metric and log (#1469)
+* [#1464][FOLLOWUP] improvement(spark): print abnormal shuffle servers that blocks fail to send (#1473)
+* [#1467] feat(server): introduce total hdfs write data size for huge partition (#1468)
+* [#1355] fix(client): Netty client will leak when decoding responses (#1455)
+* [#1462] fix(server): Memory may leak when flushQueue is full (#1463)
+* [#1466] feat(server): introduce the JvmPauseMonitor to detect the gc pause (#1470)
+* [#1459] improvement(server): refactor DefaultFlushEventHandler and support event retry into pending queue (#1461)
+* [#1464] improvement(spark): print abnormal shuffle servers that blocks fail to send (#1465)
+* [#1456] improvement(client): Better exception handling when calling requireBuffer using GRPC (#1457)
+* [#1428] fix(server): fallback invalid when local storage can't write (#1429)
+* [#1453] improvement: Force to use the UNIX line ending when using spotless-maven-plugin (#1454)
+* [#1447] feat(client): Introduce configurations to control default behavior of RPC client (#1448)
+* [#1267] improvement(client): throw the detailed stacktrace when exceptions happened (#1411)
+* [#1189][FOLLOWUP] fix(server): Start NettyDirectMemoryTracker. (#1432)
+* [#333] feat(server): expose metrics of TopN app bytes in one shuffle server (#1400)
+* [#1433] fix(server): Race conditions with ShuffleServer state (#1434)
+* [MINOR] refactor: avoid unnecessary bitmap clone and AND (#1442)
+* [#532] fix: spotBugs of SC_START_IN_CTOR (#1440)
+* [#1435] improvement: Improve log4j settings to avoid annoying messages (#1436)
+* [MINOR] refactor: Avoid unnecessary recursion (#1441)
+* [#1407] feat(rust): refactor localfile store to speed up writing (#1422)
+* [#1416] feat(spark): support custom hadoop config in client side (#1417)
+* [#1119] improvement(client): Explicitly throw `BUFFER_LIMIT_OF_HUGE_PARTITION` (#1425)
+* [#974] fix(coordinator): Dynamic remote storage conf invalid for `LegacyClientConfParser` (#1424)
+* [#1420] fix(client): reportShuffleWriteFailure failed because of IndexOutOfBoundsException (#1421)
+* [#1356] improvement: add metric of total expired pre-allocated buffers (#1412)
+* [#1414] feat(rust): introduce native hdfs client (#1415)
+* [#1024] improvement(tez): Optimize user switch to shuffle mode local/remote. (#1397)
+* [#1403] fix(client): RSS client configurations are not working. (#1404)
+* [#1409] fix(client): Netty Epoll is unavailable for the RSS Client. (#1410)
+* [#1407] improvement(rust): Critical bug fix of getting blockIds and some optimization (#1408)
+* [#825][followup] fix(spark): Fix without returning an exception. (#1402)
+* [#1385] improvement: Improve log4j appender layout pattern (#1386)
+* [851] improvement: Add a similar util method like ThreadUtils.parmap in the Spark (#1396)
+* [#363] improvement(server): Make the coordinator client managed by CoordinatorClientFactory singleton (#1377)
+* [#1391] fix(server): Direct memory may leak in exceptional scenarios in shuffle server. (#1392)
+* [#1157] fix(tez): Container not exit because shuffle client is not closed
+* [#460] improvement: Exit on OutOfMemoryError (#1390)
+* [1387] improvement: compatibility with jdk8 when call JavaUtils.newConcurrentMap (#1389)
+* [#1369] feat: Provide distribution with Hadoop dependencies (#1379)
+* [#1383] [DOCS] Improve Netty's documentation (#1384)
+* [#1358] fix(spark): pre-check bytebuffer whether is direct before uncompress (#1360)
+* [#1364] feat(client): introduce option to control whether to use local hadoop conf (#1370)
+* [MINOR] chore(client): fix the incorrect partitionId (#1376)
+* [#1189] feat(server): Add netty used direct memory size metric (#1363)
+* [#960] fix(dashboard): simplify dependency and correct the startup script (#1347)
+* [#1348] improvement(metrics): Unify tags generation for shuffle-server metrics reporter (#1349)
+* [MINOR] chore: fix kubernetes ci pipeline (#1368)
+* [MINOR] fix(spark): Fix NPE for ShuffleWriteClientImpl.unregisterShuffle (#1367)
+* [#960][part-4] feat(dashboard): Fix some display bugs and optimize the display format. (#1326)
+* [#1267] fix(client): fast fail without retry when oom occurs (#1344)
+* [#1361] feat(netty): add netty metrics into reporter (#1362)
+* [#1335] fix(server)(netty): release bytebuf explicitly when requiredId is expired or cache failed (#1357)
+* [MINOR] chore(client): Specify name for data transfer thread pool (#1353)
+* [#1319] fix(server): Add shaded com.google.guava:failureaccess dependency to prevent NoClassDefFoundError (#1352)
+* [MINOR] improvement: use mvn wrapper in CI builds. (#1351)
+* [#1191][FOLLOWUP] improvement(conf): use the unified name for hybrid storage in conf (#1350)
+* [#960][followup] fix(dashboard): Fix get_pid_file_name function for the dashboard. (#1346)
+* [MINOR] improvement: use mvn wrapper for builds (#1345)
+* [#901] feat(server): respect disk capacity watermark rather than uniffle capacity (#1337)
+* [#1342] improvement(server): dump appId when clearing resource fails (#1343)
+* [#1110] improvement(coordinator): introduce pluggable remote storage config format (#1329)
+* [#1330] improvement: optimize tips for checking replica settings (#1334)
+* [#1187] feat(netty): Netty Encoder Support zero-copy. (#1313)
+* [#960][part-3] feat(dashboard): Provides a start-stop script for the dashboard. (#1056)
+* [#1308] improvement(rust): detect whether data has been purged in UT (#1323)
+* [#1213] feat(rust): Support block filter by taskId when getting memory data (#1311)
+* [#1290] improvement(operator): Avoid accidentally deleting data of other services when misconfiguring the mounting directory (#1291)
+* [MINOR] fix: flaky test ShuffleTaskManagerTest#checkAndClearLeakShuffleDataTest (#1320)
+* [MINOR] test: flaky test GrpcServerTest.testGrpcExecutorPool (#1321)
+* [#960][part-2] feat(dashboard): Add a dashboard front-end module. (#1055)
+* [#825][part-7] feat(spark): Write Stage resubmit and dynamic shuffle server assign integration tests. (#1148)
+* [#1300] feat(mr): Support combine operation in map stage for mr engine. (#1301)
+* [#1309] fix(spark): WriteBufferManager in Spark2 does not use a reassigned shuffle server. (#1310)
+* [#1307] feat(rust): make each thread listen the socket to improve throughput in tonic (#1306)
+* [#960][part-1] feat(dashboard): Add some dashboard interfaces. (#1053)
+* [#825][part-6] feat(spark): Added logic that failed to send ShuffleServer. (#1147)
+* [#1293] feat(rust): Add total_read_data metric (#1298)
+* [#1094] docs: split client_guide.md (#1299)
+* [#1221] feat(rust): Support grpc server graceful shutdown (#1292)
+* [#1294] feat(rust): introduce the unified grpc latency metrics for all requests (#1295)
+* [#1296] improvement(rust): use std.sync.lock to replace tokio lock for better performance (#1216)
+* [#825][part-5] feat(spark): Adds the RPC interface to reassign the ShuffleServer list. (#1146)
+* [MINOR] docs: update jar name for spark client (#1289)
+* [MINOR] chore: add scripts for publishing tarballs to svn (#1284)
+* [#1286] improvement(server): Add RemoveResourceTime Metric (#1288)
+* [#1271] improvement(server): change transportTime and processTime summary to Thread Pool Instead of block (#1272)
+* [#1269] fix(tez): uniqueMapId may be not unique when more than one fetcher are working. (#1270)
+* [#1246] feat(tez): Support remote spill for unordered input. (#1250)
+* [#825][part-4] feat(spark): Report write failures to ShuffleManager. (#1258)
+* [MINOR] fix: missing to build spark shaded modules (#1282)
+* [#1275] chore: add scripts for publishing maven releases (#1281)
+* [#1274] feat: add shaded module for spark2 client (#1280)
+* [#1273] feat: add shaded module for spark3 client (#1279)
+* [#825][part-3] feat(spark): Get the ShuffleServer corresponding to the partition from ShuffleManager. (#1141)
+* [#1277] chore: add flatten maven plugin (#1278)
+* [#1252] fix(server): Incorrect storage write fail metric (#1253)
+* [#825][FOLLOWUP] fix(spark): Apply a thread safety way to track the blocks sending result (#1260)
+* [#1254][FOLLOWUP] fix(test): Fix the flaky test RssShuffleTest. (#1259)
+* [#1261] fix(spark): Throw out InterruptedException for sleep in requestExecutorMemory #1262
+* [#1256] refactor: optimize collections contruction (#1257)
+* [#1254] fix(test): Fix the flaky test RssShuffleTest. (#1255)
+* [#825][part-2] feat(spark): Report failed blocks and a list of ShuffleServer. (#1138)
+* [#244][FOLLOWUP] test: CoordinatorGrpcTest.rpcMetricsTest. (#1251)
+* [#1231] feat(tez): Support remote spill in merge stage. (#1245)
+* [#1243] fix(test): Fix the flaky test `SparkSQLTest` and `RepartitionTest` (#1244)
+* [#1089] feat(spark): Add dynamic allocation patch for Spark 2.3 (#1242)
+* [#1237] feat(rust): support populating args by clap (#1236)
+* [#1088] feat(spark): Add dynamic allocation patch for Spark 3.0 (#1241)
+* [#1234] improvement(rust): separate runtimes for different overload (#1233)
+* [#1090] refactor: Refactor the reader code with builder pattern (#1232)
+* [#1219] fix(test): Fix the flaky test `WriteAndReadMetricsTest` (#1235)
+* [#1206] chore(rust): ignore generated proto code in git (#1229)
+* [#1091] refactor: Refactor the writer code with builder pattern (#1228)
+* [MINOR] Fix kubernetest CI pipeline (#1227)
+* [#802] feat(spark): Implement ShuffleDataIo (#1226)
+* [#825][part-1] feat(spark): Add the RPC interface for reassigning ShuffleServer (#1137)
+* [#1085] feat(spark): Add dynamic allocation patch for Spark 3.4 (#1225)
+* [#1201] improvement: only invoking LOG.debug when LOG.isDebugEnabled() is true (#1217)
+* [#1084] feat: Add dynamic allocation patch for Spark 3.3 (#1224)
+* [#1083] feat(spark): Support Spark 3.5 (#1223)
+* [#1211] fix(server): unexpectedly removing resources when app has re-registered shuffle later (#1212)
+* [#1206] chore(rust): remove the auto-generated proto code (#1218)
+* [#1209] improvement(server): Speed up cleanupStorageSelectionCache method in LocalStorageManager. (#1210)
+* [#1206][part-2] feat(rust): introduce rust based shuffle-server (#1208)
+* [#1206][part-1] feat(rust): create folder for rust-based shuffle server (#1207)
+* [#1204] chores(ci): Fix the ci pipeline of Kubernetes #1205
+* [#1202] improvement: Add HealthScriptChecker for execute special health check shell script (#1203)
+* [#1198] improvement: zerocopy from Protobuf's ByteString to Netty's ByteBuf (#1199)
+* [#1192] improvement(hdfs): Add `RSS_SECURITY_HADOOP_KERBEROS_PROXY_USER_ENABLE` conf for storing shuffle data (#1194)
+* [MINOR] refactor: Rename MultiStorage to HybridStorage (#1191)
+* [MINOR] Remove extra directory (#1190)
+* [#1178] improvement: set `rss.coordinator.quota.default.app.num` default -1 to indicate no quota check (#1186)
+* [#1182] fix(operator): The LeaderElectionNamespace of the rss-controller is hard-coded to kube-system. (#1183)
+* [#1175] fix(netty): Retry failed with StacklessClosedChannelException after channel closed (#1181)
+* [#1177] improvement: Reduce the write time of tasks (#1179)
+* [MINOR] docs: Fix spark.serializer in README and client_guide (#1180)
