[
https://issues.apache.org/jira/browse/NUTCH-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-3149:
----------------------------------------
Summary: Investigate Remote Shuffle Service Integration (was: Investigate
Remote Shuffle Service Integration (Apache Uniffle / Celeborn) for
Shuffle-Intensive Nutch Jobs)
> Investigate Remote Shuffle Service Integration
> ----------------------------------------------
>
> Key: NUTCH-3149
> URL: https://issues.apache.org/jira/browse/NUTCH-3149
> Project: Nutch
> Issue Type: Improvement
> Components: shuffle
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
>
> Several core jobs are shuffle-intensive, meaning they spend significant time
> and resources moving intermediate data between map and reduce phases. Remote
> shuffle services (such as Apache Uniffle and Apache Celeborn) offload this
> work to dedicated nodes, potentially improving:
> * Job execution time
> * Fault tolerance
> * Resource utilization
> * Scalability for large crawls
> h4. Shuffle-Intensive Nutch Jobs Identified
> ||Job||Shuffle Intensity||Key Characteristics||
> |LinkRank|Extreme|Iterative (10+ iterations), invert + analyze per iteration|
> |WebGraph|Very High|3 sequential jobs (OutlinkDb → InlinkDb → NodeDb)|
> |LinkDb|High|Full link inversion across segments|
> |CrawlDb|High|URL grouping from multiple segments|
> |Generator|Medium-High|Score-based selection + host/domain partitioning|
> |Indexer|Medium-High|Multi-source aggregation (CrawlDb, LinkDb, segments)|
> |Deduplication|Medium|Content signature-based grouping|
> |Fetcher|Low|Primarily map-only, minimal shuffle|
> h4. Key Metrics for Measuring Shuffle Intensity
> Shuffle intensity can be determined from Hadoop's built-in TaskCounter and
> FileSystemCounter groups:
> * MAP_OUTPUT_BYTES / HDFS_BYTES_READ → Shuffle Ratio
> * SPILLED_RECORDS / MAP_OUTPUT_RECORDS → Spill Ratio
> * FILE_BYTES_READ + FILE_BYTES_WRITTEN → Local disk I/O overhead
> Jobs with Shuffle Ratio > 2.0x or Spill Ratio > 2.0x are strong candidates
> for remote shuffle optimization.
> An initial step would be to implement a configurable (off/off switch)
> reporting capability to write shuffle reports to the Nutch log. This will
> enable Nutch admins to determine over time if a remote shuffle candidate
> would benefit their Nutch jobs.
> A future ticket will establish an experiment (comparative analysis) for Nutch
> shuffle data between Hadoop native (no changes), Uniffle and Celeborn.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)