[ https://issues.apache.org/jira/browse/NUTCH-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-3149:
----------------------------------------
    Summary: Investigate Remote Shuffle Service Integration  (was: Investigate 
Remote Shuffle Service Integration (Apache Uniffle / Celeborn) for 
Shuffle-Intensive Nutch Jobs)

> Investigate Remote Shuffle Service Integration
> ----------------------------------------------
>
>                 Key: NUTCH-3149
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3149
>             Project: Nutch
>          Issue Type: Improvement
>          Components: shuffle
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>
> Several core Nutch jobs are shuffle-intensive, meaning they spend significant time 
> and resources moving intermediate data between map and reduce phases. Remote 
> shuffle services (such as Apache Uniffle and Apache Celeborn) offload this 
> work to dedicated nodes, potentially improving:
>  * Job execution time
>  * Fault tolerance
>  * Resource utilization
>  * Scalability for large crawls
> h4. Shuffle-Intensive Nutch Jobs Identified
> ||Job||Shuffle Intensity||Key Characteristics||
> |LinkRank|Extreme|Iterative (10+ iterations), invert + analyze per iteration|
> |WebGraph|Very High|3 sequential jobs (OutlinkDb → InlinkDb → NodeDb)|
> |LinkDb|High|Full link inversion across segments|
> |CrawlDb|High|URL grouping from multiple segments|
> |Generator|Medium-High|Score-based selection + host/domain partitioning|
> |Indexer|Medium-High|Multi-source aggregation (CrawlDb, LinkDb, segments)|
> |Deduplication|Medium|Content signature-based grouping|
> |Fetcher|Low|Primarily map-only, minimal shuffle|
> h4. Key Metrics for Measuring Shuffle Intensity
> Shuffle intensity can be determined from Hadoop's built-in TaskCounter and 
> FileSystemCounter groups:
>  * MAP_OUTPUT_BYTES / HDFS_BYTES_READ → Shuffle Ratio
>  * SPILLED_RECORDS / MAP_OUTPUT_RECORDS → Spill Ratio
>  * FILE_BYTES_READ + FILE_BYTES_WRITTEN → Local disk I/O overhead
> Jobs with Shuffle Ratio > 2.0x or Spill Ratio > 2.0x are strong candidates 
> for remote shuffle optimization.
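> As a rough sketch of how these ratios could be computed once a job completes 
> (using the standard Hadoop TaskCounter and FileSystemCounter enums; the class 
> and method names below are illustrative only, not a settled design):
> {code:java}
> import org.apache.hadoop.mapreduce.Counters;
> import org.apache.hadoop.mapreduce.FileSystemCounter;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.TaskCounter;
> 
> public class ShuffleMetrics {
> 
>   /** Derives shuffle/spill ratios from a completed job's built-in counters. */
>   public static void printShuffleRatios(Job job) throws Exception {
>     Counters counters = job.getCounters();
> 
>     long mapOutputBytes = counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
>     long mapOutputRecords = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
>     long spilledRecords = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
>     long hdfsBytesRead = counters.findCounter("hdfs", FileSystemCounter.BYTES_READ).getValue();
>     long localBytesRead = counters.findCounter("file", FileSystemCounter.BYTES_READ).getValue();
>     long localBytesWritten = counters.findCounter("file", FileSystemCounter.BYTES_WRITTEN).getValue();
> 
>     // Shuffle Ratio: intermediate map output relative to input read from HDFS
>     double shuffleRatio = hdfsBytesRead > 0 ? (double) mapOutputBytes / hdfsBytesRead : 0.0;
>     // Spill Ratio: records spilled to local disk relative to map output records
>     double spillRatio = mapOutputRecords > 0 ? (double) spilledRecords / mapOutputRecords : 0.0;
>     // Local disk I/O overhead incurred by the shuffle
>     long localIoBytes = localBytesRead + localBytesWritten;
> 
>     System.out.printf("shuffleRatio=%.2fx spillRatio=%.2fx localIoBytes=%d%n",
>         shuffleRatio, spillRatio, localIoBytes);
>   }
> }
> {code}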
> An initial step would be to implement a configurable (on/off switch) 
> reporting capability that writes shuffle reports to the Nutch log. This will 
> enable Nutch admins to determine over time whether a remote shuffle service 
> would benefit their Nutch jobs.
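> A minimal sketch of what such a switched report might look like, assuming a 
> hypothetical property name {{shuffle.report.enabled}} and slf4j logging (both 
> the property name and class name below are illustrative, not a settled design):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.TaskCounter;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
> 
> /** Writes a per-job shuffle report to the Nutch log when enabled. */
> public class ShuffleReporter {
> 
>   private static final Logger LOG = LoggerFactory.getLogger(ShuffleReporter.class);
> 
>   // Hypothetical property key; the real key would be agreed on this ticket.
>   public static final String SHUFFLE_REPORT_ENABLED = "shuffle.report.enabled";
> 
>   /** Intended to be called by a Nutch tool after job.waitForCompletion() returns. */
>   public static void maybeReport(Configuration conf, Job job) throws Exception {
>     if (!conf.getBoolean(SHUFFLE_REPORT_ENABLED, false)) {
>       return; // reporting is switched off by default
>     }
>     long mapOutputBytes =
>         job.getCounters().findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
>     long reduceShuffleBytes =
>         job.getCounters().findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();
>     LOG.info("Shuffle report for {}: mapOutputBytes={} reduceShuffleBytes={}",
>         job.getJobName(), mapOutputBytes, reduceShuffleBytes);
>   }
> }
> {code}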
> A future ticket will establish a comparative experiment measuring Nutch 
> shuffle performance across native Hadoop shuffle (no changes), Uniffle, and 
> Celeborn.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
