[GitHub] spark pull request: [SPARK-14289][WIP] Support multiple eviction s...

Earne Wed, 06 Apr 2016 02:04:23 -0700

Github user Earne commented on the pull request:

https://github.com/apache/spark/pull/12162#issuecomment-206238646

@rxin The use case that motivated this is described below.

- Java objects consume a factor of 2-5x more space than the "raw" data
inside their fields.

- I ran the graphx.LiveJournalPageRank example on an 8-node cluster (1 node
works as Master, each configured with 45GB of memory for Spark, running in
legacy memory management mode). The dataset (about 30GB) was generated by
HiBench. Over 5 iterations, the time per iteration kept getting worse.

- From the log file, I found that the cause is insufficient memory for
cached RDDs: many partitions with a high recomputation cost are dropped, and
recomputing these partitions takes a lot of time.

- FIFO can be implemented by initializing
[entries](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L90)
with LinkedHashMap\[BlockId, MemoryEntry\[_\]\](32, 0.75f, false). Even
FIFO can perform much better than LRU in this case.

- A storage level such as MEMORY_AND_DISK may partially solve the problem,
but the improvement is limited.
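
To illustrate the FIFO point above, here is a minimal Java sketch of how the third constructor argument of `java.util.LinkedHashMap` (`accessOrder`) changes the order in which an evictor walking the map from its head would drop entries. The class name, the block ids, and the sizes are made up for illustration, and this is not the MemoryStore code itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EvictionOrderDemo {
    // Returns the keys in iteration order, i.e. the order in which an
    // evictor that walks the map from its head would drop them.
    static List<String> evictionOrder(boolean accessOrder) {
        // accessOrder = false: insertion order (FIFO-like eviction).
        // accessOrder = true:  access order (LRU-like eviction).
        Map<String, Integer> entries = new LinkedHashMap<>(32, 0.75f, accessOrder);
        entries.put("rdd_0_0", 100); // hypothetical block ids and sizes
        entries.put("rdd_0_1", 100);
        entries.put("rdd_0_2", 100);
        // A read only moves the entry to the tail when accessOrder is true.
        entries.get("rdd_0_0");
        return new ArrayList<>(entries.keySet());
    }

    public static void main(String[] args) {
        System.out.println("FIFO: " + evictionOrder(false)); // [rdd_0_0, rdd_0_1, rdd_0_2]
        System.out.println("LRU:  " + evictionOrder(true));  // [rdd_0_1, rdd_0_2, rdd_0_0]
    }
}
```

With `accessOrder = false`, the read of `rdd_0_0` does not protect it from eviction, so blocks leave in the order they were cached.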

An eviction strategy that takes the computing cost into consideration may
work well (even in unified memory mode or with the MEMORY_AND_DISK level).
Some cost-aware replacement policies already exist in K-V stores, such as
GD-Wheel (EuroSys '15).
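
As a rough sketch of what "cost-aware" could mean here, the Java snippet below evicts the blocks that are cheapest to recompute per byte first. It is a simple greedy policy written for illustration only, not GD-Wheel and not the code in this PR. The `Block` class, its fields, and the example numbers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CostAwareEvictionSketch {
    // Hypothetical per-block metadata; not a Spark class.
    static final class Block {
        final String id;
        final long sizeBytes;          // assumed to be > 0
        final double recomputeMillis;  // estimated cost to rebuild the block

        Block(String id, long sizeBytes, double recomputeMillis) {
            this.id = id;
            this.sizeBytes = sizeBytes;
            this.recomputeMillis = recomputeMillis;
        }

        double costPerByte() {
            return recomputeMillis / sizeBytes;
        }
    }

    // Greedily picks the blocks with the lowest recompute cost per byte
    // until at least bytesNeeded would be freed. If the cache holds less
    // than bytesNeeded in total, every block is returned.
    static List<String> selectVictims(List<Block> cached, long bytesNeeded) {
        List<Block> sorted = new ArrayList<>(cached);
        sorted.sort(Comparator.comparingDouble(Block::costPerByte));
        List<String> victims = new ArrayList<>();
        long freed = 0;
        for (Block b : sorted) {
            if (freed >= bytesNeeded) {
                break;
            }
            victims.add(b.id);
            freed += b.sizeBytes;
        }
        return victims;
    }

    public static void main(String[] args) {
        List<Block> cached = new ArrayList<>();
        cached.add(new Block("a", 100, 10.0));   // 0.1 ms per byte
        cached.add(new Block("b", 100, 500.0));  // 5.0 ms per byte
        cached.add(new Block("c", 200, 40.0));   // 0.2 ms per byte
        System.out.println(selectVictims(cached, 250)); // [a, c]
    }
}
```

In this example the expensive block `b` stays cached even though it is no more recently used than the others, which is the behavior that LRU and FIFO cannot provide.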

This PR can be split into the sub-tasks below.
- [ ] Refactor to support more than one policy (LRU, FIFO, LFU).

- [ ] Add a policy that takes the computing cost into consideration.

- [ ] Take serialization and deserialization cost into consideration.


