Ivan Veselovsky commented on IGNITE-4037:

The main point here is that the total amount of memory we have (heap, off-heap, 
mapped, whatever) is limited, and is not very large. So, if we use mapped 
memory, we can only use small (limited-size) memory windows that we move over 
huge files. This window re-positioning is an expensive operation, which is why 
stream access to files is much more efficient than random I/O. 
If we use HadoopSkipList (or any other sorted collection), then when we 
mutate it (e.g. insert a new element into the sorted collection), we need to 
re-position the memory window randomly to find the insertion point and write 
the new key-value pair. This leads to random disk access (in the case of mapped 
memory, many window re-positionings) and drops performance. 
The alternative is to perform sorting (random access) in a small (fixed-size) 
memory buffer, then spill it to disk, and then merge the sorted sequences 
together using a merge-sort step. This uses only stream-like disk 
access, and, AFAIK, this is what both Hadoop and Spark do. Currently I don't 
see a better alternative, so my suggestion is to use a similar 
implementation in Ignite. 
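To make the spill-and-merge idea concrete, here is a minimal sketch (not Ignite code; class and file names are illustrative): each in-memory buffer is sorted and written to disk as a sorted run, and the runs are then merged with a priority queue so that every file is read as a sequential stream.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Illustrative sketch of external sorting: sort small buffers in memory,
// spill each one as a sorted run file, then stream-merge the runs.
public class ExternalSortSketch {

    // Sort the buffer in memory (random access, but only in RAM) and
    // spill it to disk with a single sequential write.
    static Path spill(List<String> buf, int runNo, Path dir) throws IOException {
        Collections.sort(buf);
        Path run = dir.resolve("run-" + runNo);
        Files.write(run, buf);
        return run;
    }

    // Merge the sorted runs with a min-heap; each run is consumed as a
    // stream, so disk access stays sequential.
    static List<String> merge(List<Path> runs) throws IOException {
        List<BufferedReader> rs = new ArrayList<>();
        // Heap entries: {line, index of the reader it came from}.
        PriorityQueue<Object[]> heap =
            new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
        for (Path p : runs)
            rs.add(Files.newBufferedReader(p));
        for (int i = 0; i < rs.size(); i++) {
            String line = rs.get(i).readLine();
            if (line != null) heap.add(new Object[] {line, i});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] e = heap.poll();
            out.add((String) e[0]);
            String next = rs.get((int) e[1]).readLine();
            if (next != null) heap.add(new Object[] {next, (int) e[1]});
        }
        for (BufferedReader r : rs) r.close();
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("runs");
        // Two "buffers" that each fit in memory, spilled as sorted runs.
        Path r1 = spill(new ArrayList<>(Arrays.asList("delta", "alpha")), 1, dir);
        Path r2 = spill(new ArrayList<>(Arrays.asList("charlie", "bravo")), 2, dir);
        System.out.println(merge(Arrays.asList(r1, r2)));
        // [alpha, bravo, charlie, delta]
    }
}
```

The point of the sketch is that the only random access happens inside the in-memory buffer; every disk operation is a sequential read or write.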

HadoopSkipList can be used as the memory buffer for sorting. The only change I 
would suggest is to use org.apache.hadoop.io.RawComparator#compare, because it 
avoids full deserialization of the keys during comparison, though it still 
requires obtaining an on-heap byte array. For that reason, an on-heap 
collection may be even preferable.
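To illustrate the RawComparator idea, here is a self-contained sketch (not Hadoop code) of comparing serialized keys directly on their bytes, mirroring the RawComparator#compare(byte[], int, int, byte[], int, int) signature. Unsigned lexicographic byte comparison is what Hadoop's WritableComparator does by default for byte ranges; whether that order matches the logical key order depends on the key encoding, which is an assumption here.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: compare two serialized keys byte-by-byte,
// without deserializing them into objects first.
public class RawCompareSketch {

    // Unsigned lexicographic comparison of two byte ranges, in the shape
    // of RawComparator#compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2).
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int d = (b1[s1 + i] & 0xff) - (b2[s2 + i] & 0xff);
            if (d != 0) return d;
        }
        return l1 - l2; // shorter key sorts first on a common prefix
    }

    public static void main(String[] args) {
        byte[] a = "apple".getBytes(StandardCharsets.UTF_8);  // serialized key 1
        byte[] b = "banana".getBytes(StandardCharsets.UTF_8); // serialized key 2
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0); // true
    }
}
```

Note that even this raw comparison needs the bytes in an on-heap array, which is the caveat above.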

> High memory consumption when executing TeraSort Hadoop example
> --------------------------------------------------------------
>                 Key: IGNITE-4037
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4037
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Ivan Veselovsky
>            Assignee: Ivan Veselovsky
>             Fix For: 1.7
> When executing TeraSort Hadoop example, we observe high memory consumption 
> that frequently leads to cluster malfunction.
> The problem can be reproduced in a unit test, even with 1 node and with an 
> input data set as small as 100 MB. 
> Dump analysis shows that  memory is taken in various queues: 
> org.apache.ignite.internal.processors.hadoop.taskexecutor.HadoopExecutorService#queue
> and the task queue of 
> org.apache.ignite.internal.processors.hadoop.jobtracker.HadoopJobTracker#evtProcSvc.
> Since objects stored in these queues hold byte arrays of significant size, 
> memory is consumed very fast.
> It looks like the real cause of the problem is that some tasks are blocked.

This message was sent by Atlassian JIRA
