[
https://issues.apache.org/jira/browse/SYSTEMML-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691852#comment-15691852
]
Matthias Boehm commented on SYSTEMML-1127:
------------------------------------------
Thanks for taking care of this issue [~fschueler]. Yes, this is indeed a bit
tricky. First of all, for REMOTE_MR, each task has its own buffer pool, while
for REMOTE_SPARK all cores per executor share the same (thread-safe) buffer
pool because we use one buffer pool per process. This is good and bad. The good
thing is that global reads (e.g., a dataset read by all workers, which is often
the case for hyper-parameter tuning) are performed only once. The bad thing is
that we need to synchronize some setup procedures. Furthermore, the individual
names of evicted matrices/frames already use a process-wide unique ID, so the
only problem is the creation of the cache directory. I would recommend
synchronizing the entire buffer pool setup, which includes (1) the creation of
the cache directory and (2) init caching. The subsequent append of task attempt
IDs can be removed, as this would modify a global static variable anyway.
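
The synchronized setup recommended above could look roughly like the sketch
below. This is only an illustration, not the actual SystemML code: the class
name BufferPoolSetup, the method setup(), the "cache" subdirectory name, and
the setup counter are all hypothetical. The point is that directory creation
and cache initialization run under a single class-level lock, so only the first
thread in the process does the work and all later threads reuse the result.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a process-wide, one-time buffer pool setup.
final class BufferPoolSetup {
    // Guarded by the class lock taken by the static synchronized method.
    private static boolean initialized = false;
    private static Path cacheDir = null;
    // Counts how often the setup body actually ran (for illustration only).
    private static final AtomicInteger setupCount = new AtomicInteger();

    private BufferPoolSetup() {}

    // All threads of the process call this; only the first one does the work.
    static synchronized Path setup(Path baseDir) {
        if (!initialized) {
            try {
                // (1) create the cache directory; createDirectories does not
                // fail if the directory already exists
                cacheDir = Files.createDirectories(baseDir.resolve("cache"));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // (2) init caching would go here (assumed placeholder)
            setupCount.incrementAndGet();
            initialized = true;
        }
        return cacheDir;
    }

    static int setupCount() {
        return setupCount.get();
    }
}
```

With this shape, no per-thread or per-task suffix on the directory name is
needed, because every thread in the process ends up with the same cache
directory.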
> Distributed unique IDs are not unique
> -------------------------------------
>
> Key: SYSTEMML-1127
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1127
> Project: SystemML
> Issue Type: Bug
> Components: ParFor
> Reporter: Felix Schüler
>
> When executing a Spark parfor, the SparkParforWorker throws an exception
> which states that the localtmpdir could not be created. This is due to the
> fact that multiple executors are running multithreaded on the same worker.
> The createDistributedUniqueID() method in IDHandler.java creates unique
> IDs only per pid and host, not per thread. This could potentially be solved
> by adding the threadID to the unique ID. The question is if every thread
> should have its own cache or if the logic should be changed so that the first
> creation will be successful and then the threads share one cache.