[ https://issues.apache.org/jira/browse/SYSTEMML-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832648#comment-15832648 ]
Mike Dusenberry commented on SYSTEMML-951:
------------------------------------------
[~niketanpansare] Exploring the removal of the out-of-core requirement
discussed in the last few comments will help achieve consistent performance
for SYSTEMML-1160.
Basically, we have a fast {{PartitionPruningRDD}}-based path for filtering an
RDD by indexes that is faster than {{filter}} for out-of-core datasets.
However, the current checks only apply it if the RDD is larger than aggregate
memory (thus "out-of-core"). The (common) edge case is when that RDD doesn't
fit into memory because other RDDs are *also* present, even though it would
fit into memory by itself. In that case, we still use {{filter}}, even though
it is effectively an out-of-core case, and thus we get terrible performance.
It's worth (1) fully experimenting to see whether {{PartitionPruningRDD}} is
*always* faster than, or *at least as fast as*, {{filter}}, and if not, then
(2) checking for the edge case in which the global memory usage of all RDDs
would lead to an out-of-core situation. This is the discussion going on in
the last few comments between [~mboehm7] and me.
This is important for SYSTEMML-1160, as that approach will depend on
consistent performance of the right-indexing operation to begin with.
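For point (1), a micro-benchmark along the following lines could settle the
question; {{data}}, {{keys}}, and the timing helper are illustrative
assumptions, not existing SystemML code:
{code:scala}
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

// assumptions: `data` is a pair RDD with an existing hash partitioner,
// e.g. RDD[(Long, Array[Double])], and `keys` is a small Set[Long]
def time[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime(); val res = f
  println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
  res
}

def bench(data: RDD[(Long, Array[Double])], keys: Set[Long]): Unit = {
  // baseline: filter scans all partitions (reads everything when out-of-core)
  time("filter")(data.filter { case (k, _) => keys.contains(k) }.count())
  // pruned: restrict the scan to the partitions that can contain the keys
  time("pruning") {
    val parts = keys.map(data.partitioner.get.getPartition(_))
    PartitionPruningRDD.create(data, parts.contains)
      .filter { case (k, _) => keys.contains(k) }
      .count()
  }
}
{code}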
> Efficient spark right indexing via lookup
> -----------------------------------------
>
> Key: SYSTEMML-951
> URL: https://issues.apache.org/jira/browse/SYSTEMML-951
> Project: SystemML
> Issue Type: Task
> Components: Runtime
> Reporter: Matthias Boehm
> Assignee: Matthias Boehm
> Attachments: mnist_softmax_v1.dml, mnist_softmax_v2.dml
>
>
> So far, all versions of the Spark right-indexing instructions require a full
> scan over the data set. In case of existing partitioning (which happens
> anyway for any external format - binary block conversion), such a full scan
> is unnecessary if we're only interested in a small subset of the data. This
> task adds an efficient right-indexing operation via 'rdd lookups', which
> accesses at most <num_lookup> partitions given existing hash partitioning.
> cc [[email protected]]
> In detail, this task covers the following improvements for Spark matrix
> right indexing. Frames are not covered here because they allow
> variable-length blocks. Also, note that it is important to differentiate
> between in-core and out-of-core matrices: for in-core matrices (i.e.,
> matrices that fit in deserialized form into aggregate memory), the full scan
> is actually not problematic, as the filter operation only scans keys without
> touching the actual values.
> (1) Scan-based indexing w/o aggregation: So far, we applied aggregations to
> merge partial blocks very conservatively. However, if the indexing range is
> block-aligned (e.g., the range starts at a block boundary or lies within a
> single block), this aggregation is unnecessary. This alone led to a 2x
> improvement for indexing row batches out of an in-core matrix.
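> As a side note, a minimal sketch of the block-alignment test this relies on
> (names are illustrative, not the actual runtime code):
> {code:scala}
> // rl/ru: 1-based row range, blen: block size, nrows: number of matrix rows
> def isBlockAligned(rl: Long, ru: Long, blen: Int, nrows: Long): Boolean = {
>   val singleBlock  = (rl - 1) / blen == (ru - 1) / blen // range within one block
>   val startAligned = (rl - 1) % blen == 0               // starts at a block boundary
>   val endAligned   = ru % blen == 0 || ru == nrows      // ends at boundary or matrix end
>   singleBlock || (startAligned && endAligned)
> }
> // if true, indexing produces no partial blocks and the merge step can be skipped
> {code}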
> (2) Single-block lookup: If the indexing range covers a subrange of a single
> block, we directly perform a lookup. On in-core matrices this gives a minor
> improvement (but does not hurt), while on out-of-core matrices the
> improvement is huge in case of an existing partitioner, as we only have to
> scan a single partition instead of the entire data set.
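> For illustration, this is essentially Spark's standard single-key lookup
> (generic types below; in SystemML the key would be the block index):
> {code:scala}
> import scala.reflect.ClassTag
> import org.apache.spark.rdd.RDD
>
> // single-block case: the indexing range maps to exactly one block key;
> // with an existing partitioner, lookup() scans only the single partition
> // that can hold the key instead of all partitions
> def singleBlockLookup[K: ClassTag, V: ClassTag](data: RDD[(K, V)], key: K): Seq[V] =
>   data.lookup(key)
> {code}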
> (3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for
> a list of keys. The next best option is a data-query join (in case of an
> existing partitioner) with {{data.join(filter).map()}}, which works very
> well for in-core data sets but, unfortunately, does not exploit the
> potential for partition pruning on out-of-core datasets and thus reads the
> entire data. I also experimented with a custom multi-block lookup that runs
> multiple lookups in a multi-threaded fashion; this gave the expected pruning
> but was very ugly due to an unbounded number of jobs.
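> For reference, the join-based variant looks roughly like this (illustrative
> types and names; the actual instruction code differs):
> {code:scala}
> import org.apache.spark.rdd.RDD
>
> // keys: the block indexes covered by the indexing range
> def joinLookup(data: RDD[(Long, Array[Double])],
>                keys: Set[Long]): RDD[(Long, Array[Double])] = {
>   val filterRdd = data.sparkContext.parallelize(keys.toSeq.map(k => (k, ())))
>     .partitionBy(data.partitioner.get) // co-locate with the (partitioned) data
>   // the join avoids shuffling `data`, but still scans all of its partitions
>   data.join(filterRdd).map { case (k, (block, _)) => (k, block) }
> }
> {code}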
> In conclusion, I'll create a patch for scenarios (1) and (2), while scenario
> (3) requires more thought and is postponed until after the 0.11 release. One
> idea would be to create a custom RDD that implements {{lookup(List<T> keys)}}
> by constructing a pruned set of input partitions via
> {{partitioner.getPartition(key)}}. cc [~freiss] [~niketanpansare] [~reinwald]
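> A rough sketch of that idea (an assumed helper, not an existing Spark or
> SystemML API):
> {code:scala}
> import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
>
> // multi-key lookup: prune the input partitions up front via the existing
> // partitioner, then filter only within the surviving partitions
> def lookupAll[K, V](data: RDD[(K, V)], keys: Set[K]): RDD[(K, V)] = {
>   val part = data.partitioner.getOrElse(
>     sys.error("multi-key lookup requires an existing partitioner"))
>   val candidateParts = keys.map(part.getPartition)
>   PartitionPruningRDD.create(data, candidateParts.contains)
>     .filter { case (k, _) => keys.contains(k) }
> }
> {code}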
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)