[ 
https://issues.apache.org/jira/browse/SYSTEMML-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606281#comment-15606281
 ] 

Matthias Boehm commented on SYSTEMML-951:
-----------------------------------------

those are all good points and it's certainly useful to explore this further. 
Let me just add a couple of clarifications:

1) The conversion is actually happening on the "mapper-side" before the shuffle 
as wide rows are split into multiple partial blocks that are potentially sent 
to different nodes. However, the fetch (from remote nodes) and merge can be 
done lazily. 

2) The reason why we have the guard for out-of-core datasets is because in this 
scenario, the partition pruning will never hurt performance. It is important to 
remember that the in-memory filter just scans keys, which are for dense 
matrices, ~250,000x smaller than the actual dataset. We should simply run a 
series of experiments with different matrix shapes and index ranges that cover 
the full spectrum, i.e., only two blocks touched vs all blocks touched.

3) If (2) is non-conclusive, we might want to explore to take the RDD 
storageinfo in account (which provides various stats of in-memory size and 
on-disk size) to handle the case of multiple rdds that globally lead to 
out-of-core scenarios.

> Efficient spark right indexing via lookup
> -----------------------------------------
>
>                 Key: SYSTEMML-951
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-951
>             Project: SystemML
>          Issue Type: Task
>          Components: Runtime
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>         Attachments: mnist_softmax_v1.dml, mnist_softmax_v2.dml
>
>
> So far all versions of spark right indexing instructions require a full scan 
> over the data set. In case of existing partitioning (which anyway happens for 
> any external format - binary block conversion) such a full scan is 
> unnecessary if we're only interested in a small subset of the data. This task 
> adds an efficient right indexing operation via 'rdd lookups' which access at 
> most <num_lookup> partitions given existing hash partitioning. 
> cc [[email protected]]  
> In detail, this task covers the following improvements for spark matrix right 
> indexing. Frames are not covered here because they allow variable-length 
> blocks. Also, note that it is important to differentiate between in-core and 
> out-of-core matrices: for in-core matrices (i.e., matrices that fit in 
> deserialized form into aggregated memory), the full scan is actually not 
> problematic as the filter operation only scans keys without touching the 
> actual values.
> (1) Scan-based indexing w/o aggregation: So far, we apply aggregations to 
> merge partial blocks very conservatively. However, if the indexing range is 
> block aligned (e.g., dimension start at block boundary or range within single 
> block) this is unnecessary. This alone led to a 2x improvement for indexing 
> row batches out of an in-core matrix.
> (2) Single-block lookup: If the indexing range covers a subrange of a single 
> block, we directly perform a lookup. On in-core matrices this gives a minor 
> improvement (but does not hurt) while on out-of-core matrices, the 
> improvement is huge in case of existing partitioner as we only have to scan a 
> single partition instead of the entire data.
> (3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for a 
> list of keys. So the next best option is a data-query join (in case of 
> existing partitioner) with {{data.join(filter).map()}}, which works very well 
> for in-core data sets, but for out-of-core datasets, unfortunately, does not 
> exploit the potential for partition pruning and thus reads the entire data. I 
> also experimented with a custom multi-block lookup that runs multiple lookups 
> in a multi-threaded fashion - this gave the expected pruning but was very 
> ugly due to an unbounded number of jobs. 
> In conclusion, I'll create a patch for scenarios (1) and (2), while scenario 
> (3) requires some more thoughts and is postponed after the 0.11 release. One 
> idea would be to create a custom RDD that implements {{lookup(List<T> keys)}} 
> by constructing a pruned set of input partitions via 
> {{partitioner.getPartition(key)}}. cc [~freiss] [~niketanpansare] [~reinwald]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to