[ https://issues.apache.org/jira/browse/SYSTEMML-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569582#comment-15569582 ]
Mike Dusenberry commented on SYSTEMML-951: ------------------------------------------ Alright, I figured out what the issue was. In {{isMultiBlockLookup}} we check that the index range is block aligned with {{OptimizerUtils.isIndexingRangeBlockAligned}}. In that method, we aim to grab the rows & cols per block. However, instead we end up grabbing the number of row blocks & column blocks on [line 896 | https://github.com/apache/incubator-systemml/blob/80b463345030e0b3cd22cdeadc12b1f425dfec4d/src/main/java/org/apache/sysml/hops/OptimizerUtils.java#L896-L896]. I have 3.74M rows x 65536 columns so once I try to index {{X\[3701:3750,\]}}, the {{brlen}} would be erroneously set to {{3745}} (instead of 1000) and {{bclen}} to {{65}} (instead of 1000), and thus, the {{(rl-1)/brlen == (ru-1)/brlen && (cl-1)%bclen == 0}} check fails. The ways bugs express themselves are always interesting! I'll push the fix soon. > Efficient spark right indexing via lookup > ----------------------------------------- > > Key: SYSTEMML-951 > URL: https://issues.apache.org/jira/browse/SYSTEMML-951 > Project: SystemML > Issue Type: Task > Components: Runtime > Reporter: Matthias Boehm > Assignee: Matthias Boehm > Attachments: mnist_softmax_v1.dml, mnist_softmax_v2.dml > > > So far all versions of spark right indexing instructions require a full scan > over the data set. In case of existing partitioning (which anyway happens for > any external format - binary block conversion) such a full scan is > unnecessary if we're only interested in a small subset of the data. This task > adds an efficient right indexing operation via 'rdd lookups' which access at > most <num_lookup> partitions given existing hash partitioning. > cc [~mwdus...@us.ibm.com] > In detail, this task covers the following improvements for spark matrix right > indexing. Frames are not covered here because they allow variable-length > blocks. Also, note that it is important to differentiate between in-core and > out-of-core matrices: for in-core matrices (i.e., matrices that fit in > deserialized form into aggregated memory), the full scan is actually not > problematic as the filter operation only scans keys without touching the > actual values. > (1) Scan-based indexing w/o aggregation: So far, we apply aggregations to > merge partial blocks very conservatively. However, if the indexing range is > block aligned (e.g., dimension start at block boundary or range within single > block) this is unnecessary. This alone led to a 2x improvement for indexing > row batches out of an in-core matrix. > (2) Single-block lookup: If the indexing range covers a subrange of a single > block, we directly perform a lookup. On in-core matrices this gives a minor > improvement (but does not hurt) while on out-of-core matrices, the > improvement is huge in case of existing partitioner as we only have to scan a > single partition instead of the entire data. > (3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for a > list of keys. So the next best option is a data-query join (in case of > existing partitioner) with {{data.join(filter).map()}}, which works very well > for in-core data sets, but for out-of-core datasets, unfortunately, does not > exploit the potential for partition pruning and thus reads the entire data. I > also experimented with a custom multi-block lookup that runs multiple lookups > in a multi-threaded fashion - this gave the expected pruning but was very > ugly due to an unbounded number of jobs. > In conclusion, I'll create a patch for scenarios (1) and (2), while scenario > (3) requires some more thoughts and is postponed after the 0.11 release. One > idea would be to create a custom RDD that implements {{lookup(List<T> keys)}} > by constructing a pruned set of input partitions via > {{partitioner.getPartition(key)}}. cc [~freiss] [~niketanpansare] [~reinwald] -- This message was sent by Atlassian JIRA (v6.3.4#6332)