[ 
https://issues.apache.org/jira/browse/SYSTEMML-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567032#comment-15567032
 ] 

Mike Dusenberry commented on SYSTEMML-951:
------------------------------------------

[~mboehm7] Okay, I tried the latest patch.  A few thoughts:

1. There is a serialization issue due to {{PartitionPruningFunction}} not being 
serializable.  I can fix this with a simple {{implements Serializable}}.
2. Now that we have this patch, what process would you suggest to get the best 
*performance*, assuming I start with a {{DataFrame}} and use MLContext?  I.e.:
  * Should I pass in a {{DataFrame}} directly as a {{matrix}}, or as a 
{{frame}} and then convert to a {{matrix}} in DML?
  * Should I run one script that does the conversion from the {{DataFrame}} to 
a {{matrix}}, get that {{matrix}} out from MLContext as a {{Matrix}}, and then 
pass that {{Matrix}} into the algorithm script?  Or should I just pass the 
{{DataFrame}} into the algorithm script?  Assume that the algorithm script may 
need to be run several times with the same data as the math is changed.
  * Given that I have labels and features, should I first stuff these all 
together into a single vector per row of the {{DataFrame}}, then pass this 
{{DataFrame}} in and index out the labels and features column-wise into {{Y}} 
and {{X}} matrices within DML?  Or should I split the {{DataFrame}} into 
separate label and feature DataFrames each with the same corresponding 
"__INDEX" column, and then pass those in separately to MLContext?
  * If I do index labels and features out column-wise within DML, should I do 
that split prior to the iterative mini-batch loop, or within it?
  * Should I create my own row index ("__INDEX") column in the {{DataFrame}}, 
or should I just let SystemML assign a row index during conversions?
  * etc.
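
Regarding the fix in (1.), a minimal sketch of the idea; the class body below is 
hypothetical (the real {{PartitionPruningFunction}} lives in the SystemML 
runtime), but it shows why the change matters: Spark ships function objects to 
executors via Java serialization, so the class just needs 
{{implements Serializable}}:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashSet;

public class SerializableFix {

    // Hypothetical stand-in for PartitionPruningFunction: keeps the set of
    // partition ids to retain. Spark distributes such function objects to
    // executors via Java serialization, hence 'implements Serializable'.
    static class MyPruningFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        private final HashSet<Integer> _partitions;

        MyPruningFunction(HashSet<Integer> partitions) {
            _partitions = partitions;
        }

        // true iff the given partition should be kept
        boolean call(int partitionId) {
            return _partitions.contains(partitionId);
        }
    }

    // Round-trip an instance through Java serialization, as Spark would
    // when shipping the function to executors; without Serializable this
    // throws NotSerializableException.
    static MyPruningFunction roundTrip(MyPruningFunction fn) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(fn);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (MyPruningFunction) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashSet<Integer> keep = new HashSet<>();
        keep.add(0);
        keep.add(3);
        MyPruningFunction copy = roundTrip(new MyPruningFunction(keep));
        System.out.println(copy.call(3));  // true
        System.out.println(copy.call(1));  // false
    }
}
```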

In general, I think formulating a good set of best practices for top 
performance, and possibly expanding the engine/optimizer to cover some of these 
cases, could be really beneficial.

> Efficient spark right indexing via lookup
> -----------------------------------------
>
>                 Key: SYSTEMML-951
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-951
>             Project: SystemML
>          Issue Type: Task
>          Components: Runtime
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>         Attachments: mnist_softmax_v1.dml, mnist_softmax_v2.dml
>
>
> So far, all versions of the Spark right indexing instructions require a full 
> scan over the data set. In the case of existing partitioning (which happens 
> anyway for any external format - binary block conversion), such a full scan 
> is unnecessary if we're only interested in a small subset of the data. This 
> task adds an efficient right indexing operation via 'rdd lookups', which 
> accesses at most <num_lookup> partitions given an existing hash partitioning. 
> cc [[email protected]]  
> In detail, this task covers the following improvements for spark matrix right 
> indexing. Frames are not covered here because they allow variable-length 
> blocks. Also, note that it is important to differentiate between in-core and 
> out-of-core matrices: for in-core matrices (i.e., matrices that fit in 
> deserialized form into aggregated memory), the full scan is actually not 
> problematic as the filter operation only scans keys without touching the 
> actual values.
> (1) Scan-based indexing w/o aggregation: So far, we apply aggregations to 
> merge partial blocks very conservatively. However, if the indexing range is 
> block-aligned (e.g., the dimension starts at a block boundary, or the range 
> lies within a single block), this is unnecessary. This alone led to a 2x 
> improvement for indexing row batches out of an in-core matrix.
> (2) Single-block lookup: If the indexing range covers a subrange of a single 
> block, we directly perform a lookup. On in-core matrices this gives a minor 
> improvement (but does not hurt), while on out-of-core matrices the 
> improvement is huge in the case of an existing partitioner, as we only have 
> to scan a single partition instead of the entire data.
> (3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for a 
> list of keys. So the next best option is a data-query join (in the case of an 
> existing partitioner) with {{data.join(filter).map()}}, which works very well 
> for in-core data sets but, for out-of-core data sets, unfortunately does not 
> exploit the potential for partition pruning and thus reads the entire data. I 
> also experimented with a custom multi-block lookup that runs multiple lookups 
> in a multi-threaded fashion - this gave the expected pruning but was very 
> ugly due to an unbounded number of jobs. 
> In conclusion, I'll create a patch for scenarios (1) and (2), while scenario 
> (3) requires some more thought and is postponed until after the 0.11 release. 
> One idea would be to create a custom RDD that implements {{lookup(List<T> 
> keys)}} by constructing a pruned set of input partitions via 
> {{partitioner.getPartition(key)}}. cc [~freiss] [~niketanpansare] [~reinwald]
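
The custom-RDD idea in the last paragraph of the quoted description can be 
illustrated without Spark: given hash partitioning, 
{{partitioner.getPartition(key)}} tells us exactly which partitions a list of 
block keys can reside in, so a multi-block lookup only needs to scan that 
pruned set. Below is a self-contained sketch in plain Java that mimics Spark's 
{{HashPartitioner}} assignment (non-negative {{hashCode() % numPartitions}}); 
it is an illustration of the pruning logic, not the actual SystemML code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrunedLookup {

    // Mimics Spark's HashPartitioner: key.hashCode() mod numPartitions,
    // forced into the non-negative range.
    static int getPartition(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return (mod < 0) ? mod + numPartitions : mod;
    }

    // Compute the pruned set of partitions touched by the requested block
    // keys; a multi-block lookup would scan only these partitions instead
    // of the entire data set.
    static Set<Integer> prunedPartitions(List<Long> blockKeys, int numPartitions) {
        Set<Integer> parts = new HashSet<>();
        for (Long key : blockKeys)
            parts.add(getPartition(key, numPartitions));
        return parts;
    }

    public static void main(String[] args) {
        // e.g., a row-batch lookup touching 3 block keys out of a matrix
        // hash-partitioned into 100 partitions
        List<Long> keys = java.util.Arrays.asList(7L, 8L, 9L);
        Set<Integer> parts = prunedPartitions(keys, 100);
        System.out.println("partitions to scan: " + parts + " of 100");
    }
}
```

With an existing partitioner, the number of partitions read is bounded by the 
number of requested keys (<num_lookup> in the description above) rather than 
the total partition count.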



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)