[ https://issues.apache.org/jira/browse/SYSTEMML-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553722#comment-15553722 ]

Mike Dusenberry commented on SYSTEMML-951:
------------------------------------------

[~mboehm7] Excellent, this looks like a great plan, with a great start on the 
first two options.

Below, I have included a few versions of the script that I am using for a 
project.  In this project, there are (currently) 3.7M rows and 65K "columns" 
stored in vector form (with the goal of moving to ~200K columns for color 
images, plus additional rows).  The data resides in a set of Spark 
{{DataFrames}} (training and validation splits) that are loaded from Parquet 
files on disk and total ~2TB, which exceeds the aggregate cluster memory 
(~700GB).  These DataFrames essentially contain a "label" column and a 
"sample" {{vector}} column.  To integrate with SystemML, I am using the new 
PySpark MLContext API.  In all cases, I am starting with a simple initial 
algorithm (a softmax classifier), with the intent to swap in a larger 
convolutional neural net once the time-based performance is adequate.  For 
both algorithms, the "plumbing" of getting data into and out of the algorithm 
functions remains the same.  Assume all data has already been randomized 
beforehand.
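
For concreteness, the plumbing looks roughly like this (a minimal sketch, 
assuming the Python MLContext API from the {{systemml}} package; the Parquet 
path and variable names are illustrative, and {{sc}} / {{sqlContext}} are as 
in a PySpark shell):

{code:python}
from systemml import MLContext, dml

ml = MLContext(sc)

# Each DataFrame has a "label" column and a "sample" vector column;
# the path is illustrative.
train_df = sqlContext.read.parquet("hdfs:/data/train.parquet")

# Bind the DataFrame to a DML variable; MLContext converts it on the way in.
script = (dml("""
    Y = data[, 1]              # label column
    X = data[, 2:ncol(data)]   # sample features
    """)
    .input(data=train_df)
    .output("X", "Y"))
results = ml.execute(script)
{code}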

Version 1:
* {{mnist_softmax_v1.dml}}
* Originally, I wanted to feed the DataFrame into a DML script as a SystemML 
{{matrix}}, split it into {{X}} and {{Y}} matrices, then pass those into a 
DML-bodied {{mnist_softmax_v1::train}} function for the given algorithm.  
Within the algorithm, I would loop through the data, pulling out mini-batches 
of {{batch_size}} rows for training, one mini-batch after the other (see the 
sketch after this list).
* Pros: Clean algorithm dealing only with {{X}} and {{Y}} matrices.
* Cons: Right-indexing for a mini-batch requires a full read & shuffle of the 
DataFrame, which takes a lot of time, plus the conversion cost of going from 
DataFrame to matrix.
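
Here is a rough sketch of that Version 1 inner loop (hypothetical DML, 
embedded in a Python string for the MLContext API; the gradient step is 
elided):

{code:python}
from systemml import dml

# Hypothetical sketch of the mini-batch loop inside train(); the
# right-indexing X[beg:end,] is what triggers the full read & shuffle.
train_v1 = dml("""
    N = nrow(X)
    batch_size = 64
    iters = ceil(N / batch_size)
    for (i in 1:iters) {
      beg = (i - 1) * batch_size + 1
      end = min(N, beg + batch_size - 1)
      X_batch = X[beg:end, ]   # mini-batch via right-indexing
      Y_batch = Y[beg:end, ]
      # ... softmax gradient step on (X_batch, Y_batch) ...
    }
    """)
{code}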

Version 2:
* {{mnist_softmax_v2.dml}}
* To avoid the shuffle, in this version I wanted to feed the DataFrame in as a 
SystemML {{frame}}, pass that {{frame}} into a {{mnist_softmax_v2::train}} 
function, extract mini-batches from the frame within a loop, convert each 
mini-batch into a matrix, split the matrix into {{X}} and {{Y}} matrices, and 
train (see the sketch after this list).
* Pros: Lightweight conversion from {{DataFrame}} to {{frame}}.  No shuffle 
during right-indexing.
* Cons: Right-indexing for a mini-batch requires a full read of the DataFrame 
(~13 mins after the initial cache -- still out of core).  Slightly messier, 
since the algorithm deals with both {{frame}} and {{matrix}} types.
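
And a corresponding sketch of the Version 2 loop (again hypothetical DML; the 
per-batch {{as.matrix}} cast is the extra step):

{code:python}
from systemml import dml

# Hypothetical sketch: slice the frame (no shuffle), then cast each
# mini-batch to a matrix before training on it.
train_v2 = dml("""
    N = nrow(F)
    batch_size = 64
    iters = ceil(N / batch_size)
    for (i in 1:iters) {
      beg = (i - 1) * batch_size + 1
      end = min(N, beg + batch_size - 1)
      F_batch = F[beg:end, ]    # frame right-indexing
      B = as.matrix(F_batch)    # per-batch frame -> matrix conversion
      Y_batch = B[, 1]
      X_batch = B[, 2:ncol(B)]
      # ... softmax gradient step on (X_batch, Y_batch) ...
    }
    """)
{code}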

Version 3:
* Same as Version 2 above, but rather than passing the DataFrame in directly, 
I ran a prior batch job to save it as a SystemML {{frame}} on disk, then read 
from disk instead of from the DataFrame (see the sketch after this list).
* Cons: Same issues as above.  Requires a prior batch job to save to disk.
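
A sketch of the two jobs (hypothetical; the path is illustrative, and it 
assumes the DataFrame input is bound as a {{frame}}):

{code:python}
from systemml import MLContext, dml

ml = MLContext(sc)

# Prior batch job: persist the DataFrame as a SystemML frame on disk.
save_job = (dml("""
    write(F, "hdfs:/data/train_frame", format="binary")
    """)
    .input(F=train_df))
ml.execute(save_job)

# Training job: read the frame back from disk instead of the DataFrame.
train_v3 = dml("""
    F = read("hdfs:/data/train_frame", data_type="frame", format="binary")
    # ... same mini-batch loop as in Version 2 ...
    """)
{code}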

---

Overall, I think the multi-block lookups will help a ton with this situation, 
and I'm looking forward to them!
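
For reference, here is a rough PySpark sketch of the kind of pruned multi-key 
lookup described in scenario (3) below; the function name is illustrative, it 
assumes an existing partitioner on the pair RDD, and the integer keys stand 
in for block indexes:

{code:python}
def multi_block_lookup(rdd, keys):
    """Look up a list of keys, scanning only the partitions that can
    contain them (requires an existing partitioner on the RDD)."""
    key_set = set(keys)
    part = rdd.partitioner
    if part is None:
        # No partitioner: fall back to a full filter-scan.
        return rdd.filter(lambda kv: kv[0] in key_set).collect()
    # Prune to the partitions that the partitioner maps our keys to.
    targets = sorted({part.partitionFunc(k) % part.numPartitions
                      for k in key_set})
    # Run the filter only on those partitions (one job, pruned input).
    return rdd.context.runJob(
        rdd, lambda it: [kv for kv in it if kv[0] in key_set], targets)

# Illustrative usage with integer keys standing in for block indexes:
pairs = sc.parallelize([(i, i * i) for i in range(1000)]).partitionBy(8)
print(multi_block_lookup(pairs, [3, 500, 999]))
{code}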

> Efficient spark right indexing via lookup
> -----------------------------------------
>
>                 Key: SYSTEMML-951
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-951
>             Project: SystemML
>          Issue Type: Task
>          Components: Runtime
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>
> So far, all versions of the Spark right-indexing instructions require a full 
> scan over the data set. In the case of existing partitioning (which happens 
> anyway during any external-format-to-binary-block conversion), such a full 
> scan is unnecessary if we're only interested in a small subset of the data. 
> This task adds an efficient right-indexing operation via 'rdd lookups', 
> which accesses at most <num_lookup> partitions given an existing hash 
> partitioning. 
> cc [~mwdus...@us.ibm.com]  
> In detail, this task covers the following improvements for Spark matrix 
> right indexing. Frames are not covered here because they allow 
> variable-length blocks. Also, note that it is important to differentiate 
> between in-core and out-of-core matrices: for in-core matrices (i.e., 
> matrices that fit in deserialized form into aggregate memory), the full scan 
> is actually not problematic, as the filter operation only scans keys without 
> touching the actual values.
> (1) Scan-based indexing w/o aggregation: So far, we apply aggregations to 
> merge partial blocks very conservatively. However, if the indexing range is 
> block-aligned (e.g., the dimension starts at a block boundary, or the range 
> falls within a single block), this is unnecessary. This alone led to a 2x 
> improvement for indexing row batches out of an in-core matrix.
> (2) Single-block lookup: If the indexing range covers a subrange of a single 
> block, we directly perform a lookup. On in-core matrices this gives a minor 
> improvement (but does not hurt), while on out-of-core matrices the 
> improvement is huge in the case of an existing partitioner, as we only have 
> to scan a single partition instead of the entire data set.
> (3) Multi-block lookups: Unfortunately, Spark does not provide a lookup for 
> a list of keys. So the next best option is a data-query join (in the case of 
> an existing partitioner) with {{data.join(filter).map()}}, which works very 
> well for in-core data sets but, for out-of-core data sets, unfortunately 
> does not exploit the potential for partition pruning and thus reads the 
> entire data set. I also experimented with a custom multi-block lookup that 
> runs multiple lookups in a multi-threaded fashion -- this gave the expected 
> pruning but was very ugly due to an unbounded number of jobs. 
> In conclusion, I'll create a patch for scenarios (1) and (2), while scenario 
> (3) requires some more thought and is postponed until after the 0.11 
> release. One idea would be to create a custom RDD that implements 
> {{lookup(List<T> keys)}} by constructing a pruned set of input partitions 
> via {{partitioner.getPartition(key)}}. cc [~freiss] [~niketanpansare] 
> [~reinwald]


