Jenkins build is back to normal : SystemML-DailyTest #967

2017-05-03 Thread jenkins
See 



Re: Standard code styles for DML and Java?

2017-05-03 Thread Deron Eriksson
Hi Matthias,

I like your suggestion of space indentation for DML scripts and tab
indentation for Java. I definitely support this and think this would be a
great way to go.
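
Just to make the convention concrete, a rough sketch of what this could look
like in Java (tabs for indentation, spaces for inline alignment; the class and
fields are made up for the example):

public class StyleExample {
	// members indented with a tab; the '=' signs are aligned with spaces
	private int rows    = 1000;
	private int columns = 1000;

	public long numCells() {
		return (long) rows * columns;
	}
}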

I also really like the idea of standardizing other aspects of our Java.

For inline formatting in Java, we might want to use // @formatter:off/on comments
(http://stackoverflow.com/questions/1820908/how-to-turn-off-the-eclipse-code-formatter-for-certain-sections-of-java-code),
since hand-tuned inline formatting can occasionally be very useful for readability
(such as in DMLScript.DMLOptions).
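
For illustration, a tiny sketch of how the off/on tags would be used (the
table contents below are made up for the example, not the actual DMLOptions
code; the tags also need to be enabled in the Eclipse formatter settings):

// @formatter:off
static final String[][] OPTIONS = {
	{ "-f",      "<filename>", "DML script file"    },
	{ "-exec",   "<mode>",     "execution mode"     },
	{ "-config", "<file>",     "configuration file" },
};
// @formatter:on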

Deron



On Tue, May 2, 2017 at 5:25 PM, Matthias Boehm 
wrote:

> Thanks Deron for centralizing this discussion, as this could help avoid
> redundant discussions spread across many individual JIRAs and PRs. Overall,
> I think it would be good to agree on individual style guides for DML and Java.
>
> I'm fine with using spaces for DML scripts because they are rarely changed
> once written. However, for Java, I'd strongly prefer tabs for indentation
> because tabs are (1) faster to navigate, and (2) allow configuring the dev
> environment according to individual preferences. For inline formatting, both
> should use spaces, though.
>
> Finally, I would recommend also covering common inconsistencies such as
> exception handling (catch-all vs. redundant error messages),
> hashCode/equals, unnecessary branches, etc.
>
> Regards,
> Matthias
>
>
> On 5/2/2017 7:15 PM, Deron Eriksson wrote:
>
>> Recently Matthias, Mike, and I discussed the issue of DML code style on
>> SYSTEMML-1406 (https://issues.apache.org/jira/browse/SYSTEMML-1406). We
>> also have an issue regarding Java code style on SYSTEMML-137 (
>> https://issues.apache.org/jira/browse/SYSTEMML-137).
>>
>> In the discussion on SYSTEMML-1406, it sounds like Matthias, Mike, and I
>> all see value in having a consistent style, although individual
>> preferences
>> differ. I would like to start a short discussion to see if we could apply
>> common style standards to our Java and DML files.
>>
>> WRT Java, perhaps the Google Style Guide
>> (https://google.github.io/styleguide/javaguide.html) would be a good place
>> to start:
>> https://github.com/google/styleguide/blob/gh-pages/eclipse-java-google-style.xml
>> https://github.com/google/styleguide/blob/gh-pages/intellij-java-google-style.xml
>>
>> We could use these Eclipse/IntelliJ Java style templates as a base and
>> modify them for any changes we agree upon (for example, tabs vs spaces for
>> indentation). We could then check these templates into our project so that
>> everyone who contributes to SystemML can apply the common style to code,
>> thus adding consistency to the project.
>>
>> WRT DML, the main issue we discussed was tabs vs spaces for indentation.
>>
>> Some options I see are:
>> 1) No official DML/Java styles
>> 2) DML/Java styles (use spaces for indents, with style guide as basis for
>> Java)
>> 3) DML/Java styles (use tabs for indents, with style guide as basis for
>> Java)
>>
>> Although I would prefer 2), I would be happy with 3) as an improvement over
>> our existing 1). We could also consider alternate options, such as spaces
>> for DML and tabs for Java.
>>
>> Thoughts?
>> Deron
>>
>>


-- 
Deron Eriksson
Spark Technology Center
http://www.spark.tc/


Re: Sparse Matrix Storage Consumption Issue

2017-05-03 Thread Matthias Boehm
To summarize, this was an issue with the selection of serialized
representations for large ultra-sparse matrices. Thanks again for sharing
your feedback with us.


1) In-memory representation: In CSR, every non-zero requires 12 bytes,
which is 240MB in your case. The overall memory consumption, however,
depends on the distribution of non-zeros: in CSR, each block with at
least one non-zero requires 4KB for row pointers. Assuming a uniform
distribution (the worst case), this gives us 80GB, which is likely the
problem here. An empty block would only have an overhead of 44 bytes, but
under the worst-case assumption there are no empty blocks left. We do not
use COO for checkpoints because it would slow down subsequent operations.
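
To make the arithmetic explicit, a small back-of-the-envelope sketch of
these two estimates (a 1000 x 1000 block size is assumed here):

public class CsrEstimate {
	public static void main(String[] args) {
		long rows = 20_000_000L, cols = 1_000_000L, nnz = 20_000_000L;
		long blkSize = 1000;
		long numBlocks = (rows / blkSize) * (cols / blkSize); // 2e7 blocks

		double nnzBytes = nnz * 12.0;            // 12 bytes per non-zero
		double rowPtrBytes = numBlocks * 4096.0; // ~4KB row pointers per non-empty block

		// ~240 MB for the non-zeros, but ~80 GB of row-pointer overhead in the
		// worst case where every block holds at least one non-zero
		System.out.printf("non-zeros:    %.0f MB%n", nnzBytes / 1e6);
		System.out.printf("row pointers: %.0f GB%n", rowPtrBytes / 1e9);
	}
}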


2) Serialized/on-disk representation: For sparse datasets that are
expected to exceed aggregate memory, we used to use a serialized
representation (with storage level MEMORY_AND_DISK_SER) which uses sparse,
ultra-sparse, or empty block representations. In this form, ultra-sparse
blocks require 9 + 16*nnz bytes and empty blocks require 9 bytes.
Therefore, with this representation selected, your dataset should
easily fit in aggregate memory. Also, note that chkpoint is only a
transformation that persists the RDD; the subsequent operation then
pulls the data into memory.
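
Continuing that sketch with the serialized block sizes (again assuming the
worst-case layout of roughly one non-zero per block):

public class SerializedEstimate {
	public static void main(String[] args) {
		long numBlocks = 20_000_000L;    // worst case: ~1 non-zero per block
		long bytesPerBlock = 9 + 16 * 1; // ultra-sparse: 9 bytes header + 16 bytes per non-zero
		double total = numBlocks * (double) bytesPerBlock;
		System.out.printf("serialized size: %.0f MB%n", total / 1e6); // ~500 MB
	}
}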


At a high level, this was a bug. We missed ultra-sparse representations
when introducing an improvement that stores sparse matrices, which are held
in MCSR format in memory, in CSR format on checkpoints, eliminating the need
for a serialized storage level. I just delivered a fix: such ultra-sparse
matrices are now stored in serialized form again, which should significantly
reduce the memory pressure.
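
For reference, the difference between the two checkpoint variants is roughly
the following (a hypothetical sketch against the plain Spark Java API, not
the actual SystemML checkpoint code):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.StorageLevels;

class CheckpointSketch {
	// blocksRDD and isUltraSparse are placeholders for this example
	static <K, V> void checkpoint(JavaPairRDD<K, V> blocksRDD, boolean isUltraSparse) {
		if (isUltraSparse) {
			// keep blocks in their compact serialized form to reduce memory pressure
			blocksRDD.persist(StorageLevels.MEMORY_AND_DISK_SER);
		} else {
			// keep deserialized blocks for fast access by subsequent operations
			blocksRDD.persist(StorageLevels.MEMORY_AND_DISK);
		}
	}
}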


Regards,
Matthias

On 5/3/2017 9:38 AM, Mingyang Wang wrote:

Hi all,

I was playing with a super sparse matrix FK, 2e7 by 1e6, with only one
non-zero value on each row, that is 2e7 non-zero values in total.

With driver memory of 1GB and executor memory of 100GB, I found the HOP
"Spark chkpoint", which is used to pin the FK matrix in memory, is really
expensive, as it invokes lots of disk operations.

FK is stored in binary format with 24 blocks, each block is ~45MB, and ~1GB
in total.

For example, with the script as

"""
FK = read($FK)
print("Sum of FK = " + sum(FK))
"""

things worked fine, and it took ~8s.

While with the script as

"""
FK = read($FK)
if (1 == 1) {}
print("Sum of FK = " + sum(FK))
"""

things changed. It took ~92s and I observed lots of disk spills from logs.
Based on the stats from Spark UI, it seems the materialized FK requires
>54GB storage and thus introduces disk operations.


I was wondering, is this the expected behavior of a super sparse matrix?


Regards,
Mingyang



Sparse Matrix Storage Consumption Issue

2017-05-03 Thread Mingyang Wang
Hi all,

I was playing with a super sparse matrix FK, 2e7 by 1e6, with only one
non-zero value on each row, that is 2e7 non-zero values in total.

With driver memory of 1GB and executor memory of 100GB, I found the HOP
"Spark chkpoint", which is used to pin the FK matrix in memory, is really
expensive, as it invokes lots of disk operations.

FK is stored in binary format with 24 blocks, each block is ~45MB, and ~1GB
in total.

For example, with the script as

"""
FK = read($FK)
print("Sum of FK = " + sum(FK))
"""

things worked fine, and it took ~8s.

While with the script as

"""
FK = read($FK)
if (1 == 1) {}
print("Sum of FK = " + sum(FK))
"""

things changed. It took ~92s and I observed lots of disk spills from logs.
Based on the stats from Spark UI, it seems the materialized FK requires
>54GB storage and thus introduces disk operations.

I was wondering, is this the expected behavior of a super sparse matrix?


Regards,
Mingyang


Build failed in Jenkins: SystemML-DailyTest #966

2017-05-03 Thread jenkins
See 

Changes:

[npansar] [MINOR] Updated documentation and improved log messages

[npansar] [MINOR] Show native library paths only when log4j is set debug or 
lower

--
[...truncated 29200 lines...]
17/05/03 03:56:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 
(TID 0) in 389 ms on localhost (executor driver) (1/1)
17/05/03 03:56:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose 
tasks have all completed, from pool default
17/05/03 03:56:27 INFO scheduler.DAGScheduler: ShuffleMapStage 1 
(parallelizePairs at SparkExecutionContext.java:706) finished in 0.266 s
17/05/03 03:56:27 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/05/03 03:56:27 INFO scheduler.DAGScheduler: running: Set(ShuffleMapStage 0)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 2)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: failed: Set()
17/05/03 03:56:27 INFO scheduler.DAGScheduler: ShuffleMapStage 0 
(parallelizePairs at SparkExecutionContext.java:706) finished in 0.431 s
17/05/03 03:56:27 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/05/03 03:56:27 INFO scheduler.DAGScheduler: running: Set()
17/05/03 03:56:27 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 2)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: failed: Set()
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
(MapPartitionsRDD[5] at mapValues at BinarySPInstruction.java:117), which has 
no missing parents
17/05/03 03:56:27 INFO memory.MemoryStore: Block broadcast_2 stored as values 
in memory (estimated size 4.2 KB, free 1033.8 MB)
17/05/03 03:56:27 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
bytes in memory (estimated size 2.3 KB, free 1033.8 MB)
17/05/03 03:56:27 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
memory on 169.54.146.43:55160 (size: 2.3 KB, free: 1033.8 MB)
17/05/03 03:56:27 INFO spark.SparkContext: Created broadcast 2 from broadcast 
at DAGScheduler.scala:996
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from 
ResultStage 2 (MapPartitionsRDD[5] at mapValues at BinarySPInstruction.java:117)
17/05/03 03:56:27 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 
tasks
17/05/03 03:56:27 INFO scheduler.FairSchedulableBuilder: Added task set 
TaskSet_2.0 tasks to pool default
17/05/03 03:56:27 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 
(TID 2, localhost, executor driver, partition 0, PROCESS_LOCAL, 5813 bytes)
17/05/03 03:56:27 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/05/03 03:56:27 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty 
blocks out of 1 blocks
17/05/03 03:56:27 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote 
fetches in 8 ms
17/05/03 03:56:27 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty 
blocks out of 1 blocks
17/05/03 03:56:27 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote 
fetches in 0 ms
17/05/03 03:56:27 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 
2). 2077 bytes result sent to driver
17/05/03 03:56:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 
(TID 2) in 104 ms on localhost (executor driver) (1/1)
17/05/03 03:56:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose 
tasks have all completed, from pool default
17/05/03 03:56:27 INFO scheduler.DAGScheduler: ResultStage 2 (collect at 
SparkExecutionContext.java:796) finished in 0.106 s
17/05/03 03:56:27 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 
169.54.146.43:55160 in memory (size: 1305.0 B, free: 1033.8 MB)
17/05/03 03:56:27 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 
169.54.146.43:55160 in memory (size: 2.3 KB, free: 1033.8 MB)
17/05/03 03:56:27 INFO spark.ContextCleaner: Cleaned shuffle 0
17/05/03 03:56:27 INFO spark.ContextCleaner: Cleaned shuffle 1
17/05/03 03:56:27 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 
169.54.146.43:55160 in memory (size: 1302.0 B, free: 1033.8 MB)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Registering RDD 7 
(parallelizePairs at SparkExecutionContext.java:706)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Registering RDD 6 
(parallelizePairs at SparkExecutionContext.java:706)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Got job 1 (collect at 
SparkExecutionContext.java:796) with 1 output partitions
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 
(collect at SparkExecutionContext.java:796)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Parents of final stage: 
List(ShuffleMapStage 3, ShuffleMapStage 4)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Missing parents: 
List(ShuffleMapStage 3, ShuffleMapStage 4)
17/05/03 03:56:27 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 3 
(ParallelCollectionRDD[7] at parallelizePairs at