[VOTE] Apache SystemML 0.14.0-incubating (RC1)

2017-04-03 Thread Arvind Surve
Please vote on releasing the following candidate as Apache SystemML version 
0.14.0-incubating!

The vote is open for at least 72 hours and passes if a majority of at least 3 
+1 PMC votes are cast.

[ ] +1 Release this package as Apache SystemML 0.14.0-incubating
[ ] -1 Do not release this package because ...

To learn more about Apache SystemML, please see http://systemml.apache.org/

The tag to be voted on is v0.14.0-incubating-rc1 
(c8b71564edd99ef4dc8c4ff52ca75986ca617db4)

https://github.com/apache/incubator-systemml/commit/c8b71564edd99ef4dc8c4ff52ca75986ca617db4

The release artifacts can be found at:
https://dist.apache.org/repos/dist/dev/incubator/systemml/0.14.0-incubating-rc1/

The Maven release artifacts, including signatures, digests, etc., can be found 
at:

https://repository.apache.org/content/repositories/orgapachesystemml-1018/org/apache/systemml/systemml/0.14.0-incubating/

======================================
== Apache Incubator release policy ==
======================================
Please find below the guide to release management during incubation:
http://incubator.apache.org/guides/releasemanagement.html

========================================
== How can I help test this release? ==
========================================
If you are a SystemML user, you can help us test this release by taking an
existing algorithm or workload, running it on this release candidate, and
reporting any regressions.
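
For example, a quick smoke test from Java might look like the sketch below. This
assumes the Spark 2.x MLContext API shipped with this candidate; the inline DML is
only a placeholder, not one of the bundled algorithm scripts, so substitute a real
algorithm script for meaningful regression testing.

import org.apache.spark.sql.SparkSession;
import org.apache.sysml.api.mlcontext.MLContext;
import org.apache.sysml.api.mlcontext.Script;
import org.apache.sysml.api.mlcontext.ScriptFactory;

public class RcSmokeTest {
  public static void main(String[] args) {
    // Spark session backing the SystemML MLContext.
    SparkSession spark = SparkSession.builder().appName("systemml-0.14-rc1-smoke").getOrCreate();
    MLContext ml = new MLContext(spark.sparkContext());

    // Placeholder DML; swap in an existing algorithm script (e.g. via ScriptFactory.dmlFromFile).
    Script script = ScriptFactory.dml("x = sum(seq(1, 100)); print(x);");
    ml.execute(script);

    spark.stop();
  }
}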


==============================================
== What justifies a -1 vote for this release? ==
==============================================
-1 votes should only occur for significant stop-ship bugs or legal-related 
issues (e.g., wrong license, missing license headers, etc.). Minor bugs or 
regressions should not block this release.
 -Arvind
 Arvind Surve | Spark Technology Center  | http://www.spark.tc/

Re: Java compiler for code generation

2017-04-03 Thread Luciano Resende
Is a dependency really an issue today, particularly when we bundle the
dependencies with the SystemML jar? I'd rather include a dependency than
reinvent the wheel and write the same code again (unless the dependency
code is flawed).

Also, +1 for continuously reviewing / updating / trimming dependencies.

On Mon, Apr 3, 2017 at 11:04 AM,  wrote:

> Using Janino sounds like a great idea.  As for the footprint size for
> Java-only execution modes, it might make sense to do an audit of our
> current dependencies to see if anything can be removed to make up for the
> additional amount.  Then we could just use it in all scenarios without
> worry.
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Mar 31, 2017, at 9:25 PM, Matthias Boehm 
> wrote:
> >
> > that is a good question. Yes, if we want to enable code generation in
> such
> > a scenario it would also need Janino, which increases our footprint by
> > roughly 0.6MB.
> >
> > Btw, Janino fits much better into such an in-memory deployment because it
> > compiles classes in-memory without the need to write class files into a
> > local working directory. The same could be done for
> > javax.tools.JavaCompiler, but it would require a custom in-memory
> > JavaFileManager.
> >
> > Regards,
> > Matthias
> >
> > On Fri, Mar 31, 2017 at 9:14 PM, Berthold Reinwald 
> > wrote:
> >
> >> Sounds like a good idea.
> >>
> >> Wrt codegen, in a pure Java scoring environment w/o Spark and Hadoop,
> will
> >> the dependency on Janino still be there (that question applies to JDK as
> >> well), and what is the footprint?
> >>
> >> Regards,
> >> Berthold Reinwald
> >> IBM Almaden Research Center
> >> office: (408) 927 2208; T/L: 457 2208
> >> e-mail: reinw...@us.ibm.com
> >>
> >>
> >>
> >> From:   Matthias Boehm 
> >> To: dev@systemml.incubator.apache.org
> >> Date:   03/31/2017 08:17 PM
> >> Subject:Java compiler for code generation
> >>
> >>
> >>
> >> Hi all,
> >>
> >> Currently, our new code generator for operator fusion uses the
> >> programmatic javax.tools.JavaCompiler, which is Java's standard API for
> >> compilation. Despite a plan cache that mitigates unnecessary compilation
> >> and recompilation overheads, we still see significant end-to-end
> overhead
> >> especially for small input data.
> >>
> >> Moving forward, I'd like to switch to Janino
> >> (org.codehaus.janino.SimpleCompiler), which is a fast in-memory Java
> >> compiler with restricted language support. The advantages are
> >>
> >> (1) Reduced compilation overhead: On end-to-end scenarios for L2SVM,
> GLM,
> >> and MLogreg, Janino improved total javac compilation time from 2.039 to
> >> 0.195 (14 operators), from 8.134 to 0.411 (82 operators), and from 4.854
> >> to
> >> 0.283 (46 operators), respectively. At the same time, there was no
> >> measurable impact on runtime efficiency, but even slightly reduced JIT
> >> compilation overhead.
> >>
> >> (2) Removed JDK requirement: Using the standard javax.tools.JavaCompiler
> >> requires the existence of a JDK, while Janino only requires a JRE, which
> >> makes it easier to apply code generation by default.
> >>
> >> However, I'm raising this here as Janino would add another explicit
> >> dependency (with BSD license). Fortunately, Spark also uses Janino for
> >> whole-stage-codegen. So we should be able to mark Janino as a provided
> >> library. The only issue is a pure Hadoop environment, where we still
> want
> >> to use code generation for CP operations. To simplify the build, I could
> >> imagine using the javax.tools.JavaCompiler for Hadoop execution types,
> but
> >> Janino by default.
> >>
> >> If you have any concerns, please let me know by Monday; otherwise I'd
> like
> >> to push this change into our upcoming 0.14 release.
> >>
> >>
> >> Regards,
> >> Matthias
> >>
> >>
> >>
> >>
> >>
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
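
As a concrete illustration of the in-memory compilation Matthias describes above, a
minimal Janino sketch is shown below; the generated class and method are placeholders,
not SystemML's actual fused operators.

import org.codehaus.janino.SimpleCompiler;

public class JaninoSketch {
  public static void main(String[] args) throws Exception {
    // Compile generated source entirely in memory; no class files are written to a working directory.
    String src = "public class FusedOp { public static double apply(double a, double b) { return a * b + 1; } }";
    SimpleCompiler compiler = new SimpleCompiler();
    compiler.cook(src);

    // Load the class from the compiler's in-memory class loader and invoke the generated method.
    Class<?> clazz = compiler.getClassLoader().loadClass("FusedOp");
    Object result = clazz.getMethod("apply", double.class, double.class).invoke(null, 2.0, 3.0);
    System.out.println(result); // prints 7.0
  }
}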


Re: Java compiler for code generation

2017-04-03 Thread dusenberrymw
Using Janino sounds like a great idea.  As for the footprint size for Java-only 
execution modes, it might make sense to do an audit of our current dependencies 
to see if anything can be removed to make up for the additional amount.  Then 
we could just use it in all scenarios without worry.

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Mar 31, 2017, at 9:25 PM, Matthias Boehm  wrote:
> 
> that is a good question. Yes, if we want to enable code generation in such
> a scenario it would also need Janino, which increases our footprint by
> roughly 0.6MB.
> 
> Btw, Janino fits much better into such an in-memory deployment because it
> compiles classes in-memory without the need to write class files into a
> local working directory. The same could be done for
> javax.tools.JavaCompiler, but it would require a custom in-memory
> JavaFileManager.
> 
> Regards,
> Matthias
> 
> On Fri, Mar 31, 2017 at 9:14 PM, Berthold Reinwald 
> wrote:
> 
>> Sounds like a good idea.
>> 
>> Wrt codegen, in a pure Java scoring environment w/o Spark and Hadoop, will
>> the dependency on Janino still be there (that question applies to JDK as
>> well), and what is the footprint?
>> 
>> Regards,
>> Berthold Reinwald
>> IBM Almaden Research Center
>> office: (408) 927 2208; T/L: 457 2208
>> e-mail: reinw...@us.ibm.com
>> 
>> 
>> 
>> From:   Matthias Boehm 
>> To: dev@systemml.incubator.apache.org
>> Date:   03/31/2017 08:17 PM
>> Subject:Java compiler for code generation
>> 
>> 
>> 
>> Hi all,
>> 
>> Currently, our new code generator for operator fusion uses the
>> programmatic javax.tools.JavaCompiler, which is Java's standard API for
>> compilation. Despite a plan cache that mitigates unnecessary compilation
>> and recompilation overheads, we still see significant end-to-end overhead
>> especially for small input data.
>> 
>> Moving forward, I'd like to switch to Janino
>> (org.codehaus.janino.SimpleCompiler), which is a fast in-memory Java
>> compiler with restricted language support. The advantages are
>> 
>> (1) Reduced compilation overhead: On end-to-end scenarios for L2SVM, GLM,
>> and MLogreg, Janino improved total javac compilation time from 2.039 to
>> 0.195 (14 operators), from 8.134 to 0.411 (82 operators), and from 4.854
>> to
>> 0.283 (46 operators), respectively. At the same time, there was no
>> measurable impact on runtime efficiency, but even slightly reduced JIT
>> compilation overhead.
>> 
>> (2) Removed JDK requirement: Using the standard javax.tools.JavaCompiler
>> requires the existence of a JDK, while Janino only requires a JRE, which
>> makes it easier to apply code generation by default.
>> 
>> However, I'm raising this here as Janino would add another explicit
>> dependency (with BSD license). Fortunately, Spark also uses Janino for
>> whole-stage-codegen. So we should be able to mark Janino as a provided
>> library. The only issue is a pure Hadoop environment, where we still want
>> to use code generation for CP operations. To simplify the build, I could
>> imagine using the javax.tools.JavaCompiler for Hadoop execution types, but
>> Janino by default.
>> 
>> If you have any concerns, please let me know by Monday; otherwise I'd like
>> to push this change into our upcoming 0.14 release.
>> 
>> 
>> Regards,
>> Matthias
>> 
>> 
>> 
>> 
>> 


Re: GSoc 2017

2017-04-03 Thread Krishna Kalyan
Hello Nakul
Thank you so much for your feedback, especially during the weekend. I have
submitted the proposal; the final version is attached below.

Cheers,
Krishna

On Mon, Apr 3, 2017 at 4:37 AM, Nakul Jindal  wrote:

> Your project proposal looks great. Be sure to submit a final project
> proposal wherever it is you need to.
>
> Thanks,
> Nakul
>
> On Apr 2, 2017, at 4:08 PM, Krishna Kalyan 
> wrote:
>
> Hello All,
> I have updated the proposal. I hope this one is better. Please share your
> feedback.
>
> https://docs.google.com/document/d/1DKWZTWvrvs73GYa1q3XEN5GF
> o8ALGjLH2DrIfRsJksA/edit#
>
> FYI: Student Application Deadline is April 3, 16:00 UTC.
>
>
> Regards,
> Krishna
>
> On Sun, Apr 2, 2017 at 2:39 PM, Krishna Kalyan 
> wrote:
>
>> Hello Nakul,
>> My comments in *Italics* below.
>>
>> On Sat, Apr 1, 2017 at 11:27 PM, Nakul Jindal  wrote:
>>
>>> Hi Krishna,
>>>
>>> Here are some questions/remarks I have about parts of your proposal:
>>>
>>> In the section titled Summary -
>>>
>>> "The systematic evaluation of performance can be measured with
>>> performance tests and micro-benchmarks"
>>> We currently do not have any micro benchmarks. Do you plan on adding
>>> any? (It would be awesome, but remember to keep the number of tasks
>>> reasonable given the time frame and your familiarity with the project)
>>>
>> *- Removed micro-benchmarks from the proposal.*
>>
>>>
>>> Your summary section feels like it's generally applicable for performance
>>> testing on any project, which is good. However, when it comes to talking
>>> about what you'd actually be doing, I see - " build a benchmark
>>> infrastructure and conduct experiments, that compare different choices in
>>> critical parts (sparsity thresholds, optimisation decisions, etc..)".
>>>
>> *-  I agree and have made these changes.*
>>
>> Going over each point:
>>>
>>> 1. "build a benchmark infrastructure" - ok, i guess this subsumes pretty
>>> much all the tasks involved
>>> 2. "conduct experiments" - sure, although I think you mean testing your
>>> benchmarking infrastructure, please correct me if this is not what you meant
>>>
>>>
>> 3. "that compare different choices in critical parts"
>>> a. "sparsity thresholds" - awesome. You'd need to figure out what
>>> SystemML already does and what to add.
>>> b. "optimization decisions" - could you provide an example or two of
>>> what exactly you mean by this. Do you mean to enable and/or disable certain
>>> optimizations and run the perf suite and also automate the process? or
>>> something else?
>>> c. "etc" - more detail would be nice here. It would be nice to know what
>>> exactly you are committing to.
>>> *- will add more details in this section *
>>>
>>> In the section titled Deliverables -
>>>
>>> You mention
>>> - "automation for all performance tests" - awesome! this is the primary
>>> task
>>> - "automatic scripts to test performance on a cloud provider" - this is
>>> great
>>> - "web dashboard" - awesome! this is a nice-to-have
>>>
>>> But before the "cloud provider" and "web dashboard" task, we'd like to
>>> robustly check for errors and record performance numbers and generate
>>> reports. (Tasks 2 - 6 on https://issues.apache.org/j
>>> ira/browse/SYSTEMML-1451). I see that you've mentioned some of these
>>> tasks in your "Project milestones" section as "Understand metrics to be
>>> captured like time, memory, errors". It'd be good to put them here as well.
>>>
>> *- Will add this information under Deliverables*
>>
>>>
>>> Remember, you might also need to change the way SystemML reports errors
>>> and performance numbers to complete your tasks. You, along with the
>>> currently active members of SystemML might need to change the algorithms
>>> being tested as well.
>>>
>> *- Sure will keep this in mind and will account for this in proposal. *
>>
>>>
>>> In the section titled "Project Milestones" -
>>> Your project timeline looks good, both the initial set of things to do before
>>> May 30 and the fact that you've set aside the final week as buffer. You
>>> have dug down into a week-by-week schedule, which is good. I have some
>>> suggestions, though:
>>>
>>> You need to
>>> T1. Understand what is happening now, try it out for yourself
>>>
>> *- Yes, I am following the documentation to simulate benchmarks on my
>> local system. *
>>
>> T2. You need to automate this process
>>> T3. You need to test that this automated process works as expected (and
>>> make it robust)
>>> T4. You need to add additional capabilities (like micro-benchmarks
>>> and/or parameterizing the tests and/or running it with sparse and dense
>>> sets)
>>>
>> *- I will account for T3 and T4 more explicitly in my proposal.*
>>
>>
>>> For each of the tasks that you mention in your deliverables, could you
>>> please think about how you'd spend each week doing either T1-3 for a
>>> deliverable that is now being done manually and T4 for one that is not
>>> being done at all right now?
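
As a rough illustration of the time/error capture discussed above, the sketch below
times an arbitrary command and appends one CSV row per run; the command line, file
names, and output format are hypothetical and not part of the existing performance
suite.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class PerfRunSketch {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Hypothetical usage: java PerfRunSketch results.csv spark-submit SystemML.jar -f L2SVM.dml ...
    String resultsCsv = args[0];
    String[] command = Arrays.copyOfRange(args, 1, args.length);

    // Time the run and capture its exit status (non-zero signals an error to investigate).
    long start = System.nanoTime();
    Process p = new ProcessBuilder(command).inheritIO().start();
    int exitCode = p.waitFor();
    double seconds = (System.nanoTime() - start) / 1e9;

    // Append one row per run: command, wall-clock seconds, exit status.
    String row = String.join(" ", command) + "," + String.format("%.3f", seconds) + "," + exitCode
        + System.lineSeparator();
    Files.write(Paths.get(resultsCsv), row.getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }
}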

Re: GSoc 2017

2017-04-03 Thread Nakul Jindal
Your project proposal looks great. Be sure to submit a final project proposal 
wherever it is you need to. 

Thanks,
Nakul

> On Apr 2, 2017, at 4:08 PM, Krishna Kalyan  wrote:
> 
> Hello All,
> I have updated the proposal. I hope this one is better. Please share your 
> feedback.
> 
> https://docs.google.com/document/d/1DKWZTWvrvs73GYa1q3XEN5GFo8ALGjLH2DrIfRsJksA/edit#
> 
> FYI: Student Application Deadline is April 3, 16:00 UTC.
> 
> 
> Regards,
> Krishna
> 
>> On Sun, Apr 2, 2017 at 2:39 PM, Krishna Kalyan  
>> wrote:
>> Hello Nakul,
>> My comments in Italics below.
>> 
>>> On Sat, Apr 1, 2017 at 11:27 PM, Nakul Jindal  wrote:
>>> Hi Krishna,
>>> 
>>> Here are some questions/remarks I have about parts of your proposal:
>>> 
>>> In the section titled Summary -
>>> 
>>> "The systematic evaluation of performance can be measured with performance 
>>> tests and micro-benchmarks"
>>> We currently do not have any micro benchmarks. Do you plan on adding any? 
>>> (It would be awesome, but remember to keep the number of tasks reasonable 
>>> given the time frame and your familiarity with the project)
>> - Removed micro-benchmarks from the proposal.
>>> 
>>> Your summary section feels like it's generally applicable for performance 
>>> testing on any project, which is good. However, when it comes to talking 
>>> about what you'd actually be doing, I see - " build a benchmark 
>>> infrastructure and conduct experiments, that compare different choices in 
>>> critical parts (sparsity thresholds, optimisation decisions, etc..)".
>> 
>> -  I agree and have made these changes.
>> 
>>> Going over each point:
>>> 
>>> 1. "build a benchmark infrastructure" - ok, i guess this subsumes pretty 
>>> much all the tasks involved 
>>> 2. "conduct experiments" - sure, although I think you mean testing your 
>>> benchmarking infrastructure, please correct me if this is not what you 
>>> meant 
>>> 3. "that compare different choices in critical parts"
>>>   a. "sparsity thresholds" - awesome. You'd need to figure out what 
>>> SystemML already does and what to add. 
>>>   b. "optimization decisions" - could you provide an example or two of what 
>>> exactly you mean by this. Do you mean to enable and/or disable certain 
>>> optimizations and run the perf suite and also automate the process? or 
>>> something else?
>>>   c. "etc" - more detail would be nice here. It would be nice to know what 
>>> exactly you are committing to.
>>> - will add more details in this section 
>>> 
>>> In the section titled Deliverables - 
>>> 
>>> You mention
>>> - "automation for all performance tests" - awesome! this is the primary task
>>> - "automatic scripts to test performance on a cloud provider" - this is 
>>> great
>>> - "web dashboard" - awesome! this is a nice-to-have
>>> 
>>> But before the "cloud provider" and "web dashboard" task, we'd like to 
>>> robustly check for errors and record performance numbers and generate 
>>> reports. (Tasks 2 - 6 on 
>>> https://issues.apache.org/jira/browse/SYSTEMML-1451). I see that you've 
>>> mentioned some of these tasks in your "Project milestones" section as 
>>> "Understand metrics to be captured like time, memory, errors". It'd be good 
>>> to put them here as well.
>> - Will add this information under Deliverables
>>> 
>>> Remember, you might also need to change the way SystemML reports errors and 
>>> performance numbers to complete your tasks. You, along with the currently 
>>> active members of SystemML might need to change the algorithms being tested 
>>> as well.
>> 
>> - Sure will keep this in mind and will account for this in proposal. 
>>> 
>>> In the section titled "Project Milestones" - 
>>> Your project timeline looks good, both the initial set of things to do before May 
>>> 30 and the fact that you've set aside the final week as buffer. You have 
>>> dug down into a week-by-week schedule, which is good. I have some 
>>> suggestions, though:
>>> 
>>> You need to 
>>> T1. Understand what is happening now, try it out for yourself
>> 
>> - Yes, I am following the documentation to simulate benchmarks on my local 
>> system. 
>> 
>>> T2. You need to automate this process
>>> T3. You need to test that this automated process works as expected (and 
>>> make it robust)
>>> T4. You need to add additional capabilities (like micro-benchmarks and/or 
>>> parameterizing the tests and/or running it with sparse and dense sets)
>> 
>> - I will account for T3 and T4 more explicitly in my proposal.
>>  
>>> For each of the tasks that you mention in your deliverables, could you 
>>> please think about how you'd spend each week doing either T1-3 for a 
>>> deliverable that is now being done manually and T4 for one that is not 
>>> being done at all right now?
>>> Please revisit some of the tasks on your timeline with this in mind.
>>> 
>>> I'd also ask that you set some deliverable(s) for phase 1 (due on June 26), 
>>> phase 2 (due on July 26)