[jira] [Resolved] (AMATERASU-45) PySpark: refactor the PySpark runtime into its own component

2019-06-08 Thread Nadav Har Tzvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMATERASU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Har Tzvi resolved AMATERASU-45.
-
Resolution: Fixed

> PySpark: refactor the PySpark runtime into its own component
> -
>
> Key: AMATERASU-45
> URL: https://issues.apache.org/jira/browse/AMATERASU-45
> Project: AMATERASU
>  Issue Type: Task
>Reporter: Yaniv Rodenski
>        Assignee: Nadav Har Tzvi
>Priority: Major
> Fix For: 0.2.1-incubating
>
>
> The PySpark Runtime should be extracted to its own component and made 
> available as a dependency for PySpark action developers



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] datasets input file in user's repository

2019-03-18 Thread Nadav Har Tzvi
I assumed as much.
I will implement a change to the leader to load the data from the env; it
will become part of the PR for AMATERASU-46.

Cheers,
Nadav



On Tue, 19 Mar 2019 at 01:36, Yaniv Rodenski  wrote:

> Hi Nadav,
>
> I think datasets should be per environment (for example, it is very common
> to use different databases for dev/test/prod), so I think that datasets as
> configurations in Amaterasu should sit under env.
>
> Cheers,
> Yaniv
>
> On Tue, Mar 19, 2019 at 5:13 AM Nadav Har Tzvi 
> wrote:
>
> > Hi,
> >
> > Just wanna open this up for discussion as it seems we somehow skipped
> this
> > point.
> > Basically, by now we pretty much have the new datasets APIs in place in
> the
> > Python SDK and in implementing frameworks. (amaterasu-pyspark,
> > amaterasu-pandas, amaterasu-python)
> > The only question left is regarding the way we get the datasets
> > definitions.
> > Currently, we still look up the datasets definitions in the maki file,
> > under the action's exports.
> > Do we intend to keep it that way? I assume not as I think that every
> action
> > needs access to all defined datasets.
> > In that case, how will the user submit datasets configuration? Is it
> > another file next to the maki.yaml? Is it a file that resides in the
> > environment, e.g. next to the env.yaml? Is it not even a file on its own
> > but a part of the env.yaml?
> > Ideas, anyone?
> >
> > Let's discuss this please!
> >
> > Cheers,
> > Nadav
> >
>
>
> --
> Yaniv Rodenski
>
> +61 477 778 405
> ya...@shinto.io
>


[jira] [Created] (AMATERASU-74) amaterasu-vagrant is broken

2019-03-04 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-74:
---

 Summary: amaterasu-vagrant is broken
 Key: AMATERASU-74
 URL: https://issues.apache.org/jira/browse/AMATERASU-74
 Project: AMATERASU
  Issue Type: Bug
Reporter: Nadav Har Tzvi


VirtualBox 6 basically killed the option to use symlinks in shared folders, 
making it impossible to use jGit to clone repositories into the "/ama/repo" 
directory.

We need to fix this somehow.

Also, I have a local Vagrant-AWS branch that was broken too, for a different 
reason: Amazon's ENA requirement for any AMIs that run on the standard 
instance types rendered all the CentOS AMIs unusable. I did manage to set up 
Amaterasu manually on an Amazon Linux AMI with minor changes, but it needs 
to be automated by Vagrant as well.

To sum up, amaterasu-vagrant is totally broken and we need to revive it.





Re: [DISCUSS] Review and discussions thread for Amaterasu-45 PR

2019-02-20 Thread Nadav Har Tzvi
Ok,
I will do an E2E test on a pyspark only flow tonight. I think that after
rebasing onto master with the changes from Amaterasu-8, I will have to make
some adjustments in regard to the integration between the leader and the
executor. If there are any, we should coordinate.

Cheers,
Nadav



On Wed, 20 Feb 2019 at 13:10, Arun Manivannan  wrote:

> Hi Nadav,
>
> Had a look at the datasets.yml and the bindings for it.  This aligns
> perfectly with the consensus that we had on the call that we had the day
> before.
>
> I'll have it implemented for the JVM by the end of this week.
>
> Cheers,
> Arun
>
> On Wed, Feb 20, 2019 at 3:12 AM Nadav Har Tzvi 
> wrote:
>
> > Hey everyone,
> >
> > I opened a draft PR for Amaterasu 45 (Python SDK, pyspark runtime, python
> > runtime). Do mind that this is still a WIP. I would like you to review
> this
> > PR from time to time as it evolves. The sooner I get inputs, the sooner
> > this feature will be out.
> >
> > PR: https://github.com/apache/incubator-amaterasu/pull/44
> >
> > Yaniv, Eyal, Arun. Please let's now define the final MVP of this feature
> so
> > I can know which reviews to defer and which to attend to before this PR
> is
> > closed and merged.
> >
> > Arun, this is most important, if I ended up implementing a design that is
> > vastly different than the scala runtime design, we should talk ASAP.
> >
> > I would like the community to give any feedback. Did I miss something?
> Did
> > I make some stupid mistake? Does the design look awkward? etc.
> >
> > Also a note on the python runtime. It is currently February 2019. As of
> > January 1st 2020 Python2 is EOL. Many third party Python packages already
> > dropped support entirely for Python2. Because of that, the entire SDK and
> > runtimes are implemented in Python >=3.4 (Python <= 3.3.x is EOL)
> > If you think this is problematic, please raise a flag.
> >
> > Cheers,
> > Nadav
> >
>


Re: [jira] [Created] (AMATERASU-52) Implement AmaContext.datastores

2019-01-28 Thread Nadav Har Tzvi
Hey Arun,

I kinda feel like the datastores yaml is somewhat obscure. I propose the
following structure.

Instead of

datasets:
  hive:
    - key: transactions
      uri: /user/somepath
      format: parquet
      database: transations_daily
      table: transx
    - key: second_transactions
      uri: /seconduser/somepath
      format: avro
      database: transations_monthly
      table: avro_table
  file:
    - key: users
      uri: s3://filestore
      format: parquet
      mode: overwrite

I would have

datasets:
  - key: transactions
    uri: /user/somepath
    format: parquet
    database: transations_daily
    table: transx
    type: hive
  - key: second_transactions
    uri: /seconduser/somepath
    format: avro
    database: transations_monthly
    table: avro_table
    type: hive
  - key: users
    uri: s3://filestore
    format: parquet
    mode: overwrite
    type: file

In my opinion it is more straightforward and uniform, and it is simpler
code-wise as well.
What do you think?
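To illustrate the code-wise simplicity, here is a minimal sketch (plain Python with illustrative names only — not the actual Amaterasu API) showing that the flat list needs just one pass to group entries back by type when that view is needed:

```python
from collections import defaultdict

# Illustrative subset of the proposed flat datasets list (keys as in the
# example above; this is not the real Amaterasu data model).
datasets = [
    {"key": "transactions", "uri": "/user/somepath", "type": "hive"},
    {"key": "second_transactions", "uri": "/seconduser/somepath", "type": "hive"},
    {"key": "users", "uri": "s3://filestore", "type": "file"},
]

def by_type(entries):
    """Group dataset definitions by their 'type' discriminator."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["type"]].append(entry)
    return dict(grouped)

grouped = by_type(datasets)
```

The uniform shape also means a single schema for all entries, with `type` as the only discriminator.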

Cheers,
Nadav



On Mon, 14 Jan 2019 at 00:57, Yaniv Rodenski  wrote:

> Hi Arun,
>
> I've added my comments to the PR, but good call, I agree @Nadav Har Tzvi
>  should at least review as you both need to
> maintain compatible APIs.
>
> Cheers,
> Yaniv
>
> On Sun, Jan 13, 2019 at 10:21 PM Arun Manivannan  wrote:
>
>> Hi Guy, Yaniv and Nadiv,
>>
>> This PR <https://github.com/apache/incubator-amaterasu/pull/39> just
>> captures part of the issue - the datasets.yaml, ConfigManager and the
>> testcases. The Integration with the AmaContext is yet to be done but I
>> would like to get your thoughts on the implementation.
>>
>> Guy - Would it be okay if you could help throw some light on the syntax
>> and
>> the idiomatic part of Kotlin itself. Newbie here.
>>
>> Cheers,
>> Arun
>>
>> On Fri, Oct 12, 2018 at 7:15 PM Yaniv Rodenski (JIRA) 
>> wrote:
>>
>> > Yaniv Rodenski created AMATERASU-52:
>> > ---
>> >
>> >  Summary: Implement AmaContext.datastores
>> >  Key: AMATERASU-52
>> >  URL:
>> https://issues.apache.org/jira/browse/AMATERASU-52
>> >  Project: AMATERASU
>> >   Issue Type: Task
>> > Reporter: Yaniv Rodenski
>> > Assignee: Arun Manivannan
>> >  Fix For: 0.2.1-incubating
>> >
>> >
>> > AmaContext.datastores should contain the data from datastores.yaml
>> >
>> >
>> >
>> >
>>
>
>
> --
> Yaniv Rodenski
>
> +61 477 778 405
> ya...@shinto.io
>
>


Re: [DISCUSS] podling report

2019-01-03 Thread Nadav Har Tzvi
Yeah, I give it a +1

Cheers,
Nadav



On Tue, 1 Jan 2019 at 17:12, Yaniv Rodenski  wrote:

> Hi All,
>
> I propose the following report to be submitted.
>
> Amaterasu
>
>
> Apache Amaterasu is a framework providing configuration management and
>
> deployment for Big Data Pipelines.
>
>
> It provides the following capabilities:
>
>
> Continuous integration tools to package pipelines and run tests.
>
> A repository to store those packaged applications: the applications
>
> repository.
>
> A repository to store the pipelines, and engine configuration (for
>
> instance, the location of the Spark master, etc.): per environment - the
>
> configuration repository.
>
> A dashboard to monitor the pipelines.
>
> A DSL and integration hooks allowing third parties to easily integrate.
>
>
> Amaterasu has been incubating since 2017-09.
>
>
> Three most important issues to address in the move towards graduation:
>
>
>   1. Grow user and contributor communities
>
>   2. Prepare documentation
>
>
> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
>
> aware of?
>
>
> How has the community developed since the last report?
>
>
>   Two new contributors have contributed code that has been merged. In
> addition, we are actively looking for more use cases and organizations to
> use Amaterasu.
>
>
> How has the project developed since the last report?
>
>
>   * 5 pull requests have been opened since the last report and 4 have been
> merged
>
>   * Since the last report 9 more issues have been created and 4 of them
> have been assigned
>
>
> Date of the last release:
>
>
>   12 July 2018
>
>
> When were the last committers or PMC members elected?
>
>
>   N/A
>
>
> Have your mentors been helpful and responsive or are things falling through
> the cracks? In the latter case, please list any open issues that need to be
> addressed.
>
>
>
>  N/A
>
>
> Signed-off-by:
>
>
>   [](amaterasu) Jean-Baptiste Onofré
>
>   [](amaterasu) Olivier Lamy
>
>   [](amaterasu) Davor Bonaci
>
> --
> Yaniv Rodenski
>


[DISCUSS] Common dependencies formats

2018-10-20 Thread Nadav Har Tzvi
Hey everyone,

At both conferences where I spoke (PyCon-IL and SDP), participants asked
why we don't use common formats to specify job dependencies (e.g.
requirements.txt in Python).

I want to bring this matter up for discussion. What can we do to better
align with the developer community? Is there any good reason for us to
stick with the current dependency specification format (through YAML)
other than technical ones?

Cheers,
Nadav


[DISCUSS] Dependencies resolution and action level dependencies

2018-10-20 Thread Nadav Har Tzvi
Hey everyone,

Yaniv and I were just discussing how to resolve dependencies in the new
frameworks architecture and integrate them with the concrete cluster
resource manager (Mesos/YARN).
We rolled with the idea of each runner (or base runner) performing the
dependency resolution on its own.
So for example, the Spark Scala runner would resolve the required JARs and
do whatever it needs to do with them (e.g. spark-submit --jars --packages
--repositories, etc.).
The base Python provider will resolve dependencies and dynamically generate
a requirements.txt file that will be deployed to the executor.
The handling of the requirements.txt file differs between concrete Python
runners. For example, a regular Python runner would simply run pip install,
while the pyspark runner would need to rearrange the dependencies in a way
that spark-submit accepts (
https://bytes.grubhub.com/managing-dependencies-and-artifacts-in-pyspark-7641aa89ddb7
sounds like a decent idea; comment if you have a better one, please).
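As a rough sketch of the base-Python-provider idea — turning resolved dependencies into requirements.txt content (the function name and input shape are assumptions, not the actual SDK):

```python
def render_requirements(deps):
    """Render resolved dependencies as requirements.txt content.

    deps: mapping of package name -> version specifier (None for "latest").
    Entries are sorted so the generated file is deterministic.
    """
    lines = []
    for name, version in sorted(deps.items()):
        lines.append(f"{name}{version}" if version else name)
    return "\n".join(lines) + "\n"

content = render_requirements({"pandas": "==0.23.4", "numpy": None})
# content is now "numpy\npandas==0.23.4\n"
```

The concrete runner would then decide what to do with the file: `pip install -r` for plain Python, or repackaging for spark-submit as discussed above.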

So far I hope it makes sense.

The next item I want to discuss is as follows:
In the new architecture, we do hierarchical runtime environment resolution,
starting at the top job level and drilling down to the action level,
outputting one unified environment configuration file that is deployed to
the executor.
I suggest doing the same with dependencies.
Currently, we only have job level dependencies. I suggest that we provide
action level dependencies and resolve them in exactly the same manner as we
resolve the environment.
There should be quite a few benefits to this approach:

   1. It will give the option to have different versions of the same
   package in different actions. This is especially important if you have 2+
   pipeline developers working independently; it would reduce the
   integration costs by letting each action be more self-contained.
   2. It should lower the startup time per action. The more dependencies
   you have, the longer it takes to resolve and install them. Actions will no
   longer get any unnecessary dependencies.
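The resolution itself can be sketched in a few lines; here the action level simply overrides the job level, mirroring the hierarchical environment resolution described above (names and the dict shape are illustrative, not the actual implementation):

```python
def resolve_action_deps(job_deps, action_deps):
    """Merge job-level and action-level dependencies.

    The action level wins on conflict, the same way the hierarchical env
    resolution drills down from the job level to the action level.
    """
    merged = dict(job_deps)      # start from the job-level defaults
    merged.update(action_deps)   # action-level entries override them
    return merged

resolved = resolve_action_deps(
    {"numpy": "==1.14.0", "pandas": "==0.23.0"},  # job level
    {"numpy": "==1.15.0"},                        # action-level override
)
```

This keeps the benefit from point 1 (per-action versions) while actions that declare nothing simply inherit the job-level set.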


What do you think? Does it make sense?

Cheers,
Nadav


[jira] [Created] (AMATERASU-43) Make the notifier available for job code

2018-07-07 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-43:
---

 Summary: Make the notifier available for job code
 Key: AMATERASU-43
 URL: https://issues.apache.org/jira/browse/AMATERASU-43
 Project: AMATERASU
  Issue Type: Improvement
Reporter: Nadav Har Tzvi








[jira] [Created] (AMATERASU-42) Amaterasu needs to respect Maven classifiers

2018-07-07 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-42:
---

 Summary: Amaterasu needs to respect Maven classifiers
 Key: AMATERASU-42
 URL: https://issues.apache.org/jira/browse/AMATERASU-42
 Project: AMATERASU
  Issue Type: Improvement
Reporter: Nadav Har Tzvi








[jira] [Updated] (AMATERASU-41) Support nested custom configurations in jobs environments in PySpark

2018-07-07 Thread Nadav Har Tzvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMATERASU-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Har Tzvi updated AMATERASU-41:

Summary: Support nested custom configurations in jobs environments in 
PySpark  (was: Support nested custom configurations in jobs environments)

> Support nested custom configurations in jobs environments in PySpark
> 
>
> Key: AMATERASU-41
> URL: https://issues.apache.org/jira/browse/AMATERASU-41
> Project: AMATERASU
>  Issue Type: Improvement
>        Reporter: Nadav Har Tzvi
>Priority: Major
>






[jira] [Created] (AMATERASU-41) Support nested custom configurations in jobs environments

2018-07-07 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-41:
---

 Summary: Support nested custom configurations in jobs environments
 Key: AMATERASU-41
 URL: https://issues.apache.org/jira/browse/AMATERASU-41
 Project: AMATERASU
  Issue Type: Improvement
Reporter: Nadav Har Tzvi








[jira] [Created] (AMATERASU-40) Change Spark-Scala runner to use spark-submit

2018-07-07 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-40:
---

 Summary: Change Spark-Scala runner to use spark-submit
 Key: AMATERASU-40
 URL: https://issues.apache.org/jira/browse/AMATERASU-40
 Project: AMATERASU
  Issue Type: Improvement
Reporter: Nadav Har Tzvi








[jira] [Created] (AMATERASU-39) Support SSH for Git

2018-07-07 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-39:
---

 Summary: Support SSH for Git
 Key: AMATERASU-39
 URL: https://issues.apache.org/jira/browse/AMATERASU-39
 Project: AMATERASU
  Issue Type: Task
Reporter: Nadav Har Tzvi








[jira] [Updated] (AMATERASU-38) Failure to interpret multiline Scala code blocks

2018-07-07 Thread Nadav Har Tzvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMATERASU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Har Tzvi updated AMATERASU-38:

Priority: Blocker  (was: Major)

> Failure to interpret multiline Scala code blocks
> 
>
> Key: AMATERASU-38
> URL: https://issues.apache.org/jira/browse/AMATERASU-38
> Project: AMATERASU
>  Issue Type: Bug
>        Reporter: Nadav Har Tzvi
>Priority: Blocker
>






[Discussion] Changing Amaterasu deployment strategy

2018-06-15 Thread Nadav Har Tzvi
Hey everyone,

While working on the new CLI (rather, fixing the damage caused by the
rebase ontop of RC3), I came across some issues and questions that, at
least in my opinion, should be addressed when we deploy Amaterasu.


   1. Configuration -
      1. *Status at 0.2.0-rc3:*
         The amaterasu.properties file is bundled in the deployment package
         (the .tar file) and resides in the same directory as the entry point
         (ama-start-mesos.sh, ama-start-yarn.sh).
      2. *Suggestion for 0.2.1:*
         1. The new CLI provides the "ama setup" command that generates the
            configuration file based on a set of questions (TODO - change to
            try and detect the cluster type and vendor - e.g. Hortonworks HDP,
            AWS EMR, Standalone Mesos, DC/OS, etc.)
         2. Change amaterasu.properties to amaterasu.conf and have it located
            at /etc/amaterasu by default, to conform to how things are usually
            done in many other Apache (and non-Apache) projects.
   2. Amaterasu assets -
      1. *Status at 0.2.0-rc3:*
         When ama-start-mesos.sh or ama-start-yarn.sh is invoked for the first
         time, the relevant dependencies are downloaded (mesos - Spark,
         Miniconda | yarn - Miniconda) into the {AMATERASU_HOME}/dist
         directory; by convention, AMATERASU_HOME=/ama.
      2. *Suggestion for 0.2.1:*
         1. Again, the new CLI "ama setup" command also takes care of
            downloading the relevant dependencies.
         2. We need a default path for {AMATERASU_HOME}, which will hold the
            relevant Amaterasu JARs and any files that are required for
            Amaterasu to work properly (Spark, Miniconda, etc.). What should
            this path be?
            My suggestion based on other projects:
            /usr/share - Apache Marathon
            /usr/lib - Apache Zookeeper
            I tend to go with /usr/share, simply because /usr/lib is intended
            for shared objects. So for example we would have
            /usr/share/amaterasu. Any thoughts?
   3. Distribution methods -
      1. *Status at 0.2.0-rc3:*
         Users who wish to use Amaterasu need to get the distributable TAR
         and manually install and configure Amaterasu.
      2. *Suggestion for 0.2.1:*
         1. The new CLI takes care of some of the things during "ama setup",
            but not everything (see my fork for details:
            https://github.com/nadav-har-tzvi/incubator-amaterasu/tree/feature/ama-cli
            ).
         2. Create an RPM package, so users will be able to add a repository
            and simply "yum install amaterasu".
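As a small sketch of the proposed configuration lookup — /etc/amaterasu first, then the legacy properties file next to the entry point (the function name is hypothetical; the paths are the ones suggested above):

```python
import os

def find_config(candidates=("/etc/amaterasu/amaterasu.conf",
                            "./amaterasu.properties")):
    """Return the first existing configuration file from the search order.

    Prefers the proposed /etc/amaterasu location and falls back to the
    legacy amaterasu.properties beside the entry point scripts.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

Keeping the legacy path in the search order would make the /etc/amaterasu move backward-compatible for existing deployments.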


If you have any thoughts you'd like to share in the matter, please do.

Cheers,
Nadav


[jira] [Updated] (AMATERASU-34) Running Amaterasu during development without a cluster

2018-06-12 Thread Nadav Har Tzvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMATERASU-34?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Har Tzvi updated AMATERASU-34:

Issue Type: New Feature  (was: Task)

> Running Amaterasu during development without a cluster
> --
>
> Key: AMATERASU-34
> URL: https://issues.apache.org/jira/browse/AMATERASU-34
> Project: AMATERASU
>  Issue Type: New Feature
>        Reporter: Nadav Har Tzvi
>Priority: Major
>
> As a pipeline developer, I'd like to be able to run Amaterasu with a local 
> configuration in a way that doesn't require setting up a cluster.
> The reasoning behind this is that I want a quick turnaround for the 
> pipeline as a whole while developing its different components.
> Currently the closest solution is to use the amaterasu-vagrant repository 
> to set up a local Mesos cluster on VirtualBox, but it is still not good 
> enough in my opinion.





[jira] [Updated] (AMATERASU-34) Running Amaterasu during development without a cluster

2018-06-12 Thread Nadav Har Tzvi (JIRA)


 [ 
https://issues.apache.org/jira/browse/AMATERASU-34?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nadav Har Tzvi updated AMATERASU-34:

Issue Type: Improvement  (was: New Feature)

> Running Amaterasu during development without a cluster
> --
>
> Key: AMATERASU-34
> URL: https://issues.apache.org/jira/browse/AMATERASU-34
> Project: AMATERASU
>  Issue Type: Improvement
>        Reporter: Nadav Har Tzvi
>Priority: Major
>
> As a pipeline developer, I'd like to be able to run Amaterasu with a local 
> configuration in a way that doesn't require setting up a cluster.
> The reasoning behind this is that I want a quick turnaround for the 
> pipeline as a whole while developing its different components.
> Currently the closest solution is to use the amaterasu-vagrant repository 
> to set up a local Mesos cluster on VirtualBox, but it is still not good 
> enough in my opinion.





[jira] [Created] (AMATERASU-34) Running Amaterasu during development without a cluster

2018-06-12 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-34:
---

 Summary: Running Amaterasu during development without a cluster
 Key: AMATERASU-34
 URL: https://issues.apache.org/jira/browse/AMATERASU-34
 Project: AMATERASU
  Issue Type: Task
Reporter: Nadav Har Tzvi


As a pipeline developer, I'd like to be able to run Amaterasu with a local 
configuration in a way that doesn't require setting up a cluster.

The reasoning behind this is that I want a quick turnaround for the pipeline 
as a whole while developing its different components.

Currently the closest solution is to use the amaterasu-vagrant repository to 
set up a local Mesos cluster on VirtualBox, but it is still not good enough 
in my opinion.





[jira] [Created] (AMATERASU-32) Investigate why Amaterasu requires minimum of 2G memory to run on DC/OS

2018-06-05 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-32:
---

 Summary: Investigate why Amaterasu requires minimum of 2G memory 
to run on DC/OS
 Key: AMATERASU-32
 URL: https://issues.apache.org/jira/browse/AMATERASU-32
 Project: AMATERASU
  Issue Type: Task
Affects Versions: 0.2.1-incubating
Reporter: Nadav Har Tzvi
 Fix For: 0.2.1-incubating


This is even weirder than the problem we have in EMR. On DC/OS we can't do 
anything without requesting 2G of memory from Mesos, and that's just for the 
job samples.

Why does a standalone deployment of Mesos need 1G of memory while DC/OS 
needs 2G?





[jira] [Created] (AMATERASU-31) Investigate why PySpark actions on EMR require minimum of 2G memory

2018-06-05 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-31:
---

 Summary: Investigate why PySpark actions on EMR require minimum of 
2G memory
 Key: AMATERASU-31
 URL: https://issues.apache.org/jira/browse/AMATERASU-31
 Project: AMATERASU
  Issue Type: Task
Affects Versions: 0.2.0-incubating
Reporter: Nadav Har Tzvi
 Fix For: 0.2.1-incubating








[jira] [Created] (AMATERASU-29) PySpark breaks for jobs without extra configurations

2018-06-04 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-29:
---

 Summary: PySpark breaks for jobs without extra configurations
 Key: AMATERASU-29
 URL: https://issues.apache.org/jira/browse/AMATERASU-29
 Project: AMATERASU
  Issue Type: Bug
Affects Versions: 0.2.0-incubating
Reporter: Nadav Har Tzvi
Assignee: Nadav Har Tzvi
 Fix For: 0.2.1-incubating


PySpark execution breaks when the job.yml is missing the "configuration" key. 
There is an ugly "if" statement in the code that is used for testing; this 
could probably be solved in a more elegant way.
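One common way to drop the special-case branch is to default the missing key instead of guarding it. A minimal sketch, assuming a dict-like parsed job.yml (the function name and structure are illustrative, not taken from the Amaterasu sources):

```python
def extra_configuration(job_def):
    """Return the optional configuration section of a parsed job.yml,
    falling back to an empty mapping when the key is absent or null."""
    return job_def.get("configuration") or {}

with_conf = extra_configuration({"configuration": {"spark.executor.memory": "1g"}})
without_conf = extra_configuration({})
```

Downstream code can then iterate the returned mapping unconditionally, with no "if" guard per call site.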





Re: [VOTE] Amaterasu release 0.2.0-incubating, release candidate #3

2018-05-29 Thread Nadav Har Tzvi
Yaniv, Eyal, this might be related to the same issue you faced with HDP.
Can you confirm?

On Tue, May 29, 2018, 17:58 Arun Manivannan  wrote:

> +1 from me
>
> Unit Tests and Build ran fine.
>
> Tested on HDP (VM) but had trouble allocating containers (didn't have that
> before).  Apparently Centos VMs are known to have this problem. Disabled
> physical memory check  (yarn.nodemanager.pmem-check-enabled) and ran jobs
> successfully.
>
>
>
>
>
> On Tue, May 29, 2018 at 10:42 PM Kirupa Devarajan  >
> wrote:
>
> > Unit tests passing and build was successful on the branch
> > "version-0.2.0-incubating-rc3"
> >
> > +1 from me
> >
> > Cheers,
> > Kirupa
> >
> >
> > On Tue, May 29, 2018 at 3:06 PM, guy peleg  wrote:
> >
> > > +1 looks good to me
> > >
> > > On Tue, May 29, 2018, 14:39 Nadav Har Tzvi 
> > wrote:
> > >
> > > > +1 approve. Tested multiple times and after a long round of fixing
> and
> > > > testing over and over.
> > > >
> > > > Cheers,
> > > > Nadav
> > > >
> > > >
> > > > On 29 May 2018 at 07:38, Yaniv Rodenski  wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We have fixed the legal issues, as well as a bug found by @Nadav
> > please
> > > > > review and vote on the release candidate #3 for the version
> > > > > 0.2.0-incubating, as follows
> > > > >
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > > > >
> > > > > The complete staging area is available for your review, which
> > includes:
> > > > >
> > > > > * JIRA release notes [1],
> > > > > * the official Apache source release to be deployed to
> > dist.apache.org
> > > > > [2],
> > > > > which is signed with the key with fingerprint [3],
> > > > > * source code tag "version-0.2.0-incubating-rc3" [4],
> > > > > * Java artifacts were built with Gradle 3.1 and OpenJDK/Oracle JDK
> > > > > 1.8.0_151
> > > > >
> > > > > The vote will be open for at least 72 hours. It is adopted by
> > majority
> > > > > approval, with at least 3 PMC affirmative votes.
> > > > >
> > > > > Thanks,
> > > > > Yaniv
> > > > >
> > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> > > > > projectId=12321521=12342793
> > > > > [2] https://dist.apache.org/repos/dist/dev/incubator/amaterasu/
> > > 0.2.0rc3/
> > > > > [3]
> https://dist.apache.org/repos/dist/dev/incubator/amaterasu/KEYS
> > > > > [4] https://github.com/apache/incubator-amaterasu/tags
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Amaterasu release 0.2.0-incubating, release candidate #3

2018-05-28 Thread Nadav Har Tzvi
+1 approve. Tested multiple times and after a long round of fixing and
testing over and over.

Cheers,
Nadav


On 29 May 2018 at 07:38, Yaniv Rodenski  wrote:

> Hi everyone,
>
> We have fixed the legal issues, as well as a bug found by @Nadav please
> review and vote on the release candidate #3 for the version
> 0.2.0-incubating, as follows
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2],
> which is signed with the key with fingerprint [3],
> * source code tag "version-0.2.0-incubating-rc3" [4],
> * Java artifacts were built with Gradle 3.1 and OpenJDK/Oracle JDK
> 1.8.0_151
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Yaniv
>
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12321521=12342793
> [2] https://dist.apache.org/repos/dist/dev/incubator/amaterasu/0.2.0rc3/
> [3] https://dist.apache.org/repos/dist/dev/incubator/amaterasu/KEYS
> [4] https://github.com/apache/incubator-amaterasu/tags
>


Re: Speeding up Anaconda deployment on executors

2018-05-27 Thread Nadav Har Tzvi
I am kinda inclined towards a solution that could be performed at setup
time. Let's try to explore in that direction. If we manage to nail this at
setup time, it means that the runtime will be lightning fast (compared to
what it is now :))
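A setup-time approach could boil down to an idempotent install step that later actions simply skip. A rough sketch — the paths and function name here are guesses for illustration, not the actual Amaterasu layout:

```python
import os
import subprocess

def ensure_miniconda(prefix="/ama/miniconda",
                     installer="/ama/dist/Miniconda-latest-Linux-x86_64.sh"):
    """Install Miniconda once per node and reuse it afterwards.

    The early return is the whole point: if a previous action (or an
    "ama setup" run) already installed the environment under `prefix`,
    later actions skip the expensive download/install step entirely.
    """
    if os.path.isdir(prefix):
        return prefix  # cached from an earlier setup/action on this node
    subprocess.check_call(["bash", installer, "-b", "-p", prefix])
    return prefix
```

The same check-then-install shape would work whether the step runs from the CLI at setup time or as a guard at the start of each action.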

Cheers,
Nadav


On 28 May 2018 at 00:17, Nadav Har Tzvi <nadavhart...@gmail.com> wrote:

> Hey everyone,
>
> So we have this issue, Anaconda takes forever to deploy on the executors,
> whether it is YARN or Mesos.
>
> Let's first discuss why is it like this right now.
>
> First, let's see, for each platform, how Apache Amaterasu interacts with
> the underlying platform with regard to the smallest independent unit that
> is awarded its own isolated execution environment.
>
> *Apache Mesos:*
> In Apache Mesos, we get our own nifty set of instances and executors. An
> instance can obviously host multiple executors, depending on its capacity.
> Thus the smallest independent unit here is the executor itself.
>
> *Apache Hadoop YARN:*
> On YARN, we have a similar set of resources: we have nodes, and each node
> hosts containers.
>
> Great, so far it sounds similar, right? Here is where Apache Amaterasu
> takes things a bit differently for each platform.
>
> In Apache Mesos, everything is run on the same executor, regardless of how
> many actions the job has. So if the job has 20 actions, they will run
> sequentially on the same executor, resulting in the smallest independent
> unit being the job itself, as only the job deserves its own running
> environment.
>
> On Hadoop, things are different, a lot.
> To start, each action is treated by YARN as a different application, with
> its own set of containers. This means that on YARN, action is the smallest
> independent unit.
>
> So what's the problem actually? So the problem in general is that we
> cannot rely on the existence of 3rd party utilities, libraries, you name
> it, on the target execution environment. This forces us to bundle anything
> we need along with the job execution process.
> Anaconda is exactly such 3rd party utility that we desperately need in
> order to run PySpark code that has dependencies on more than PySpark itself
> and pure Python. (Pandas, numpy, sklearn, there are more than enough
> examples out there)
> We need to install Anaconda once for each execution environment. In Apache
> Mesos our smallest reliable execution environment is the executor itself,
> thus we need to install Anaconda once per job.
> In YARN, our smallest execution environment is the container, hence we
> need to install Anaconda over and over for each action.
> This obviously poses a problem because of numerous reasons:
> 1. While we can excuse the first action as setup time, it is obvious that
> by the second action we are wasting a lot of time. To compare Mesos and
> YARN: starting the second action on Mesos is a matter of seconds; in YARN
> it is measured in minutes.
> 2. We do the same thing over and over again, even if we run on the same
> machine. This makes no sense whatsoever! We are losing the ability to cache
> things. So for example, if I need numpy and that takes about 20-30 seconds
> to download and install, why do I need to install it from scratch over and
> over again?
> 3. It causes code reliability issues. If Miniconda isn't there and I need
> to roll a PySpark job, I now have to set up guards, fallbacks and whatnot.
> Even worse, I have to find weird tricks to even get access to the
> Miniconda environment, and that differs between Mesos and YARN, so now I
> have a jungle in the code!
> 4. On YARN, PySpark runs in yet another container! Guess what?! This
> container has no access to Miniconda! We currently use --py-files to send a
> list of a gazillion packages. This is different in Mesos, where PySpark
> itself runs in the same executor as the main Amaterasu process.
> So guess what? I now have a jungle in my PySpark invocation code too!
>
> Also note that the current implementation for Python 3rd party dependency
> resolution is Anaconda. This gives us an isolated environment that doesn't
> rely on the existing Python (because maybe, for some reason, you have
> Python 2.5 on your cluster, which is not supported by new versions of data
> libraries such as pandas, numpy and so forth); in addition, it gives us
> the nifty Conda package manager.
> However, that doesn't mean it has to stay this way. If the need or reason
> arises, we may also need to support pip and using the native Python
> version (instead of the one supplied by Anaconda).
>
> I want to discuss the possible solutions to this. Please feel free to
> bring up your ideas.
>
> Cheers,
> Nadav
>
>
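[Editor's note] The per-node caching idea behind points 1 and 2 above can be sketched as follows. This is a minimal sketch, not Amaterasu's actual setup code: the cache directory and the "is it installed" check are hypothetical, and a real script would run the actual Miniconda installer where the comment indicates.

```shell
#!/bin/sh
# Sketch of per-node caching (hypothetical paths): install Miniconda once per
# node and let later containers on the same node reuse it, instead of
# reinstalling for every YARN container.
CONDA_HOME=$(mktemp -d)/miniconda   # in practice: a fixed per-node cache dir

install_miniconda() {
    if [ -x "$CONDA_HOME/bin/python" ]; then
        echo cached                  # a previous container already installed it
    else
        # a real script would run the downloaded installer here, e.g.:
        #   sh miniconda-install.sh -b -p "$CONDA_HOME"
        mkdir -p "$CONDA_HOME/bin"
        : > "$CONDA_HOME/bin/python"
        chmod +x "$CONDA_HOME/bin/python"
        echo installed
    fi
}

r1=$(install_miniconda)   # first container on the node pays the setup cost
r2=$(install_miniconda)   # subsequent containers hit the cache
echo "$r1 $r2"            # installed cached
```

With a guard like this, only the first action on a node pays the download-and-install cost; every later container skips it, which is the caching the thread asks for.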


Re: AMATERASU-24

2018-05-26 Thread Nadav Har Tzvi
I agree with Yaniv that frameworks should be plugins.
Think about it like this: in the future, hopefully, you will be able to do
something like "sudo yum install amaterasu".
After installing the "core" Amaterasu using yum, you will be able to use the
new CLI like this: "ama frameworks add " to add a
framework.
Alternatively, we could do something like "sudo yum install amaterasu-spark".
I mean, this is what I think anyhow.

As I write this, I've just realized that we should open a thread to discuss
packaging options that we'd like to see implemented.

On 26 May 2018 at 22:53, Yaniv Rodenski  wrote:

> Hi Arun,
>
> You are correct Spark is the first framework, and in my mind,
> frameworks should be treated as plugins. Also, we need to consider that not
> all frameworks will run under the JVM.
> Last, each framework has two modules, a runner (used by both the executor
> and the leader) and runtime, to be used by the actions themselves
> I would suggest the following structure to start with:
> frameworks
>   |-> spark
>   |-> runner
>   |-> runtime
>
> As for the shell scripts, I will leave that for @Nadav, but please have a
> look at PR #17 containing the CLI that will replace the scripts as of
> 0.2.1-incubating.
>
> Cheers,
> Yaniv
>
> On Sat, May 26, 2018 at 5:16 PM, Arun Manivannan  wrote:
>
> > Gentlemen,
> >
> > I am looking into Amaterasu-24 and would like to run the intended changes
> > by you before I make them.
> >
> > Refactor Spark out of Amaterasu executor to it's own project
> >  > issues/AMATERASU-24?filter=allopenissues>
> >
> > I understand Spark is just the first of many frameworks that has been
> lined
> > up for support by Amaterasu.
> >
> > These are the intended changes :
> >
> > 1. Create a new module called "runners" and have the Spark runners under
> > executor pulled into this project
> > (org.apache.executor.execution.actions.runners.spark). We could call it
> > "frameworks" if "runners" is not a great name for this.
> > 2. Will also pull away the Spark dependencies from the Executor to the
> > respective sub-sub-projects (at the moment, just Spark).
> > 3. Since the result of the framework modules would be different bundles,
> > the pattern that I am considering to name the bundle is -
> "runner-spark".
> >  So, it would be "runners:runner-spark" in gradle.
> > 4. On the shell scripts (miniconda and "load-spark-env") and the "-cp"
> > passed as commands for the ActionsExecutorLauncher, I could pull them out
> > as separate properties of Spark (inside the runner), so that the
> > Application Master can use them.
> >
> > Is it okay if I rename the Miniconda install file to miniconda-install
> > using "wget -O"? The reason this change is proposed is to avoid
> > hardcoding the conda version inside the code and possibly pull it away into
> > the amaterasu.properties file. (The changes are in the ama-start shell
> > scripts and a couple of places inside the code.)
> >
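[Editor's note] The rename proposed above could look roughly like the sketch below. The "miniconda.version" key and the installer URL layout are illustrative assumptions, not the project's actual property names.

```shell
#!/bin/sh
# Sketch (hypothetical): derive the Miniconda installer URL from a version
# configured in amaterasu.properties instead of hardcoding it in the code.
props=$(mktemp)                      # stands in for amaterasu.properties
printf 'miniconda.version=4.5.4\n' > "$props"

version=$(grep '^miniconda.version=' "$props" | cut -d= -f2)
url="https://repo.continuum.io/miniconda/Miniconda2-${version}-Linux-x86_64.sh"
echo "$url"
# the setup script would then fetch it under a stable, version-free name:
#   wget -O miniconda-install.sh "$url"
rm -f "$props"
```

Downloading under a fixed name ("miniconda-install.sh") means the rest of the scripts never need to know which conda version is configured.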
> > Please let me know if this would work.
> >
> > Cheers,
> > Arun
> >
>
>
>
> --
> Yaniv Rodenski
>
> +61 477 778 405
> ya...@shinto.io
>


[jira] [Created] (AMATERASU-27) ama CLI doesn't take into account amaterasu.properties changes in YARN

2018-05-25 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-27:
---

 Summary: ama CLI doesn't take into account amaterasu.properties 
changes in YARN
 Key: AMATERASU-27
 URL: https://issues.apache.org/jira/browse/AMATERASU-27
 Project: AMATERASU
  Issue Type: Bug
Affects Versions: 0.2.1-incubating
 Environment: any hadoop cluster
Reporter: Nadav Har Tzvi
Assignee: Nadav Har Tzvi
 Fix For: 0.2.1-incubating


To reproduce:
 # On a hadoop cluster
 # Setup Amaterasu
 # Run a job
 # Run ama setup again and change something
 # Run a job. The changed setting will not be taken into account.

How to fix:

We need an indication that amaterasu.properties has changed; it can be any 
mechanism (a boolean flag, keeping a record of the last two file hashes, etc.).

When we execute {{ama run}}, the CLI should check whether there is a new 
version of amaterasu.properties. If there is a new version, upload it to 
HDFS.
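[Editor's note] The change-detection step could be sketched as below. File names are hypothetical and the actual HDFS upload is left as a comment; the point is only the hash-compare-then-upload flow.

```shell
#!/bin/sh
# Sketch (hypothetical file names): upload amaterasu.properties only when its
# hash differs from the hash recorded at the last upload.
props=$(mktemp)                  # stands in for amaterasu.properties
hash_file="$props.sha1"          # records the hash of the last uploaded copy
printf 'zk=localhost\n' > "$props"

upload_if_changed() {
    current=$(sha1sum "$props" | cut -d' ' -f1)
    previous=$(cat "$hash_file" 2>/dev/null || echo none)
    if [ "$current" != "$previous" ]; then
        # a real implementation would run: hdfs dfs -put -f "$props" <dst>
        echo "$current" > "$hash_file"
        echo uploaded
    else
        echo skipped
    fi
}

first=$(upload_if_changed)    # no recorded hash yet -> uploads
second=$(upload_if_changed)   # file unchanged -> skips the upload
echo "$first $second"         # uploaded skipped
```

This keeps {{ama run}} cheap in the common case while still picking up any edit made by a later {{ama setup}}.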

 

Existing workarounds:

executing {{ama run}} with {{--force-bin}} will completely remove the existing 
Amaterasu HDFS assets and upload everything again. While it is not ideal and 
consumes a lot of time (it has to upload the Spark client again), it works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AMATERASU-25) Create documentation with ReadTheDocs

2018-05-16 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-25:
---

 Summary: Create documentation with ReadTheDocs
 Key: AMATERASU-25
 URL: https://issues.apache.org/jira/browse/AMATERASU-25
 Project: AMATERASU
  Issue Type: Task
Affects Versions: 0.2.1-incubating
Reporter: Nadav Har Tzvi
Assignee: Nadav Har Tzvi
 Fix For: 0.2.1-incubating


We need to start filling in documentation for Apache Amaterasu.

We will use readthedocs for this purpose.

We need to set up a /docs directory with rst files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Project status

2018-04-02 Thread Nadav Har Tzvi
way
> >> > > to do it, but I couldn't find such JIRA ticket filed.
> >> > >
> >> > > Emailing one mentor directly (or any other community member)
> >isn't a
> >> way
> >> > to
> >> > > build the community. Things need to be discussed in public
> >whenever
> >> > > possible.
> >> > >
> >> > > Given the above, blaming a mentor (whomever you may be referring
> >to)
> >> > > doesn't make sense.
> >> > >
> >> > > * We are ready to release version 0.2.0-incubating, the reason it
> >took
> >> > us a
> >> > > > month to initiate the process is the above automated build,
> >which I
> >> > > > suggested in prior discussion and had no rejections. We will
> >complete
> >> > > this
> >> > > > once build is enabled.
> >> > > >
> >> > >
> >> > > The release itself is a great milestone, but not the purpose in
> >> > > itself.
> >> > >
> >> > >
> >> > > > * as for community growth, we are working with two
> >organizations on
> >> > > running
> >> > > > POCs (which will hopefully grow the user base) one of them is
> >due to
> >> > > start
> >> > > > very soon. I don't want to name them (first of all it's too
> >early,
> >> and
> >> > > also
> >> > > > it is for them to decide if they want to share) but a
> >representative
> >> > from
> >> > > > at least one of those organisations is on the list and is
> >welcomed to
> >> > > share
> >> > > > :)
> >> > > >
> >> > >
> >> > > Great!
> >> > >
> >> > >
> >> > > > * This year I've seen contributions from 4 contributors (not
> >much
> >> more
> >> > > than
> >> > > > 3, I know) but one of them is new (Guy Peleg) and AFAIK
> >additional
> >> > > > longer-term work is done by one more contributor on his local
> >fork
> >> > (Nadav
> >> > > > Har-Tzvi)
> >> > > >
> >> > >
> >> > > I think this is the crux of the problem. Why is longer-term work
> >going
> >> on
> >> > > in a local fork?
> >> > >
> >> > >
> >> > > > * We should be presenting more, and growing the community more
> >which
> >> is
> >> > > > hard to do starting out as a tiny community. Any advice given
> >there
> >> > would
> >> > > > be appreciated.
> >> > > >
> >> > >
> >> > > The first thing has to be do the basics well: on-list
> >communication,
> >> open
> >> > > discussions, no side channels, etc.
> >> > >
> >> > --
> >> > Yaniv Rodenski
> >> >
> >> > +61 477 778 405
> >> > ya...@shinto.io
> >> >
> >>
>



-- 
=
Nadav Har Tzvi
Committer@Amaterasu


[jira] [Created] (AMATERASU-5) Allocated resources aren't cleaned up properly on crash/unexpected halt

2017-11-13 Thread Nadav Har Tzvi (JIRA)
Nadav Har Tzvi created AMATERASU-5:
--

 Summary: Allocated resources aren't cleaned up properly on 
crash/unexpected halt
 Key: AMATERASU-5
 URL: https://issues.apache.org/jira/browse/AMATERASU-5
 Project: AMATERASU
  Issue Type: Bug
 Environment: Centos 7 in Parallels
2 CPUs allocated
8 GB memory
Reporter: Nadav Har Tzvi
Priority: Critical
 Attachments: Screen Shot 2017-11-13 at 20.44.34.png, Screen Shot 
2017-11-13 at 20.45.24.png

Alright, it goes like this:
Given you have a slave with N CPUs and M memory.
Given that each job requires 1 CPU and X memory.
When you run a job using ama-start.
When you hit Ctrl-C in the middle.
Then the next time you start executing Amaterasu, you will have N-1 CPUs.
And the next time you start executing Amaterasu, you will have M-X memory.

The missing resources are back only after a reboot of the machine. Pretty darn 
problematic, as it will kill slaves in no time.

I attached images displaying some execution trace logs; I am using a VM with 2 
CPUs and 8 GB memory. You will see the number of CPUs dropping from 2 to 1 in 
the first image, and then from 1 to 0 (actually, to not being mentioned at all) 
in the second image. Available memory behaves in a similar way.

I accidentally discovered it while developing ama-cli where I screwed up 
execution quite a bit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Moving forward

2017-10-26 Thread Nadav Har Tzvi
Sure, no objections here. I will open a PR for ama-cli after you are done. 
Actually, after I am done :)

> On 26 Oct 2017, at 2:48, Yaniv Rodenski  wrote:
> 
> OK so GitHub doesn't allow forking empty branches, so I think I'll just
> push the shintoio master to the amaterasu-incubator master as Olivier
> suggested if that's ok with everyone, we have a couple of PRs coming and I
> think we should try to get them to the ASF repo.
> 
> any objections?
> 
> On Thu, Oct 26, 2017 at 4:50 AM, Davor Bonaci  wrote:
> 
>> Many options -- you can simply compare two repositories, and issue a pull
>> request in the GitHub UI.
>> 
>> Alternatively, you can create a fork, clone it locally, set up multiple
>> remotes (to both old and new repositories), rebase and push.
>> 
>> On Wed, Oct 25, 2017 at 1:17 AM, Yaniv Rodenski  wrote:
>> 
>>> Hi Davor,
>>> 
>>> looks like something worth our while.
>>> One question though, how do we issue a PR from a non-forked repo?
>>> 
>>> Cheers,
>>> Yaniv
>>> 
>>> On Tue, Oct 24, 2017 at 6:47 PM, Davor Bonaci  wrote:
>>> 
 (You are also welcome to retain any source history that the original
 repository has -- feel free to push a pull request containing all
>>> commits.
 See Beam's pull request #1 as an example.)
 
 On Sun, Oct 22, 2017 at 5:29 PM, Yaniv Rodenski 
>> wrote:
 
> Thanks Olivier!
> I prefer doing things myself, I’ll push it today.
> 
> On Mon, 23 Oct 2017 at 11:25 am, Olivier Lamy 
>>> wrote:
> 
>> Hi,
>> 
>> On 23 October 2017 at 11:20, Yaniv Rodenski 
>> wrote:
>> 
>>> Hi All,
>>> 
>>> The Podling Report reminder is a really great motivation for
>> moving
>> forward
>>> with Amaterasu :)
>>> I think that on the bootstrapping end we still need to move the
>>> repo
 to
>> the
>>> ASF one.
>>> 
>>> Mentors, do I just do this manually (push local clone) or is
>> there
>>> a
> way
>> to
>>> pull it directly from github?
>> 
>> 
>> You can do it yourself (i.e push the branches you want to save to
>> ASF
>> repo).
>> Or create an INFRA ticket if you want to pull it directly from
>>> github.
>> 
>> 
>>> 
>>> 
>> Also we are very close to version 0.2.0-incubating.
>>> not a lot of open tasks, but I think we should also get our JIRA
>> in
>> order,
>>> I can port everything we have on our old trello, but maybe a
>>> hangouts
>> sync
>>> to go over the open tasks is a good way to push things forward?
>>> 
>>> Cheers,
>>> --
>>> Yaniv Rodenski
>>> 
>> 
>> 
>> 
>> --
>> Olivier Lamy
>> http://twitter.com/olamy | http://linkedin.com/in/olamy
>> 
> --
> Yaniv Rodenski
> 
 
>>> 
>>> 
>>> 
>>> --
>>> Yaniv Rodenski
>>> 
>> 
> 
> 
> 
> -- 
> Yaniv Rodenski