Re: Unsubscribe

2023-12-05 Thread Pat Ferrel
There is no instruction for issues@mahout.apache.org. There are instructions
for user, dev, and commits, but I’ve been getting email from lots of other ASF
lists. Some are Mahout lists, like issues@; others are not, and who knows how
to get off those.

I will assume the magic here is to construct issues-unsubscr...@mahout.apache.org
and send to that address, but why is this necessary? Why do ASF lists NOT allow
an “unsubscribe” subject to start the unsubscribe handshake? How much human time
would be saved on both sides if infra fixed this globally?

Thanks for listening to one more of my rants. Happy Holidays
:-)


> On Dec 5, 2023, at 12:04 PM, Andrew Musselman  wrote:
> 
> Here's how to do it: https://mahout.apache.org/community/mailing-lists.html
> 
> -- Forwarded message -
> From: Pat Ferrel <p...@occamsmachete.com>
> Date: Tue, Dec 5, 2023 at 11:58 AM
> Subject: Unsubscribe
> To: issues@mahout.apache.org
> 
> 
> Unsubscribe



Unsubscribe

2023-12-05 Thread Pat Ferrel
Unsubscribe


[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2020-10-20 Thread Pat Ferrel (Jira)


[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217820#comment-17217820
 ] 

Pat Ferrel commented on MAHOUT-2023:


I don't install Mahout as a shell process. This only occurs when trying to use 
the CLI, so I don't have a good way to test.

At the time this was observed the CLI was by far the most common usage of 
Mahout.

In modern times the CLI may not need to be supported since more robust notebook 
and REPL solutions exist. I would have no problem personally if we wanted to 
remove support for the CLI.

-- BUT --

This would necessitate rewriting lots of the Mahout docs for recommenders and 
I'm not willing to tackle this since publishing the site is in some state of 
blockage afaik.
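
If someone does want to dig into the build side of this bug, a minimal sketch of
the kind of change involved, as an sbt fragment; the scopt coordinates and
version here are my assumptions, not the actual Mahout build, so check the real
build.sbt:

// Pin scopt explicitly so the CLI drivers find its classes at runtime.
// Coordinates and version are illustrative only.
libraryDependencies += "com.github.scopt" %% "scopt" % "3.7.1"

// If the CLI runs from an assembly jar, scopt must not be scoped "provided",
// or it will be absent from the runtime classpath.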

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
>     Environment: any
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 14.2
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Release 14.1, RC7

2020-09-30 Thread Pat Ferrel
Still haven’t had a chance to test since it will take some experimentation
to figure out the jars needed, etc. My test is to replace 0.13 with 0.14.1.

Still, I see no reason to delay the release for my slow testing.

+1


From: Andrew Musselman 

Reply: dev@mahout.apache.org  
Date: September 28, 2020 at 7:31:42 AM
To: Mahout Dev List  
Subject:  Re: [DISCUSS] Release 14.1, RC7

Thanks very much Andy!

On Sun, Sep 27, 2020 at 11:38 PM Andrew Palumbo  wrote:

> All,
>
> Apologies on holding this up a bit; I told Andrew 2x that I was in process
> of testing and 2x got pulled away. I am +1.
>
>
>
> Re: Jakes comments on dev@, I think if we focus on documentation in the
> next release, we can get things clear.
>
>
>
>
>
>
>
>
>
> 
>
> From: Andrew Palumbo 
>
> Sent: Wednesday, September 23, 2020 9:29 PM
>
> To: priv...@mahout.apache.org 
>
> Subject: Re: [DISCUSS] Release 14.1, RC7
>
>
>
> I have a minute tonight, I will test and vote.
>
>
>
>
>
> On Sep 23, 2020 8:47 AM, Andrew Musselman 
> wrote:
>
>
>
> Just a heads up to the Mahout PMC; we have a few beloved lurkers on the
>
> committee who I would love to see at least some release votes from.
>
>
>
> If anyone wants to do a quick screen share to get your current work machine
>
> up and running with this release candidate I am happy to spend time with
>
> you. Verifying a release is an hour of time from start to finish, and can
>
> be less after you're set up.
>
>
>
> Thanks for considering it!
>
>
>
> Best
>
> Andrew
>
>
>
> On Wed, Sep 23, 2020 at 7:44 AM Trevor Grant 
>
> wrote:
>
>
>
> > I'm back- will test tonight I hope.
>
> >
>
> > Pat can give a binding, and knows the most about the SBT- so I'd like to
>
> > see a +1 from him (or -1 if it doesn't work).
>
> >
>
> > I have a binding.
>
> >
>
> > Implicitly, AKM would have a binding +1, but the release master normally
>
> > doesn't vote until the end.
>
> >
>
> > So that would be 3, but it would be worth exploring a new PMC addition.
>
> >
>
> > On Tue, Sep 22, 2020 at 2:25 AM Christofer Dutz <christofer.d...@c-ware.de>
> > wrote:
>
> >
>
> > > Hi all,
>
> > >
>
> > > It’s been 11 days now and so far I can only see 1 non-binding vote … I
>
> > > know that Trevor is on vacation at the moment, but what’s up with the
>
> > > others?
>
> > >
>
> > > And I had a chat with Pat on slack about the SBT thing … I think we should
> > > discuss and whip up a how-to for SBT and Scala users as soon as we have the
> > > release out the door.
>
> > >
>
> > > Chris
>
> > >
>
> > >
>
> >
>
>
>
>


Re: [DISCUSS] Dissolve Apache PredictionIO PMC and move project to the Attic

2020-08-31 Thread Pat Ferrel
To try to keep this on-subject I’ll say that I’ve been working on what I once 
saw as a next-gen PIO. It is ASL 2, and has 2 engines that ran in PIO — most 
notably the Universal Recommender. We offered to make the Harness project part 
of PIO a couple of years back but didn’t get much interest. It is now at 
v0.6.0-SNAPSHOT. The key difference is that it is designed for the user, rather 
than the Data Scientist.

Check Harness out: https://github.com/actionml/harness Contributors are 
welcome. 

We owe everything to PIO where we proved it could be done.



From: Donald Szeto 
Reply: user@predictionio.apache.org 
Date: August 29, 2020 at 3:45:04 PM
To: d...@predictionio.apache.org 
Cc: user@predictionio.apache.org 
Subject:  Re: [DISCUSS] Dissolve Apache PredictionIO PMC and move project to 
the Attic  

It looks like there is no objection. I will start a vote shortly.

Regards,
Donald

On Mon, Aug 24, 2020 at 1:17 PM Donald Szeto  wrote:
Hi all,

The Apache PredictionIO project had an amazing ride back in its early years. 
Unfortunately, its momentum had declined, and its core technology had fallen 
behind. Although we have received some appeal from the community to help bring 
the project up to speed, the effort is not sufficient.

I think it is about time to archive the project. The proper way to do so is to 
follow the Apache Attic process documented at 
http://attic.apache.org/process.html. This discussion thread is the first step. 
If there is no objection, it will be followed by a voting thread.

Existing users: This move should not impact existing functionality, as the 
source code will still be available through the Apache Attic, in a read-only 
state.

Thank you for your continued support over the years. The project would not be 
possible without your help.

Regards,
Donald

Re: [ANNOUNCE] Mahout Con 2020 (A sub-track of ApacheCon @ Home)

2020-08-12 Thread Pat Ferrel
Big fun. Thanks for putting this together.

I’ll abuse my few Twitter followers with the announcement.


From: Trevor Grant 
Reply: u...@mahout.apache.org 
Date: August 12, 2020 at 5:59:45 AM
To: Mahout Dev List , u...@mahout.apache.org 

Subject:  [ANNOUNCE] Mahout Con 2020 (A sub-track of ApacheCon @ Home)  

Hey all,  

We got enough people to volunteer for talks that we are going to be putting  
on our very own track at ApacheCon (@Home) this year!  

Check out the schedule here:  
https://www.apachecon.com/acna2020/tracks/mahout.html  

To see the talks live / in real time, please register at:  
https://hopin.to/events/apachecon-home  

But if you can't make it- we plan on pushing all of the recorded sessions  
to the website after.  

Thanks so much everyone, and can't wait to 'see' you there!  

tg  



Memory allocation

2020-04-17 Thread Pat Ferrel
I have used Spark for several years and realize from recent chatter on this 
list that I don’t really understand how it uses memory.

Specifically: are spark.executor.memory and spark.driver.memory taken from the 
JVM heap? When does Spark take memory from the JVM heap, and when does it take 
it from outside the JVM heap?

Since spark.executor.memory and spark.driver.memory are job params, I have 
always assumed that the required memory was off-JVM-heap.  Or am I on the wrong 
track altogether?
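
As I understand it (and this is part of what I am asking), a minimal sketch with
made-up numbers; spark.memory.offHeap.* is my guess at where off-heap is
actually controlled:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")            // executor JVM heap (on-heap), per executor
  .set("spark.memory.offHeap.enabled", "true")   // opt in to explicit off-heap memory
  .set("spark.memory.offHeap.size", "2g")        // sized outside the JVM heap
// spark.driver.memory only takes effect if it is set before the driver JVM
// starts, e.g. on the spark-submit command line or in spark-defaults.conf.
val spark = SparkSession.builder().config(conf).getOrCreate()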

Can someone point me to a discussion of this?

thanks

Re: IDE suitable for Spark

2020-04-07 Thread Pat Ferrel
IntelliJ Scala works well when debugging master=local. Has anyone used it for 
remote/cluster debugging? I’ve heard it is possible...
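
What I have in mind, as an untested sketch: start the driver JVM with a JDWP
agent and attach IntelliJ's Remote JVM Debug configuration to that port. The
port and settings below are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Suspend the driver until a debugger attaches on port 5005; IntelliJ's
// Remote JVM Debug run configuration then connects to driver-host:5005.
// Like spark.driver.memory, this has to be in place before the driver JVM
// starts, e.g. via spark-submit --conf or spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005")
// Executors could get the same treatment via spark.executor.extraJavaOptions,
// though a fixed port gets awkward with more than one executor per host.
val spark = SparkSession.builder().config(conf).getOrCreate()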


From: Luiz Camargo 
Reply: Luiz Camargo 
Date: April 7, 2020 at 10:26:35 AM
To: Dennis Suhari 
Cc: yeikel valdes , zahidr1...@gmail.com 
, user@spark.apache.org 
Subject:  Re: IDE suitable for Spark  

I have used IntelliJ Spark/Scala with the sbt tool

On Tue, Apr 7, 2020 at 1:18 PM Dennis Suhari  
wrote:
We are using Pycharm resp. R Studio with Spark libraries to submit Spark Jobs. 

Sent from my iPhone

On 07.04.2020 at 18:10, yeikel valdes wrote:



Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it is 
missing a lot of the features that we expect from an IDE.

Thanks for sharing though. 

 On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 

When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.  
Thanks.  

I did actually find one which is a suitable IDE for Spark: Apache Zeppelin.

One of many reasons it is suitable for Apache Spark is the up-and-running stage,
which involves typing bin/zeppelin-daemon.sh start, then going to a browser and
typing http://localhost:8080.
That's it!

Then, to hit the ground running, there are also ready-to-go Apache Spark examples
showing off the type of functionality one will be using in real-life production.

Zeppelin comes with embedded Apache Spark and Scala as the default interpreter, 
along with 20+ interpreters. I have gone on to discover a number of other 
advantages for a real-time production environment with Zeppelin, offered up by 
other Apache products.

Backbutton.co.uk
¯\_(ツ)_/¯  
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



--  


Prof. Luiz Camargo
Educator - Computing



Re: PredictionIO ASF Board Report for Mar 2020

2020-03-19 Thread Pat Ferrel
PredictionIO is scalable BY SCALING ITS SUB-SERVICES. Running on a single
machine sounds like no scaling has been executed or even planned.

How do you scale ANY system?
1) vertical scaling: make the instance larger with more cores, more disk,
and most importantly more memory. Increase whatever resource you need most
but all will be affected eventually.
2) move each service to its own instance. Move the DB, Spark, etc. (depends
on what you are using). Then you can scale the sub-services (the ones PIO
uses) independently as needed.

Without a scaling plan you must trim your data to fit the system you have.
For instance, save only a few months of data. Unfortunately PIO has no
automatic way to do this, like a TTL. We created a template that you can
run to trim your DB by dropping old data. Unfortunately we have not kept up
with PIO versions since we have moved to another ML server that DOES have
TTLs.

If anyone wants to upgrade the template it was last used with PIO 0.12.x
and is here: https://github.com/actionml/db-cleaner

If you continually add data to a bucket it will eventually overflow, how
could it be any other way?



From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: March 19, 2020 at 7:43:08 AM
To: user@predictionio.apache.org 

Subject:  Re: PredictionIO ASF Board Report for Mar 2020

Hello!

My knowledge of PredictionIO is limited. I was able to set up a
PredictionIO server and run two templates on it, the recommendation and
similar item templates. The server is in production in my company and we
were having good results. Suddenly, as we fed data to the server, our
cloud machine memory got full and we can't add new data anymore, nor can we
process this data. An error message on Ubuntu states: "No space left on
device".

I am deploying this server on a single machine without any cluster or the
help of docker. Do you have any suggestion to solve this issue? Also, is
there a way to clean the machine from old data it has?

As a final note, my knowledge of the data engineering and machine learning
field is limited. I understand Scala and can work with Spark. However, I am
willing to dig deeper into PredictionIO. Do you think there is a way I can
contribute to the community in one way or another? Or are you just looking
for true experts in order to avoid moving the project to the Attic?

Regards
Sami Serbey
--
*From:* Donald Szeto 
*Sent:* Tuesday, March 10, 2020 8:26 PM
*To:* user@predictionio.apache.org ;
d...@predictionio.apache.org 
*Subject:* PredictionIO ASF Board Report for Mar 2020

Hi all,

Please take a look at the draft report below and make your comments or
edits as you see fit. The draft will be submitted on Mar 11, 2020.

Regards,
Donald

## Description:
The mission of Apache PredictionIO is the creation and maintenance of
software related to a machine learning server built on top of a
state-of-the-art open source stack that enables developers to manage and
deploy production-ready predictive services for various kinds of machine
learning tasks.

## Issues:
Update: A community member, who's a committer and PMC of another Apache
project, has expressed interest in helping. The member has been engaged and
we are waiting for actions from that member.

Last report: No PMC chair nominee was nominated a week after the PMC chair
expressed
intention to resign from the chair on the PMC mailing list.

## Membership Data:
Apache PredictionIO was founded 2017-10-17 (2 years ago)
There are currently 29 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 8:7.

Community changes, past quarter:
- No new PMC members. Last addition was Andrew Kyle Purtell on 2017-10-17.
- No new committers were added.

## Project Activity:
Sparse activities only on mailing list.

Recent releases:

0.14.0 was released on 2019-03-11.
0.13.0 was released on 2018-09-20.
0.12.1 was released on 2018-03-11.

## Community Health:
Update: A community member, who's a committer and PMC of another Apache
project, has expressed interest in helping. The member has been engaged and
we are waiting for actions from that member to see if a nomination to PMC
and chair would be appropriate.

Last report: We are seeking new leadership for the project at the moment to
bring it out
of maintenance mode. Moving to the attic would be the last option.


Re: Livy on Kubernetes support

2020-01-14 Thread Pat Ferrel
+1 from another user FWIW. We also have Livy containers and Helm charts. The 
real problem is deploying a Spark cluster in k8s. We know of no working images 
for this. The Spark team seems focused on deploying jobs with k8s, which is 
fine but is not enough. We need to deploy Spark itself. We created our own 
containers and charts for this too.

Is anyone interested in sharing images that work with k8s for Livy and/or 
Spark? Ours are all ASF licensed OSS.

From: Marco Gaido 
Reply: dev@livy.incubator.apache.org 
Date: January 14, 2020 at 2:35:34 PM
To: dev@livy.incubator.apache.org 
Subject:  Re: Livy on Kubernetes support  

Hi Aliaksandr,  

thanks for your email and your work on this feature. As I mentioned to you  
in the PR, I agree with you on the usefulness of this feature and you have  
a big +1 from me on having it in Livy. Unfortunately, it is not my area of  
expertise, so I don't feel confident merging it without other reviewers  
taking a careful look at it.  

For the future, I think a better approach would be to first discuss and  
define the architecture with the community, so that it is shared and  
accepted by the whole community before the PR is out. This also helps get  
people involved and makes it easier for them to review the PR. Anyway,  
after you have split the PRs, I think they are reasonable and we can  
discuss them.  

Looking forward to having your contribution in Livy.  

Thanks,  
Marco  

Il giorno mar 14 gen 2020 alle ore 12:48 Aliaksandr Sasnouskikh <  
jahstreetl...@gmail.com> ha scritto:  

> Hi community,  
>  
> About a year ago I've started to work on the patch to Apache Livy for Spark  
> on Kubernetes support in the scope of the project I've been working on.  
> Since that time I've created a PR  
> https://github.com/apache/incubator-livy/pull/167 which have already been  
> discussed and reviewed a lot. After finalizing the work in the result of  
> the PR discussions I've started to split the work introduced in the base PR  
> into smaller pieces to make it easier to separate the core and aux  
> functionality, and as a result - easier to review and merge. The first core  
> PR is https://github.com/apache/incubator-livy/pull/249.  
>  
> Also I've created the repos with Docker images (  
> https://github.com/jahstreet/spark-on-kubernetes-docker) and Helm charts (  
> https://github.com/jahstreet/spark-on-kubernetes-helm) with the possible  
> stack the users may want to use Livy on Kubernetes with, which potentially  
> in the future can be partially moved to Livy repo to keep the artifacts  
> required to run Livy on Kubernetes in a single place.  
>  
> Until now I've received the positive feedback from more than 10 projects  
> about the usage of the patch. Several of them could be found in the  
> discussions of the base PR. Also my repos supporting this feature have  
> around 35 stars and 15 forks in total and were referenced in Spark related  
> Stackoverflow and Kubernetes slack channel discussions. So the users use it  
> already.  
>  
> You may think "What this guy wants from us then!?"... Well, I would like to  
> ask for your time and expertise to help push it forward and ideally make it  
> merged.  
>  
> Probably before I started coding I should have checked with the  
> contributors if this feature may have value for the project and how is  
> better to implement it, but I hope it is never too late;) So I'm here to  
> share with you the the thoughts behind it.  
>  
> The idea of Livy on Kubernetes is simply to replicate the logic it has for  
> Yarn API to Kubernetes API, which can be easily done since the interfaces  
> for the Yarn API are really similar to the ones of the Kubernetes.  
> Nevertheless this easy-to-do patch opens Livy the doors to Kubernetes which  
> seems to be really useful for the community taking into account the  
> popularity of Kubernetes itself and the latest releases of Spark supporting  
> Kubernetes as well.  
>  
> Proposed Livy job submission flow:  
> - Generate appTag and add  
> `spark.kubernetes.[driver/executor].label.SPARK_APP_TAG_LABEL` to Spark  
> config  
> - Run spark-submit in cluster-mode with Kubernetes master  
> - Start monitoring thread which resolves Spark Driver and Executor Pods  
> using the `SPARK_APP_TAG_LABEL`s assigned during the job submission  
> - Create additional Kubernetes resource if necessary: Spark UI service,  
> Ingress, CRDs, etc.  
> - Fetch Spark Pods statuses, Driver logs and other diagnostics information  
> while Spark Pods are running  
> - Remove Spark job resources (completed/failed Driver Pod, Service,  
> ConfigMap, etc.) from the cluster after the job completion/failure after  
> the configured timeout  
>  
> The core functionality (covered by  
> https://github.com/apache/incubator-livy/pull/249):  
> - Submission of Batch jobs and Interactive sessions  
> - Caching Driver logs and Kubernetes Pods diagnostics  
>  
> Aux features (introduced in  

Re: Possible missing mentor(s)

2019-09-01 Thread Pat Ferrel
Seems like some action should be taken before 2 years pass, even if it is to
close the PR because it is not appropriate. Isn’t it the responsibility
of the chair to guard against committer changes where the contributor is
still willing? Or, if a mentor is guiding the PR, they should help it get
unstalled if the contributor is still willing to make changes.

The point (2 year old PRs) seems well taken. The question should be; what
can be done about this?

For what it’s worth, we are just starting to use Livy and wish it were part
of Spark. We would like to treat Spark as a “microservice”, a compute
engine. The Spark team seems to want to make Spark integral to the
architecture of ALL applications that use it. Very odd from our point of
view.

So, to integrate Livy, we deeply hope it doesn’t fall into disrepair, and we
are willing to help when we run into something.


From: Sid Anand  
Reply: dev@livy.incubator.apache.org 

Date: September 1, 2019 at 11:19:00 AM
To: dev@livy.incubator.apache.org 

Subject:  Re: Possible missing mentor(s)

"Second, if someone has a *good* and *large* contribution history, and
actively participates in community, we will add him without doubt. Third,
2-year old open PRs doesn't stand anything, some reviewers left the
community and PRs get staled, it is quite common, especially in large
community."

Meisam has 7 closed and 3 open PRs - of the 4 oldest open PRs in Livy (I
see 4 in 2017), 2 are his. He's ranked #10 in the contributor list --
It's not a large contribution history mostly because it takes so long to
merge and he has been consistently active for 2 years. The size of the
community doesn't seem a factor here with <200 closed PRs and <50
contributors.

How are you prioritizing PR merges if you think having a 2-year-old open PR
is okay and you don't have a ton of open PRs?
-s

On Sun, Sep 1, 2019 at 2:25 AM Saisai Shao  wrote:

> First, we're scaling the PR review, but we only have few active committers,
> so the merge may not be fast.
> Second, if someone has a *good* and *large* contribution history, and
> actively participates in community, we will add him without doubt.
> Third, 2-year old open PRs doesn't stand anything, some reviewers left the
> community and PRs get staled, it is quite common, especially in large
> community.
>
> Sid Anand  于2019年9月1日周日 下午4:46写道:
>
> > Apache projects promote contributors to committers based on
contributions
> > made, not on an expectation of future activity. That's the Apache way
per
> > my understanding. Over time, folks become inactive and busy -- life
> > happens, I get it. May I ask what are you folks doing to scale PR
review
> > and merging? Are you adding new committers? Do you feel that 2-year old
> > open PRs is where you wish to be and is the right way to grow a
> community?
> >
> > On Sun, Sep 1, 2019 at 1:46 AM Sid Anand  wrote:
> >
> > > Apache projects promote contributors to committers based on
> contributions
> > > made, not on an expectation of future activity. That's the Apache way
> per
> > > my understanding. Over time, folks become inactive and busy -- life
> > > happens, I get it. May I ask what are you folks doing to scale PR
> review
> > > and merging? Are you adding new committers? Do you feel that 2-year
> old
> > > open PRs is where you wish to be and is the right way to grow a
> > community?
> > >
> > > -s
> > >
> > > On Sun, Sep 1, 2019 at 12:59 AM Saisai Shao 
> > > wrote:
> > >
> > >> It's unfair to say there's underlying bias. Livy project is a small
> > >> project, the contributor diversity may not be as rich as popular
> project
> > >> like Spark, it is not fair to say that the contributions only limits
> to
> > >> someones, so project is biased. There're many small Apache project
> which
> > >> has only few contributors, can we say those projects are biased?
Also
> > for
> > >> years the committers have joined and left the community, it is hard
to
> > >> track every contribution in time, as we're not a full-time Livy open
> > >> source
> > >> contributors. I also have several PRs left unreviewed for years.
It's
> > >> quite
> > >> common even for large project like Spark, Hadoop, there're so many
> > >> un-merged PRs left for several years. It's unfair to say the project
> is
> > >> biased, unhealthy because of some un-merged PRs.
> > >>
> > >> The community is small but free and open, I would deny that the
> > community
> > >> is unhealthy especially biased, this is an irresponsible and
> subjective
> > >> word.
> > >>
> > >> Sid Anand  于2019年9月1日周日 上午4:20写道:
> > >>
> > >> > Folks!
> > >> > We've (several devs, myself included) contacted the livy dev list
> and
> > >> the
> > >> > owners DL several times. Our PRs stagnated over a few years. Livy
> is a
> > >> > central component in PayPal's Data Infra (Our data footprint is
80+
> > PB).
> > >> > The project seems pretty unhealthy. After a few years, this dev
> moved
> > on
> > >> > and the state of our PR may be harder to define, with both
absentee
> > >> > men

Re: k8s orchestrating Spark service

2019-07-03 Thread Pat Ferrel
Thanks for the in-depth explanation.

These methods would require us to architect our Server around Spark, but the
Server is actually designed to be independent of the ML implementation. SparkML
is an important algo source, to be sure, but so are TensorFlow and non-Spark
Python libs, among others. So Spark stays at arm's length in a
microservices pattern. Doing this with access to job status and management
is why Livy and the (Spark) Job Server exist. To us the ideal is treating
Spark like a compute server that will respond to a service API for job
submittal and control.

None of the above is solved by k8s Spark. Further, we find that the Spark
programmatic API does not support deploy mode = “cluster”. This means we
have to take a simple part of our code and partition it into new jars only
to get spark-submit to work. To help with job tracking and management when
you are not using the programmatic API we look to Livy. I guess if you ask
our opinion of spark-submit, we’d (selfishly) say it hides architectural
issues that should be solved in the Spark programmatic API, but the
popularity of spark-submit is causing the community to avoid these or just
not see or care about them. I guess we’ll see if Spark behind Livy gives us
what we want.

Maybe this is unusual but we see Spark as a service, not an integral
platform. We also see Kubernetes as very important but optional for HA or
when you want to scale horizontally, basically when vertical is not
sufficient. Vertical scaling is more cost effective so Docker Compose is a
nice solution for simpler, Kubernetes-less deployments.

So if we are agnostic about the job master, and communicate through Livy,
we are back to orchestrating services with Docker and Kubernetes. If k8s
becomes a super duper job master, great! But it doesn’t solve today’s
question.


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 5:14:05 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

> We’d like to deploy Spark Workers/Executors and Master (whatever master
is easiest to talk about since we really don’t care) in pods as we do with
the other services we use. Replace Spark Master with k8s if you insist. How
do the executors get deployed?



When running Spark against Kubernetes natively, the Spark library handles
requesting executors from the API server. So presumably one would only need
to know how to start the driver in the cluster – maybe spark-operator,
spark-submit, or just starting the pod and making a Spark context in client
mode with the right parameters. From there, the Spark scheduler code knows
how to interface with the API server and request executor pods according to
the resource requests configured in the app.



> We have a machine Learning Server. It submits various jobs through the
Spark Scala API. The Server is run in a pod deployed from a chart by k8s.
It later uses the Spark API to submit jobs. I guess we find spark-submit to
be a roadblock to our use of Spark and the k8s support is fine but how do
you run our Driver and Executors considering that the Driver is part of the
Server process?



It depends on how the server runs the jobs:

   - If each job is meant to be a separate forked driver pod / process: The
   ML server code can use the SparkLauncher API
   
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html>
   and configure the Spark driver through that API. Set the master to point to
   the Kubernetes API server and set the parameters for credentials according
   to your setup. SparkLauncher is a thin layer on top of spark-submit; a
   Spark distribution has to be packaged with the ML server image and
   SparkLauncher would point to the spark-submit script in said distribution.
   - If all jobs run inside the same driver, that being the ML server: One
   has to start the ML server with the right parameters to point to the
   Kubernetes master. Since the ML server is a driver, one has the option to
   use spark-submit or SparkLauncher to deploy the ML server itself.
   Alternatively one can use a custom script to start the ML server, then the
   ML server process has to create a SparkContext object parameterized against
   the Kubernetes server in question.
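
A minimal sketch of the first option, assuming the SparkLauncher route; the jar
path, class name, container image, and k8s master URL below are placeholders,
not a tested recipe:

import org.apache.spark.launcher.SparkLauncher

// The ML server forks each job as its own driver pod via SparkLauncher.
// Everything in quotes here is illustrative.
val handle = new SparkLauncher()
  .setSparkHome("/opt/spark")                              // spark-submit lives here
  .setMaster("k8s://https://kubernetes.default.svc:443")
  .setDeployMode("cluster")
  .setAppResource("local:///opt/jobs/my-job.jar")
  .setMainClass("com.example.MyJob")
  .setConf("spark.kubernetes.container.image", "example/spark:2.4.0")
  .setConf("spark.executor.instances", "2")
  .startApplication()                                      // returns a SparkAppHandle
// handle.getState can be polled, or listeners passed to startApplication(...)
// can drive job tracking from the server process.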



I hope this helps!



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Monday, July 1, 2019 at 5:05 PM
*To: *"user@spark.apache.org" , Matt Cheah <
mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service



We have a machine Learning Server. It submits various jobs through the
Spark Scala API. The Server is run in a pod deployed from a chart by k8s.
It later uses the Spark API to submit jobs. I guess we find spark-submit to
be a roadblock to our use of Spark and the k8s support is fine but how do
you run our Driver and Executors considering that the Driver is part of the
Server process?



Maybe we are talking past each other with some mistaken assumptions (on my
part perhaps).







From: Pat

Re: JAVA_HOME is not set

2019-07-03 Thread Pat Ferrel
Oops, should have said: "I may have missed something but I don’t recall PIO
being released by Apache as an ASF maintained container/image release
artifact."


From: Pat Ferrel  
Reply: user@predictionio.apache.org 

Date: July 3, 2019 at 11:16:43 AM
To: Wei Chen  ,
d...@predictionio.apache.org 
, user@predictionio.apache.org
 
Subject:  Re: JAVA_HOME is not set

BTW the container you use is supported by the container author, if at all.

I may have missed something but I don’t recall PIO being released by Apache
as an ASF maintained release artifact.

I wish ASF projects would publish Docker Images made for real system
integration, but IIRC PIO does not.


From: Wei Chen  
Reply: d...@predictionio.apache.org 

Date: July 2, 2019 at 5:14:38 PM
To: user@predictionio.apache.org 

Cc: d...@predictionio.apache.org 

Subject:  Re: JAVA_HOME is not set

Add these steps in your docker file.
https://vitux.com/how-to-setup-java_home-path-in-ubuntu/

Best Regards
Wei

On Wed, Jul 3, 2019 at 5:06 AM Alexey Grachev <
alexey.grac...@turbinekreuzberg.com> wrote:

> Hello,
>
>
> I installed newest pio 0.14 with docker. (ubuntu 18.04)
>
> After starting pio -> I get "JAVA_HOME is not set"
>
>
> Does anyone know where in docker config I have to setup the JAVA_HOME
> env variable?
>
>
> Thanks a lot!
>
>
> Alexey
>
>
>


Re: JAVA_HOME is not set

2019-07-03 Thread Pat Ferrel
BTW the container you use is supported by the container author, if at all.

I may have missed something but I don’t recall PIO being released by Apache
as an ASF maintained release artifact.

I wish ASF projects would publish Docker Images made for real system
integration, but IIRC PIO does not.


From: Wei Chen  
Reply: d...@predictionio.apache.org 

Date: July 2, 2019 at 5:14:38 PM
To: user@predictionio.apache.org 

Cc: d...@predictionio.apache.org 

Subject:  Re: JAVA_HOME is not set

Add these steps in your docker file.
https://vitux.com/how-to-setup-java_home-path-in-ubuntu/

Best Regards
Wei

On Wed, Jul 3, 2019 at 5:06 AM Alexey Grachev <
alexey.grac...@turbinekreuzberg.com> wrote:

> Hello,
>
>
> I installed newest pio 0.14 with docker. (ubuntu 18.04)
>
> After starting pio -> I get "JAVA_HOME is not set"
>
>
> Does anyone know where in docker config I have to setup the JAVA_HOME
> env variable?
>
>
> Thanks a lot!
>
>
> Alexey
>
>
>


Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
We have a machine Learning Server. It submits various jobs through the
Spark Scala API. The Server is run in a pod deployed from a chart by k8s.
It later uses the Spark API to submit jobs. I guess we find spark-submit to
be a roadblock to our use of Spark and the k8s support is fine but how do
you run our Driver and Executors considering that the Driver is part of the
Server process?

Maybe we are talking past each other with some mistaken assumptions (on my
part perhaps).



From: Pat Ferrel  
Reply: Pat Ferrel  
Date: July 1, 2019 at 4:57:20 PM
To: user@spark.apache.org  , Matt
Cheah  
Subject:  Re: k8s orchestrating Spark service

k8s as master would be nice but doesn’t solve the problem of running the
full cluster and is an orthogonal issue.

We’d like to deploy Spark Workers/Executors and Master (whatever master is
easiest to talk about since we really don’t care) in pods as we do with the
other services we use. Replace Spark Master with k8s if you insist. How do
the executors get deployed?

We have our own containers that almost work for 2.3.3. We have used this
before with older Spark so we are reasonably sure it makes sense. We just
wonder if our own image builds and charts are the best starting point.

Does anyone have something they like?


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

Sorry, I don’t quite follow – why use the Spark standalone cluster as an
in-between layer when one can just deploy the Spark application directly
inside the Helm chart? I’m curious as to what the use case is, since I’m
wondering if there’s something we can improve with respect to the native
integration with Kubernetes here. Deploying on Spark standalone mode in
Kubernetes is, to my understanding, meant to be superseded by the native
integration introduced in Spark 2.4.



*From: *Pat Ferrel 
*Date: *Monday, July 1, 2019 at 4:40 PM
*To: *"user@spark.apache.org" , Matt Cheah <
mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service



Thanks Matt,



Actually I can’t use spark-submit. We submit the Driver programmatically
through the API. But this is not the issue and using k8s as the master is
also not the issue though you may be right about it being easier, it
doesn’t quite get to the heart.



We want to orchestrate a bunch of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.




From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service



I would recommend looking into Spark’s native support for running on
Kubernetes. One can just start the application against Kubernetes directly
using spark-submit in cluster mode or starting the Spark context with the
right parameters in client mode. See
https://spark.apache.org/docs/latest/running-on-kubernetes.html



I would think that building Helm around this architecture of running Spark
applications would be easier than running a Spark standalone cluster. But
admittedly I’m not very familiar with the Helm technology – we just use
spark-submit.



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" 
*Subject: *k8s orchestrating Spark service



We're trying to setup a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.



Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.



So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.



Thanks


Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
k8s as master would be nice but doesn’t solve the problem of running the
full cluster and is an orthogonal issue.

We’d like to deploy Spark Workers/Executors and Master (whatever master is
easiest to talk about since we really don’t care) in pods as we do with the
other services we use. Replace Spark Master with k8s if you insist. How do
the executors get deployed?

We have our own containers that almost work for 2.3.3. We have used this
before with older Spark so we are reasonably sure it makes sense. We just
wonder if our own image builds and charts are the best starting point.

Does anyone have something they like?


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

Sorry, I don’t quite follow – why use the Spark standalone cluster as an
in-between layer when one can just deploy the Spark application directly
inside the Helm chart? I’m curious as to what the use case is, since I’m
wondering if there’s something we can improve with respect to the native
integration with Kubernetes here. Deploying on Spark standalone mode in
Kubernetes is, to my understanding, meant to be superseded by the native
integration introduced in Spark 2.4.



*From: *Pat Ferrel 
*Date: *Monday, July 1, 2019 at 4:40 PM
*To: *"user@spark.apache.org" , Matt Cheah <
mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service



Thanks Matt,



Actually I can’t use spark-submit. We submit the Driver programmatically
through the API. But this is not the issue and using k8s as the master is
also not the issue though you may be right about it being easier, it
doesn’t quite get to the heart.



We want to orchestrate a bunch of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.




From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service



I would recommend looking into Spark’s native support for running on
Kubernetes. One can just start the application against Kubernetes directly
using spark-submit in cluster mode or starting the Spark context with the
right parameters in client mode. See
https://spark.apache.org/docs/latest/running-on-kubernetes.html



I would think that building Helm around this architecture of running Spark
applications would be easier than running a Spark standalone cluster. But
admittedly I’m not very familiar with the Helm technology – we just use
spark-submit.



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" 
*Subject: *k8s orchestrating Spark service



We're trying to setup a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.



Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.



So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.



Thanks


Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
Thanks Matt,

Actually I can’t use spark-submit. We submit the Driver programmatically
through the API. But this is not the issue and using k8s as the master is
also not the issue though you may be right about it being easier, it
doesn’t quite get to the heart.

We want to orchestrate a bunch of services including Spark. The rest work,
we are asking if anyone has seen a good starting point for adding Spark as
a k8s managed service.


From: Matt Cheah  
Reply: Matt Cheah  
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel  ,
user@spark.apache.org  
Subject:  Re: k8s orchestrating Spark service

I would recommend looking into Spark’s native support for running on
Kubernetes. One can just start the application against Kubernetes directly
using spark-submit in cluster mode or starting the Spark context with the
right parameters in client mode. See
https://spark.apache.org/docs/latest/running-on-kubernetes.html



I would think that building Helm around this architecture of running Spark
applications would be easier than running a Spark standalone cluster. But
admittedly I’m not very familiar with the Helm technology – we just use
spark-submit.



-Matt Cheah

*From: *Pat Ferrel 
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" 
*Subject: *k8s orchestrating Spark service



We're trying to setup a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.



Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.



So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.



Thanks


k8s orchestrating Spark service

2019-06-30 Thread Pat Ferrel
We're trying to set up a system that includes Spark. The rest of the
services have good Docker containers and Helm charts to start from.

Spark on the other hand is proving difficult. We forked a container and
have tried to create our own chart but are having several problems with
this.

So back to the community… Can anyone recommend a Docker Container + Helm
Chart for use with Kubernetes to orchestrate:

   - Spark standalone Master
   - several Spark Workers/Executors

This is not a request to use k8s to orchestrate Spark Jobs, but the service
cluster itself.

Thanks


Re: Support for Livy with Scala 2.12

2019-06-03 Thread Pat Ferrel
Spark 2.4.x does not require Scala 2.12; in fact it is marked as
“experimental” here:
https://spark.apache.org/releases/spark-release-2-4-0.html

Moving to a new Scala version is often a pain, because the libs you use may
not be upgraded, and versions matter (unlike typical Java updates). Scala
creates JVM objects and names them as it pleases. Sometimes naming changes
from version to version of Scala, and this causes big problems when using
mixed libs from different versions of Scala.

I’m no expert in Livy, but I imagine you may need to build against a newer
Spark. But avoid Scala 2.12 for now.
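
To make the version coupling concrete, a minimal sbt sketch (versions here are
illustrative): the %% operator appends the Scala binary version to every
artifact name, which is why every dependency has to exist for the Scala version
you pick.

// build.sbt sketch: %% resolves to artifact names like spark-core_2.11,
// so every library must publish a _2.12 artifact before a move to 2.12 works.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
)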

From: santosh.dan...@ubs.com 

Reply: user@livy.incubator.apache.org 

Date: June 3, 2019 at 12:51:20 PM
To: user@livy.incubator.apache.org 

Subject:  Support for Livy with Scala 2.12

Hi,



We have just upgraded our Spark cluster from version 2.3 to 2.4.2 and it broke
Livy. It's throwing the exception "Cannot find Livy REPL jars". Looks like I
have to build Livy with Scala 2.12.



Can anyone advise how to build Livy with Scala 2.12 using Maven? Will
changing the Scala version from 2.11 to 2.12 be enough to build Livy? Please
advise.







The code failed because of a fatal error:

Invalid status code '400' from http://localhost:8998/sessions
with error payload: {"msg":"requirement failed: Cannot find Livy REPL
jars."}.



Thanks
Santosh



Re: run new spark version on old spark cluster ?

2019-05-20 Thread Pat Ferrel
It is always dangerous to run a NEWER version of code on an OLDER cluster.
The danger increases with the size of the semver change, and this one is not
just a build number. In other words, 2.4 is considered to be a fairly major
change from 2.3. Not much else can be said.


From: Nicolas Paris  
Reply: user@spark.apache.org  
Date: May 20, 2019 at 11:02:49 AM
To: user@spark.apache.org  
Subject:  Re: run new spark version on old spark cluster ?

> you will need the spark version you intend to launch with on the machine
you
> launch from and point to the correct spark-submit

does this mean to install a second spark version (2.4) on the cluster ?

thanks

On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> yarn can happily run multiple spark versions side-by-side
> you will need the spark version you intend to launch with on the machine
you
> launch from and point to the correct spark-submit
>
> On Mon, May 20, 2019 at 1:50 PM Nicolas Paris 
wrote:
>
> Hi
>
> I am wondering whether that's feasible to:
> - build a spark application (with sbt/maven) based on spark2.4
> - deploy that jar on yarn on a spark2.3 based installation
>
> thanks by advance,
>
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
nicolas

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Fwd: Spark Architecture, Drivers, & Executors

2019-05-17 Thread Pat Ferrel
In order to create an application that executes code on Spark we have a
long-lived process. It periodically runs jobs programmatically on a Spark
cluster, meaning it does not use spark-submit. The jobs it executes have
varying requirements for memory, so we want to have the Spark Driver run in
the cluster.

This kind of architecture does not work very well with Spark as we
understand it. The issue is that there is no way to run in
deployMode=cluster. This setting is ignored when launching jobs
programmatically (why is it not an exception?). This in turn means that our
launching application needs to be run on a machine that is big enough to
run the worst-case Spark Driver. This is completely impractical for our
use case (a generic, always-on Machine Learning Server).

What we would rather do is have the Scala closure that has access to the
Spark Context be treated as the Spark Driver and run in the cluster. There
seems to be no way to do this with off-the-shelf Spark.

This seems like a very common use case but maybe we are too close to it. We
are aware of the Job Server and Apache Livy, which seem to give us what we
need.
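
For anyone in the same spot, a rough sketch of what the Livy route looks like
from a long-lived JVM process; the host, jar path, and class name are
placeholders:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// POST a batch job to Livy's /batches endpoint; Livy then runs spark-submit
// in cluster deploy mode on our behalf.
val payload =
  """{"file": "hdfs:///jobs/my-job.jar", "className": "com.example.MyJob"}"""
val conn = new URL("http://livy-host:8998/batches")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
println(s"Livy responded: ${conn.getResponseCode}")   // expect 201 Created
conn.disconnect()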

Are these the best solutions? Is there a way to do what we want without
spark-submit? Have others here solved this in some other way?


Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Pat Ferrel
Streams have no end until watermarked or closed. Joins need bounded
datasets, et voila. Something tells me you should consider the streaming
nature of your data and whether your joins need to use increments/snippets
of infinite streams or to re-join the entire contents of the streams
accumulated at checkpoints.
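
A minimal Scala sketch of the pattern I mean; entity1DF/entity2DF are your
streaming DataFrames, the join keys and the one-hour bound are placeholders,
and the event time is kept as a top-level column so the watermark survives the
nesting:

import org.apache.spark.sql.functions.{col, expr, struct}

val e1 = entity1DF
  .select(struct("*").alias("entity1"),
          col("LAST_MODIFICATION").alias("e1_time"))
  .withWatermark("e1_time", "10 minutes")   // watermark on a top-level column

val e2 = entity2DF
  .select(struct("*").alias("entity2"),
          col("LAST_MODIFICATION").alias("e2_time"))
  .withWatermark("e2_time", "10 minutes")

// Outer joins also need a time bound in the join condition so Spark can
// eventually emit the unmatched side; ID and ENTITY1_ID are made-up keys.
val joined = e1.join(
  e2,
  expr("entity1.ID = entity2.ENTITY1_ID AND " +
       "e2_time BETWEEN e1_time AND e1_time + interval 1 hour"),
  "leftOuter")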


From: Joe Ammann  
Reply: Joe Ammann  
Date: May 6, 2019 at 6:45:13 AM
To: user@spark.apache.org  
Subject:  Spark structured streaming watermarks on nested attributes

Hi all

I'm pretty new to Spark and implementing my first non-trivial structured
streaming job with outer joins. My environment is a Hortonworks HDP 3.1
cluster with Spark 2.3.2, working with Python.

I understood that I need to provide watermarks and join conditions for left
outer joins to work. All my incoming Kafka streams have an attribute
"LAST_MODIFICATION" which is well suited to indicate the event time, so I
chose that for watermarking. Since I'm joining from multiple topics where
the incoming messages have common attributes, I though I'd prefix/nest all
incoming messages. Something like

entity1DF.select(struct("*").alias("entity1")).withWatermark("entity1.LAST_MODIFICATION")

entity2DF.select(struct("*").alias("entity2")).withWatermark("entity2.LAST_MODIFICATION")


Now when I try to join such 2 streams, it would fail and tell me that I
need to use watermarks

When I leave the watermarking attribute "at the top level", everything
works as expected, e.g.

entity1DF.select(struct("*").alias("entity1"),
col("LAST_MODIFICATION").alias("entity1_LAST_MODIFICATION")).withWatermark("entity1_LAST_MODIFICATION")


Before I hunt this down any further, is this kind of a known limitation? Or
am I doing something fundamentally wrong?

-- 
CU, Joe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Pat Ferrel
@Riccardo

Spark does not do the DL learning part of the pipeline (AFAIK), so it is
limited to data ingestion and transforms (ETL). It is therefore optional,
and other ETL options might be better for you.

Most of the technologies @Gourav mentions have their own scaling based on
their own compute engines specialized for their DL implementations, so be
aware that Spark scaling has nothing to do with scaling most of the DL
engines; they have their own solutions.

From: Gourav Sengupta 

Reply: Gourav Sengupta 

Date: May 4, 2019 at 10:24:29 AM
To: Riccardo Ferrari  
Cc: User  
Subject:  Re: Deep Learning with Spark, what is your experience?

Try using MxNet and Horovod directly as well (I think that MXNet is worth a
try as well):
1.
https://medium.com/apache-mxnet/distributed-training-using-apache-mxnet-with-horovod-44f98bf0e7b7
2.
https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/rel_19-01.html
3. https://aws.amazon.com/mxnet/
4.
https://aws.amazon.com/blogs/machine-learning/aws-deep-learning-amis-now-include-horovod-for-faster-multi-gpu-tensorflow-training-on-amazon-ec2-p3-instances/


Ofcourse Tensorflow is backed by Google's advertisement team as well
https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/


Regards,




On Sat, May 4, 2019 at 10:59 AM Riccardo Ferrari  wrote:

> Hi list,
>
> I am trying to understand if it makes sense to leverage Spark as an enabling
> platform for Deep Learning.
>
> My open questions to you are:
>
>- Do you use Apache Spark in your DL pipelines?
>- How do you use Spark for DL? Is it just a stand-alone stage in the
>workflow (i.e. a data preparation script) or is it more integrated?
>
> I see a major advantage in leveraging Spark as a unified entry point;
> for example, you can easily abstract data sources and leverage existing
> team skills for data pre-processing and training. On the flip side you may
> hit some limitations, including supported versions and so on.
> What is your experience?
>
> Thanks!
>


Livy with Standalone Spark Master

2019-04-20 Thread Pat Ferrel
Does Livy work with a Standalone Spark Master?


Re: Source build of PredictionIO with Hadoop 3.x and Hbase 2.x

2019-04-17 Thread Pat Ferrel
Upgrading to the latest is usually not needed and can cause problems. Code
built for Spark 2.1 will almost always run on Spark 2.4, so why build for 2.4?

The magic combination in the default build is very likely to support your
newer versions of installed services since they are all backwards
compatible to some extent. Don’t make trouble for yourself.

Also some things will not work on newer versions. For instance ES 6 made
significant query changes vs ES 5 and so the UR template does not yet work
with ES 6.

My advice is always use the default build and plan to run on the latest
stable services. So install Spark 2.4 but build for whatever is in the
build.sbt. This is especially true since all templates have a build.sbt
also and will need to be upgraded as you did for PIO.

Avoid this time sink and use the defaults unless absolutely required. But
deploy newer, more stable versions when they don't cross a major version
number.
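
To make the version-alignment point concrete, here is a hedged sketch of the kind of
dependency block a template's build.sbt carries; the version numbers below are only
placeholders, so copy whatever PIO's own build.sbt actually uses:

  // Keep these aligned with the PIO build you run against (versions are illustrative)
  libraryDependencies ++= Seq(
    "org.apache.predictionio" %% "apache-predictionio-core" % "0.14.0" % "provided",
    "org.apache.spark"        %% "spark-core"               % "2.1.1"  % "provided",
    "org.apache.spark"        %% "spark-mllib"              % "2.1.1"  % "provided"
  )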


From: Selvaraju Sellamuthu 

Reply: user@predictionio.apache.org 

Date: April 17, 2019 at 2:08:45 PM
To: user@predictionio.apache.org 

Subject:  Source build of PredictionIO with Hadoop 3.x and Hbase 2.x

Hi Team,

I tried to rebuild the PredictionIO package with Spark 2.3.2, Elasticsearch
6.x, Hadoop 3.x and HBase 2.x. It seems PredictionIO supports Hadoop 2.x and
HBase 1.x. I am getting the errors below. Has anyone tried upgrading the
dependencies?

I need guidance on building the packages one by one from the PredictionIO
source.


[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBEventsUtil.scala:166:14:
too many arguments for method add: (x$1:
org.apache.hadoop.hbase.Cell)org.apache.hadoop.hbase.client.Put
[error]   put.add(eBytes, col, Bytes.toBytes(v))
[error]  ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBEventsUtil.scala:170:14:
too many arguments for method add: (x$1:
org.apache.hadoop.hbase.Cell)org.apache.hadoop.hbase.client.Put
[error]   put.add(eBytes, col, Bytes.toBytes(v))
[error]  ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBLEvents.scala:45:60:
not found: type HTableInterface
[error]   def getTable(appId: Int, channelId: Option[Int] = None):
HTableInterface =
[error]^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/StorageClient.scala:29:8:
object HConnection is not a member of package org.apache.hadoop.hbase.client
[error] import org.apache.hadoop.hbase.client.HConnection
[error]^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/StorageClient.scala:36:19:
not found: type HConnection
[error]   val connection: HConnection,
[error]   ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBPEvents.scala:23:8:
object TableInputFormat is not a member of package
org.apache.hadoop.hbase.mapreduce
[error] import org.apache.hadoop.hbase.mapreduce.{TableInputFormat,
TableOutputFormat}
[error]^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBPEvents.scala:34:57:
type mismatch;
[error]  found   : String
[error]  required: org.apache.hadoop.hbase.TableName
[error] if (!client.admin.tableExists(HBEventsUtil.tableName(namespace,
appId, channelId))) {
[error] ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBPEvents.scala:63:14:
not found: value TableInputFormat
[error] conf.set(TableInputFormat.INPUT_TABLE,
[error]  ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBPEvents.scala:78:14:
not found: value TableInputFormat
[error] conf.set(TableInputFormat.SCAN,
PIOHBaseUtil.convertScanToString(scan))
[error]  ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBPEvents.scala:81:48:
not found: type TableInputFormat
[error] val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
[error]^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/src/main/scala/org/apache/predictionio/data/storage/hbase/HBPEvents.scala:84:55:
type mismatch;
[error]  found   : Any
[error]  required: org.apache.hadoop.hbase.client.Result
[error] case (key, row) => HBEventsUtil.resultToEvent(row, appId)
[error]   ^
[error]
/Users/admin/pionew/PredictionIO-0.14.0/storage/hbase/s

Re: new install help

2019-04-15 Thread Pat Ferrel
Most people running on a Windows machine use a VM running Linux. You will
run into constant issues if you go down another road with something like
cygwin, so avoid the headache.


From: Steve Pruitt  
Reply: user@predictionio.apache.org 

Date: April 15, 2019 at 10:59:09 AM
To: user@predictionio.apache.org 

Subject:  new install help

I installed on a Windows 10 box.  A couple of questions and then a problem
I have.



I downloaded the binary distribution.

I already had Spark installed, so I changed pio-env.sh to point to my Spark.

I downloaded and installed Postgres.  I downloaded the jdbc driver and put
it in the PredictionIO-0.14.0\lib folder.



My questions are:

Reading the PIO install directions I cannot tell if ElasticSearch and HBase
are optional.  The pio-env.sh file has references to them commented out and
the PIO install page makes mention of skipping them if not using them.  So,
I didn’t install them.



When I tried executing PredictionIO-0.14.0\bin\pio eventserver & command
from the command line, I got this error

'PredictionIO-0.14.0\bin\pio' is not recognized as an internal or external
command, operable program or batch file.



Oops.  I think my assumption PIO runs on Windows is bad.  I want to confirm
it’s not something I overlooked.



-S


Why not a Top Level Project?

2019-04-08 Thread Pat Ferrel
To slightly oversimplify, all it takes to be a TLP for Apache is:
1) clear community support
2) a couple Apache members to sponsor (Incubator members help)
3) demonstrated processes that follow the Apache way
4) the will of committers and PMC to move to TLP

What is missing in Livy?

I am starting to use Livy but, like anyone who sees the “incubator” label,
will be overly cautious. There is a clear need for this project beyond the use
cases mentioned. For instance we have a Machine Learning Server that tries
to be compute engine neutral but practically speaking uses Spark and HDFS
for several algorithms. We would have a hard time scaling a service that
runs the Spark Driver in the server process. The solution may well be Livy.
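
To be concrete, the kind of hand-off we have in mind is Livy's batch REST endpoint,
roughly like this hedged sketch (host, jar path, class name, and conf values are
placeholders):

  curl -s -X POST -H 'Content-Type: application/json' \
    -d '{
          "file": "hdfs:///jobs/ml-server-engine-assembly.jar",
          "className": "com.example.TrainJob",
          "conf": { "spark.executor.memory": "4g" }
        }' \
    http://livy-host:8998/batches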

Here’s hoping Livy becomes a TLP

- Pat


Re: make-distribution.sh fails

2019-04-03 Thread Pat Ferrel
MacOS


From: Jonathan Barlow  
Reply: user@predictionio.apache.org 

Date: April 2, 2019 at 8:30:25 PM
To: user@predictionio.apache.org 

Subject:  Re: make-distribution.sh fails

What operating system are you using? Could be differences in sbt options

On Tue, Apr 2, 2019 at 3:59 PM Pat Ferrel  wrote:

> Trying to create the PIO 0.14.0 binary from the official source tarball, I
> geet the following odd error.
>
> Maclaurin:apache-predictionio-0.14.0 pat$ ./make-distribution.sh
> Building binary distribution for PredictionIO 0.14.0...
> + sbt/sbt clean
> [info] Loading settings for project apache-predictionio-0-14-0-build from
> assembly.sbt,plugins.sbt,unidoc.sbt ...
> [info] Loading project definition from
> /Users/pat/apache-predictionio-0.14.0/project
> [info] Loading settings for project root from build.sbt ...
> [info] Loading settings for project assembly from build.sbt ...
> [info] Loading settings for project dataElasticsearch from build.sbt ...
> [info] Loading settings for project tools from build.sbt ...
> [info] Loading settings for project e2 from build.sbt ...
> [info] Loading settings for project core from build.sbt ...
> [info] Loading settings for project data from build.sbt ...
> [info] Loading settings for project common from build.sbt ...
> [info] Loading settings for project dataS3 from build.sbt ...
> [info] Loading settings for project dataLocalfs from build.sbt ...
> [info] Loading settings for project dataJdbc from build.sbt ...
> [info] Loading settings for project dataHdfs from build.sbt ...
> [info] Loading settings for project dataHbase from build.sbt ...
> [info] Loading settings for project dataElasticsearch1 from build.sbt ...
> [info] Set current project to apache-predictionio-parent (in build
> file:/Users/pat/apache-predictionio-0.14.0/)
> [success] Total time: 0 s, completed Apr 2, 2019 1:19:20 PM
> [error] Expected symbol
> [error] Not a valid command: -
> [error] Expected end of input.
> [error] Expected '--'
> [error] Expected 'debug'
> [error] Expected 'info'
> [error] Expected 'warn'
> [error] Expected 'error'
> [error] Expected 'addPluginSbtFile'
> [error] -Xms1024M
> [error]  ^
>
>


make-distribution.sh fails

2019-04-02 Thread Pat Ferrel
Trying to create the PIO 0.14.0 binary from the official source tarball, I
get the following odd error.

Maclaurin:apache-predictionio-0.14.0 pat$ ./make-distribution.sh
Building binary distribution for PredictionIO 0.14.0...
+ sbt/sbt clean
[info] Loading settings for project apache-predictionio-0-14-0-build from
assembly.sbt,plugins.sbt,unidoc.sbt ...
[info] Loading project definition from
/Users/pat/apache-predictionio-0.14.0/project
[info] Loading settings for project root from build.sbt ...
[info] Loading settings for project assembly from build.sbt ...
[info] Loading settings for project dataElasticsearch from build.sbt ...
[info] Loading settings for project tools from build.sbt ...
[info] Loading settings for project e2 from build.sbt ...
[info] Loading settings for project core from build.sbt ...
[info] Loading settings for project data from build.sbt ...
[info] Loading settings for project common from build.sbt ...
[info] Loading settings for project dataS3 from build.sbt ...
[info] Loading settings for project dataLocalfs from build.sbt ...
[info] Loading settings for project dataJdbc from build.sbt ...
[info] Loading settings for project dataHdfs from build.sbt ...
[info] Loading settings for project dataHbase from build.sbt ...
[info] Loading settings for project dataElasticsearch1 from build.sbt ...
[info] Set current project to apache-predictionio-parent (in build
file:/Users/pat/apache-predictionio-0.14.0/)
[success] Total time: 0 s, completed Apr 2, 2019 1:19:20 PM
[error] Expected symbol
[error] Not a valid command: -
[error] Expected end of input.
[error] Expected '--'
[error] Expected 'debug'
[error] Expected 'info'
[error] Expected 'warn'
[error] Expected 'error'
[error] Expected 'addPluginSbtFile'
[error] -Xms1024M
[error]  ^


Re: Wrong FS: file:/home/aml/ur/engine.json expected: hdfs://localhost:9000

2019-03-29 Thread Pat Ferrel
Templates have their own build.sbt. This means that if you upgrade a version of 
PIO you need to upgrade the dependencies in ALL your templates. So what you are 
calling a regression may just be that the UR needs to have upgraded 
dependencies.

I’d be interested in helping but let’s move back to PIO 0.14.0 first. When you 
build PIO what is the exact command line?
 

From: Michael Zhou 
Reply: user@predictionio.apache.org 
Date: March 20, 2019 at 12:05:26 PM
To: user@predictionio.apache.org 
Subject:  Re: Wrong FS: file:/home/aml/ur/engine.json expected: 
hdfs://localhost:9000  

Update: This seems like a regression introduced by pio 0.14.0. It worked after 
I downgraded to pio 0.13.0.
In particular, I suspect this diff 
https://github.com/apache/predictionio/pull/494/files#diff-167f4e9c1445b1f87aad1dead8da208c
 to have caused the issue.
Would be better if a committer can confirm this.

On Wed, Mar 20, 2019 at 10:57 AM Michael Zhou  
wrote:
I'm trying to run the integration test for the Universal Recommender. However, 
I've been getting this error when doing "pio deploy":

2019-03-20 17:44:32,856 ERROR akka.actor.OneForOneStrategy 
[pio-server-akka.actor.default-dispatcher-2] - Wrong FS: 
file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000
java.lang.IllegalArgumentException: Wrong FS: file:/home/aml/ur/engine.json, 
expected: hdfs://localhost:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
        at 
org.apache.predictionio.workflow.EngineServerPluginContext$.stringFromFile(EngineServerPluginContext.scala:85)
        at 
org.apache.predictionio.workflow.EngineServerPluginContext$.apply(EngineServerPluginContext.scala:58)
        at 
org.apache.predictionio.workflow.PredictionServer.(CreateServer.scala:424)
        at 
org.apache.predictionio.workflow.CreateServer$.createPredictionServerWithEngine(CreateServer.scala:237)
        at 
org.apache.predictionio.workflow.MasterActor.createServer(CreateServer.scala:389)
        at 
org.apache.predictionio.workflow.MasterActor$$anonfun$receive$1.applyOrElse(CreateServer.scala:317)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
        at 
org.apache.predictionio.workflow.MasterActor.aroundReceive(CreateServer.scala:259)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
        at akka.actor.ActorCell.invoke(ActorCell.scala:557)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My pio-env.sh is as follows:

SPARK_HOME=/usr/local/spark
ES_CONF_DIR=/usr/local/elasticsearch
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
HBASE_CONF_DIR=/usr/local/hbase/conf

PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=my-cluster
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=/models

PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase
PIO_STORAGE_SOURCES_HBASE_HOSTS=localhost

Any help would be appreciated.

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
Thanks, are you referring to
https://github.com/spark-jobserver/spark-jobserver or the undocumented REST
job server included in Spark?


From: Jason Nerothin  
Reply: Jason Nerothin  
Date: March 28, 2019 at 2:53:05 PM
To: Pat Ferrel  
Cc: Felix Cheung 
, Marcelo
Vanzin  , user
 
Subject:  Re: spark.submit.deployMode: cluster

Check out the Spark Jobs API... it sits behind a REST service...


On Thu, Mar 28, 2019 at 12:29 Pat Ferrel  wrote:

> ;-)
>
> Great idea. Can you suggest a project?
>
> Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
> launches trivially in test apps since most uses are as a lib.
>
>
> From: Felix Cheung  
> Reply: Felix Cheung 
> 
> Date: March 28, 2019 at 9:42:31 AM
> To: Pat Ferrel  , Marcelo
> Vanzin  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If anyone wants to improve docs please create a PR.
>
> lol
>
>
> But seriously you might want to explore other projects that manage job
> submission on top of spark instead of rolling your own with spark-submit.
>
>
> --
> *From:* Pat Ferrel 
> *Sent:* Tuesday, March 26, 2019 2:38 PM
> *To:* Marcelo Vanzin
> *Cc:* user
> *Subject:* Re: spark.submit.deployMode: cluster
>
> Ahh, thank you indeed!
>
> It would have saved us a lot of time if this had been documented. I know,
> OSS so contributions are welcome… I can also imagine your next comment; “If
> anyone wants to improve docs see the Apache contribution rules and create a
> PR.” or something like that.
>
> BTW the code where the context is known and can be used is what I’d call a
> Driver and since all code is copied to nodes and is know in jars, it was
> not obvious to us that this rule existed but it does make sense.
>
> We will need to refactor our code to use spark-submit it appears.
>
> Thanks again.
>
>
> From: Marcelo Vanzin  
> Reply: Marcelo Vanzin  
> Date: March 26, 2019 at 1:59:36 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If you're not using spark-submit, then that option does nothing.
>
> If by "context creation API" you mean "new SparkContext()" or an
> equivalent, then you're explicitly creating the driver inside your
> application.
>
> On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
> >
> > I have a server that starts a Spark job using the context creation API.
> It DOES NOY use spark-submit.
> >
> > I set spark.submit.deployMode = “cluster”
> >
> > In the GUI I see 2 workers with 2 executors. The link for running
> application “name” goes back to my server, the machine that launched the
> job.
> >
> > This is spark.submit.deployMode = “client” according to the docs. I set
> the Driver to run on the cluster but it runs on the client, ignoring the
> spark.submit.deployMode.
> >
> > Is this as expected? It is documented nowhere I can find.
> >
>
>
> --
> Marcelo
>
> --
Thanks,
Jason


Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
;-)

Great idea. Can you suggest a project?

Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
launches trivially in test apps since most uses are as a lib.


From: Felix Cheung  
Reply: Felix Cheung  
Date: March 28, 2019 at 9:42:31 AM
To: Pat Ferrel  , Marcelo
Vanzin  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If anyone wants to improve docs please create a PR.

lol


But seriously you might want to explore other projects that manage job
submission on top of spark instead of rolling your own with spark-submit.


--
*From:* Pat Ferrel 
*Sent:* Tuesday, March 26, 2019 2:38 PM
*To:* Marcelo Vanzin
*Cc:* user
*Subject:* Re: spark.submit.deployMode: cluster

Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know,
OSS so contributions are welcome… I can also imagine your next comment; “If
anyone wants to improve docs see the Apache contribution rules and create a
PR.” or something like that.

BTW the code where the context is known and can be used is what I’d call a
Driver and since all code is copied to nodes and is know in jars, it was
not obvious to us that this rule existed but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin  
Reply: Marcelo Vanzin  
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>
> I have a server that starts a Spark job using the context creation API.
It DOES NOY use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set
the Driver to run on the cluster but it runs on the client, ignoring the
spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


--
Marcelo


Re: Where does the Driver run?

2019-03-28 Thread Pat Ferrel
Thanks for the pointers. We’ll investigate.

We have been told that the “Driver” is run in the launching JVM because
deployMode = cluster is ignored if spark-submit is not used to launch.

You are saying that there is a loophole and if you use one of these client
classes there is a way to run part of the app on the cluster, and you have
seen this for Yarn?

To explain more, we create a SparkConf, and then a SparkContext, which we
pass around implicitly to functions that I would define as the Spark
Driver. It seems that if you do not use spark-submit, the entire launching
app/JVM process is considered the Driver AND is always run in client mode.

I hope your loophole pays off or we will have to do a major refactoring.


From: Jianneng Li  
Reply: Jianneng Li  
Date: March 28, 2019 at 2:03:47 AM
To: p...@occamsmachete.com  
Cc: andrew.m...@gmail.com  ,
user@spark.apache.org  ,
ak...@hacked.work  
Subject:  Re: Where does the Driver run?

Hi Pat,

The driver runs in the same JVM as SparkContext. You didn't go into detail
about how you "launch" the job (i.e. how the SparkContext is created), so
it's hard for me to guess where the driver is.

For reference, we've had success launching Spark programmatically to YARN
in cluster mode by creating a SparkConf like you did and using it to call
this class:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

I haven't tried this myself, but for standalone mode you might be able to
use this:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala

Lastly, you can always check where Spark processes run by executing ps on
the machine, i.e. `ps aux | grep java`.

Best,

Jianneng



*From:* Pat Ferrel 
*Date:* Monday, March 25, 2019 at 12:58 PM
*To:* Andrew Melo 
*Cc:* user , Akhil Das 
*Subject:* Re: Where does the Driver run?



I’m beginning to agree with you and find it rather surprising that this is
mentioned nowhere explicitly (maybe I missed?). It is possible to serialize
code to be executed in executors to various nodes. It also seems possible
to serialize the “driver” bits of code although I’m not sure how the
boundary would be defined. All code is in the jars we pass to Spark so
until now I did not question the docs.



I see no mention of a distinction between running a driver in spark-submit
vs being programmatically launched for any of the Spark Master types:
Standalone, Yarn, Mesos, k8s.



We are building a Machine Learning Server in OSS. It has pluggable Engines
for different algorithms. Some of these use Spark so it is highly desirable
to offload driver code to the cluster since we don’t want the diver
embedded in the Server process. The Driver portion of our training workflow
could be very large indeed and so could force the scaling of the server to
worst case.



I hope someone knows how to run “Driver” code on the cluster when our
server is launching the code. So deployMode = cluster, deploy method =
programatic launch.




From: Andrew Melo  
Reply: Andrew Melo  
Date: March 25, 2019 at 11:40:07 AM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?



Hi Pat,



Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible with spark-shell.
But it makes sense to me -- if you want Spark to run your application's
driver, you need to package it up and send it to the cluster manager. You
can't start spark one place and then later migrate it to the cluster. It's
also why you can't use spark-shell in cluster mode either, I think.



Cheers

Andrew



On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel  wrote:

In the GUI while the job is running the app-id link brings up logs to both
executors, The “name” link goes to 4040 of the machine that launched the
job but is not resolvable right now so the page is not shown. I’ll try the
netstat but the use of port 4040 was a good clue.



By what you say below this indicates the Driver is running on the launching
machine, the client to the Spark Cluster. This should be the case in
deployMode = client.



Can someone explain what us going on? The Evidence seems to say that
deployMode = cluster *does not work* as described unless you use
spark-submit (and I’m only guessing at that).



Further; if we don’t use spark-submit we can’t use deployMode = cluster ???




From: Akhil Das  
Reply: Akhil Das  
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?



There's also a driver ui (usually available on port 4040), after running
your code, I assume you are running it on your machine, visit
localhost:4040 and you will get the driver UI.



If you think the driver is running on your master/executor nodes, login to
those machines an

Re: spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know,
OSS so contributions are welcome… I can also imagine your next comment; “If
anyone wants to improve docs see the Apache contribution rules and create a
PR.” or something like that.

BTW the code where the context is known and can be used is what I'd call a
Driver, and since all code is copied to nodes and is known in jars, it was
not obvious to us that this rule existed, but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin  
Reply: Marcelo Vanzin  
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>
> I have a server that starts a Spark job using the context creation API.
It DOES NOY use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set
the Driver to run on the cluster but it runs on the client, ignoring the
spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


-- 
Marcelo


spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
I have a server that starts a Spark job using the context creation API. It
DOES NOT use spark-submit.

I set spark.submit.deployMode = “cluster”

In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.

This is spark.submit.deployMode = “client” according to the docs. I set the
Driver to run on the cluster but it runs on the client, *ignoring
the spark.submit.deployMode*.

Is this as expected? It is documented nowhere I can find.
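
(For the archive: as the replies above note, this setting is only honored when the
application is handed to spark-submit. A hedged sketch of the equivalent submission
against a standalone master, with placeholder class name, jar path, and memory sizes:

  ./bin/spark-submit \
    --master spark://master-address:7077 \
    --deploy-mode cluster \
    --class com.example.MyJob \
    --driver-memory 8g \
    --executor-memory 8g \
    /path/to/my-app-assembly.jar

Note that in standalone cluster mode the application jar must be reachable from the
workers, e.g. via HDFS or a path that exists on every node.)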


Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
I’m beginning to agree with you and find it rather surprising that this is
mentioned nowhere explicitly (maybe I missed?). It is possible to serialize
code to be executed in executors to various nodes. It also seems possible
to serialize the “driver” bits of code although I’m not sure how the
boundary would be defined. All code is in the jars we pass to Spark so
until now I did not question the docs.

I see no mention of a distinction between running a driver in spark-submit
vs being programmatically launched for any of the Spark Master types:
Standalone, Yarn, Mesos, k8s.

We are building a Machine Learning Server in OSS. It has pluggable Engines
for different algorithms. Some of these use Spark, so it is highly desirable
to offload driver code to the cluster since we don't want the driver
embedded in the Server process. The Driver portion of our training workflow
could be very large indeed and so could force the scaling of the server to
the worst case.

I hope someone knows how to run “Driver” code on the cluster when our
server is launching the code. So deployMode = cluster, deploy method =
programmatic launch.


From: Andrew Melo  
Reply: Andrew Melo  
Date: March 25, 2019 at 11:40:07 AM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible with spark-shell.
But it makes sense to me -- if you want Spark to run your application's
driver, you need to package it up and send it to the cluster manager. You
can't start spark one place and then later migrate it to the cluster. It's
also why you can't use spark-shell in cluster mode either, I think.

Cheers
Andrew

On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel  wrote:

> In the GUI while the job is running the app-id link brings up logs to both
> executors, The “name” link goes to 4040 of the machine that launched the
> job but is not resolvable right now so the page is not shown. I’ll try the
> netstat but the use of port 4040 was a good clue.
>
> By what you say below this indicates the Driver is running on the
> launching machine, the client to the Spark Cluster. This should be the case
> in deployMode = client.
>
> Can someone explain what us going on? The Evidence seems to say that
> deployMode = cluster *does not work* as described unless you use
> spark-submit (and I’m only guessing at that).
>
> Further; if we don’t use spark-submit we can’t use deployMode = cluster ???
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 24, 2019 at 7:45:07 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> There's also a driver ui (usually available on port 4040), after running
> your code, I assume you are running it on your machine, visit
> localhost:4040 and you will get the driver UI.
>
> If you think the driver is running on your master/executor nodes, login to
> those machines and do a
>
>netstat -napt | grep -I listen
>
> You will see the driver listening on 404x there, this won't be the case
> mostly as you are not doing Spark-submit or using the deployMode=cluster.
>
> On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:
>
>> Thanks, I have seen this many times in my research. Paraphrasing docs:
>> “in deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>>
>> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
>> with addresses that match slaves). When I look at memory usage while the
>> job runs I see virtually identical usage on the 2 Workers. This would
>> support your claim and contradict Spark docs for deployMode = cluster.
>>
>> The evidence seems to contradict the docs. I am now beginning to wonder
>> if the Driver only runs in the cluster if we use spark-submit
>>
>>
>>
>> From: Akhil Das  
>> Reply: Akhil Das  
>> Date: March 23, 2019 at 9:26:50 PM
>> To: Pat Ferrel  
>> Cc: user  
>> Subject:  Re: Where does the Driver run?
>>
>> If you are starting your "my-app" on your local machine, that's where the
>> driver is running.
>>
>>
>> Hope this helps.
>> <https://spark.apache.org/docs/latest/cluster-overview.html>
>>
>> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>>
>>> I have researched this for a significant amount of time and find answers
>>> that seem to be for a slightly different question than mine.
>>>
>>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>>> http://master-address:8080";, there are 2 idle 

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
In the GUI, while the job is running, the app-id link brings up logs for both
executors. The “name” link goes to port 4040 of the machine that launched the
job, but that host is not resolvable right now so the page is not shown. I'll
try netstat, but the use of port 4040 was a good clue.

By what you say below this indicates the Driver is running on the launching
machine, the client to the Spark Cluster. This should be the case in
deployMode = client.

Can someone explain what is going on? The evidence seems to say that
deployMode = cluster *does not work* as described unless you use
spark-submit (and I'm only guessing at that).

Further; if we don’t use spark-submit we can’t use deployMode = cluster ???


From: Akhil Das  
Reply: Akhil Das  
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?

There's also a driver ui (usually available on port 4040), after running
your code, I assume you are running it on your machine, visit
localhost:4040 and you will get the driver UI.

If you think the driver is running on your master/executor nodes, login to
those machines and do a

   netstat -napt | grep -I listen

You will see the driver listening on 404x there, this won't be the case
mostly as you are not doing Spark-submit or using the deployMode=cluster.

On Mon, 25 Mar 2019, 01:03 Pat Ferrel,  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations forks from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Drive part of the Job running?
>>
>> If is is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need so understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside and Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for you help!
>>
>
>
> --
> Cheers!
>
>


Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
2 Slaves, one of which is also Master.

Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.

The machines both have 60g of free memory (leaving about 4g for the master
process on Node 1). The only constraint to the Driver and Executors is
spark.driver.memory = spark.executor.memory = 60g

BTW I would expect this to create one Executor, one Driver, and the Master
on 2 Workers.




From: Andrew Melo  
Reply: Andrew Melo  
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations forks from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Drive part of the Job running?
>>
>> If is is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need so understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside and Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for you help!
>>
>
>
> --
> Cheers!
>
>




Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
2 Slaves, one of which is also Master.

Node 1 & 2 are slaves. Node 1 is where I run start-all.sh.

The machines both have 60g of free memory (leaving about 4g for the master
process on Node 1). The only constraint to the Driver and Executors is
spark.driver.memory = spark.executor.memory = 60g


From: Andrew Melo  
Reply: Andrew Melo  
Date: March 24, 2019 at 12:46:35 PM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?

Hi Pat,

On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel  wrote:

> Thanks, I have seen this many times in my research. Paraphrasing docs: “in
> deployMode ‘cluster' the Driver runs on a Worker in the cluster”
>
> When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
> with addresses that match slaves). When I look at memory usage while the
> job runs I see virtually identical usage on the 2 Workers. This would
> support your claim and contradict Spark docs for deployMode = cluster.
>
> The evidence seems to contradict the docs. I am now beginning to wonder if
> the Driver only runs in the cluster if we use spark-submit
>

Where/how are you starting "./sbin/start-master.sh"?

Cheers
Andrew


>
>
>
> From: Akhil Das  
> Reply: Akhil Das  
> Date: March 23, 2019 at 9:26:50 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: Where does the Driver run?
>
> If you are starting your "my-app" on your local machine, that's where the
> driver is running.
>
>
> Hope this helps.
> <https://spark.apache.org/docs/latest/cluster-overview.html>
>
> On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:
>
>> I have researched this for a significant amount of time and find answers
>> that seem to be for a slightly different question than mine.
>>
>> The Spark 2.3.3 cluster is running fine. I see the GUI on “
>> http://master-address:8080”, there are 2 idle workers, as configured.
>>
>> I have a Scala application that creates a context and starts execution of
>> a Job. I *do not use spark-submit*, I start the Job programmatically and
>> this is where many explanations forks from my question.
>>
>> In "my-app" I create a new SparkConf, with the following code (slightly
>> abbreviated):
>>
>>   conf.setAppName(“my-job")
>>   conf.setMaster(“spark://master-address:7077”)
>>   conf.set(“deployMode”, “cluster”)
>>   // other settings like driver and executor memory requests
>>   // the driver and executor memory requests are for all mem on the
>> slaves, more than
>>   // mem available on the launching machine with “my-app"
>>   val jars = listJars(“/path/to/lib")
>>   conf.setJars(jars)
>>   …
>>
>> When I launch the job I see 2 executors running on the 2 workers/slaves.
>> Everything seems to run fine and sometimes completes successfully. Frequent
>> failures are the reason for this question.
>>
>> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
>> taking all cluster resources. With a Yarn cluster I would expect the
>> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
>> Master, where is the Drive part of the Job running?
>>
>> If is is running in the Master, we are in trouble because I start the
>> Master on one of my 2 Workers sharing resources with one of the Executors.
>> Executor mem + driver mem is > available mem on a Worker. I can change this
>> but need so understand where the Driver part of the Spark Job runs. Is it
>> in the Spark Master, or inside and Executor, or ???
>>
>> The “Driver” creates and broadcasts some large data structures so the
>> need for an answer is more critical than with more typical tiny Drivers.
>>
>> Thanks for you help!
>>
>
>
> --
> Cheers!
>
>




Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
Thanks, I have seen this many times in my research. Paraphrasing docs: “in
deployMode ‘cluster' the Driver runs on a Worker in the cluster”

When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1
with addresses that match slaves). When I look at memory usage while the
job runs I see virtually identical usage on the 2 Workers. This would
support your claim and contradict Spark docs for deployMode = cluster.

The evidence seems to contradict the docs. I am now beginning to wonder if
the Driver only runs in the cluster if we use spark-submit



From: Akhil Das  
Reply: Akhil Das  
Date: March 23, 2019 at 9:26:50 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?

If you are starting your "my-app" on your local machine, that's where the
driver is running.


Hope this helps.
<https://spark.apache.org/docs/latest/cluster-overview.html>

On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on “
> http://master-address:8080”, there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*, I start the Job programmatically and
> this is where many explanations forks from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName(“my-job")
>   conf.setMaster(“spark://master-address:7077”)
>   conf.set(“deployMode”, “cluster”)
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with “my-app"
>   val jars = listJars(“/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
> taking all cluster resources. With a Yarn cluster I would expect the
> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
> Master, where is the Drive part of the Job running?
>
> If is is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need so understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside and Executor, or ???
>
> The “Driver” creates and broadcasts some large data structures so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for you help!
>


--
Cheers!




Re: [ERROR] [OneForOneStrategy] Wrong FS: file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000

2019-03-23 Thread Pat Ferrel
The UR for PredictionIO was last tested on PIO 0.12.1. We have moved it to
our next-gen ML Server called Harness.

If you want to use it with PIO 0.14.0 you should make sure the build.sbt of
the UR has the same deps as the build.sbt for PIO. In other words, make sure
the version numbers match. The UR in any flavor requires Elasticsearch 5.x;
ES 6.x will not work.

In due time we’ll go back and get it upgraded for PIO 0.14.0. If you are in
a hurry, you can try upgrading yourself or use PIO 0.12.1.

Not sure this is the root of your problem but will need to be addressed in
any case.


From: Michael Zhou  
Reply: Michael Zhou 

Date: March 20, 2019 at 12:46:50 PM
To: Pat Ferrel  
Cc: actionml-user 
, user@predictionio.apache.org
 
Subject:  Re: [ERROR] [OneForOneStrategy] Wrong FS:
file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000

Line 84 in EngineServerPluginContext.scala seems to change how the URI gets
generated:
https://github.com/apache/predictionio/pull/494/files#diff-167f4e9c1445b1f87aad1dead8da208cR84

On Wed, Mar 20, 2019 at 12:44 PM Pat Ferrel  wrote:

> What line in that PR? It is rather large.
>
> The URI is not correct. To use HDFS and a local filesystem you would use
> file:///home/aml/ur/engine.json
>
> I’m not sure where this URI is being generated but it is being generated
> incorrectly. You should be able to pass in the full URI explicitly to use
> either hdfs or file in `pio train` and/or `pio deploy`. The default in
> 0.13.0 was to look in the directory you are cded into and read engine.json
> so this may be a broken assumption in 0.14.0. Passing an explicit location
> may solve this.
>
> BTW if the assumption is causing a bad URI to be generated, it is a bug
> and should be filed.
>
> Anyone else know a better answer?
>
>
> From: Michael Zhou 
> 
> Reply: Michael Zhou 
> 
> Date: March 20, 2019 at 12:03:25 PM
> To: Pat Ferrel  
> Cc: actionml-user 
> 
> Subject:  Re: [ERROR] [OneForOneStrategy] Wrong FS:
> file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000
>
> It came from "pio deploy". See
> https://lists.apache.org/thread.html/a0d0b8241f4c24dbb75bdb7b1621c625710cab04f0a9b89c37842eed@
>  for
> more details.
> Update: Seems like a regression introduced by pio 0.14.0. It works after I
> downgraded to pio 0.13.0.
> In particular, I suspect this diff
> https://github.com/apache/predictionio/pull/494/files#diff-167f4e9c1445b1f87aad1dead8da208c
>  to
> have caused the issue.
>
> On Wed, Mar 20, 2019 at 11:59 AM Pat Ferrel  wrote:
>
>> What are you trying to do when you get this?
>>
>>
>> From: Michael Zhou 
>> 
>> Reply: Michael Zhou 
>> 
>> Date: March 20, 2019 at 6:48:23 AM
>> To: actionml-user 
>> 
>> Subject:  [ERROR] [OneForOneStrategy] Wrong FS:
>> file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000
>>
>> Got this error when running the universal recommender integration test.
>> Any idea what this is?
>> --
>> You received this message because you are subscribed to the Google Groups
>> "actionml-user" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to actionml-user+unsubscr...@googlegroups.com.
>> To post to this group, send email to actionml-u...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/actionml-user/3f18a233-9c65-48f0-9495-9591932dd7a5%40googlegroups.com
>> <https://groups.google.com/d/msgid/actionml-user/3f18a233-9c65-48f0-9495-9591932dd7a5%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "actionml-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to actionml-user+unsubscr...@googlegroups.com.
> To post to this group, send email to actionml-u...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/actionml-user/CALnv21h60voPeOUXZkmPmXwkoN12r9xcAj9vzmRtqchTf-pq0g%40mail.gmail.com
> <https://groups.google.com/d/msgid/actionml-user/CALnv21h60voPeOUXZkmPmXwkoN12r9xcAj9vzmRtqchTf-pq0g%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
>


Where does the Driver run?

2019-03-23 Thread Pat Ferrel
I have researched this for a significant amount of time and find answers
that seem to be for a slightly different question than mine.

The Spark 2.3.3 cluster is running fine. I see the GUI on “
http://master-address:8080”, there are 2 idle workers, as configured.

I have a Scala application that creates a context and starts execution of a
Job. I *do not use spark-submit*, I start the Job programmatically and this
is where many explanations forks from my question.

In "my-app" I create a new SparkConf, with the following code (slightly
abbreviated):

  conf.setAppName("my-job")
  conf.setMaster("spark://master-address:7077")
  conf.set("deployMode", "cluster")
  // other settings like driver and executor memory requests
  // the driver and executor memory requests are for all mem on the slaves,
  // more than mem available on the launching machine with "my-app"
  val jars = listJars("/path/to/lib")
  conf.setJars(jars)
  …

When I launch the job I see 2 executors running on the 2 workers/slaves.
Everything seems to run fine and sometimes completes successfully. Frequent
failures are the reason for this question.

Where is the Driver running? I don't see it in the GUI; I see 2 Executors
taking all cluster resources. With a Yarn cluster I would expect the
“Driver” to run on/in the Yarn Master, but I am using the Spark Standalone
Master, so where is the Driver part of the Job running?

If it is running in the Master, we are in trouble because I start the
Master on one of my 2 Workers, sharing resources with one of the Executors.
Executor mem + driver mem is > available mem on a Worker. I can change this
but need to understand where the Driver part of the Spark Job runs. Is it
in the Spark Master, or inside an Executor, or ???

The “Driver” creates and broadcasts some large data structures so the need
for an answer is more critical than with more typical tiny Drivers.

Thanks for your help!


Re: [ERROR] [OneForOneStrategy] Wrong FS: file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000

2019-03-20 Thread Pat Ferrel
What line in that PR? It is rather large.

The URI is not correct. To use HDFS and a local filesystem you would use
file:///home/aml/ur/engine.json

I’m not sure where this URI is being generated but it is being generated
incorrectly. You should be able to pass in the full URI explicitly to use
either hdfs or file in `pio train` and/or `pio deploy`. The default in
0.13.0 was to look in the directory you are cded into and read engine.json
so this may be a broken assumption in 0.14.0. Passing an explicit location
may solve this.

BTW if the assumption is causing a bad URI to be generated, it is a bug and
should be filed.

Anyone else know a better answer?


From: Michael Zhou  
Reply: Michael Zhou 

Date: March 20, 2019 at 12:03:25 PM
To: Pat Ferrel  
Cc: actionml-user 

Subject:  Re: [ERROR] [OneForOneStrategy] Wrong FS:
file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000

It came from "pio deploy". See
https://lists.apache.org/thread.html/a0d0b8241f4c24dbb75bdb7b1621c625710cab04f0a9b89c37842eed@
for
more details.
Update: Seems like a regression introduced by pio 0.14.0. It works after I
downgraded to pio 0.13.0.
In particular, I suspect this diff
https://github.com/apache/predictionio/pull/494/files#diff-167f4e9c1445b1f87aad1dead8da208c
to
have caused the issue.

On Wed, Mar 20, 2019 at 11:59 AM Pat Ferrel  wrote:

> What are you trying to do when you get this?
>
>
> From: Michael Zhou 
> 
> Reply: Michael Zhou 
> 
> Date: March 20, 2019 at 6:48:23 AM
> To: actionml-user 
> 
> Subject:  [ERROR] [OneForOneStrategy] Wrong FS:
> file:/home/aml/ur/engine.json, expected: hdfs://localhost:9000
>
> Got this error when running the universal recommender integration test.
> Any idea what this is?
> --
> You received this message because you are subscribed to the Google Groups
> "actionml-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to actionml-user+unsubscr...@googlegroups.com.
> To post to this group, send email to actionml-u...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/actionml-user/3f18a233-9c65-48f0-9495-9591932dd7a5%40googlegroups.com
> <https://groups.google.com/d/msgid/actionml-user/3f18a233-9c65-48f0-9495-9591932dd7a5%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
> --
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CALnv21h60voPeOUXZkmPmXwkoN12r9xcAj9vzmRtqchTf-pq0g%40mail.gmail.com
<https://groups.google.com/d/msgid/actionml-user/CALnv21h60voPeOUXZkmPmXwkoN12r9xcAj9vzmRtqchTf-pq0g%40mail.gmail.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.


Re: Spark with Kubernetes connecting to pod ID, not address

2019-02-13 Thread Pat Ferrel
solve(SimpleNameResolver.java:55)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at 
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at 
io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at 
io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more



From: Erik Erlandson 
Date: February 13, 2019 at 4:57:30 AM
To: Pat Ferrel 
Subject:  Re: Spark with Kubernetes connecting to pod id, not address  

Hi Pat,

I'd suggest visiting the big data Slack channel, it's a more Spark-oriented 
forum than kube-dev:
https://kubernetes.slack.com/messages/C0ELB338T/

Tentatively, I think you may want to submit in client mode (unless you are 
initiating your application from outside the kube cluster). When in client 
mode, you need to set up a headless service for the application driver pod that 
the executors can use to talk back to the driver.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode
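
For illustration, a rough Scala sketch of the client-mode settings this implies; the headless service name "spark-driver-svc" and the port are assumptions, not values from this thread:

```
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Client mode against the standalone master from the thread: the driver runs in
// the submitting pod and executors call back to it through a headless service.
val conf = new SparkConf()
  .setMaster("spark://spark-api:7077")
  .set("spark.submit.deployMode", "client")
  .set("spark.driver.host", "spark-driver-svc") // a resolvable DNS name, not the pod ID
  .set("spark.driver.port", "7078")             // fixed port exposed by that service
  .set("spark.driver.bindAddress", "0.0.0.0")   // bind inside the pod

val spark = SparkSession.builder().config(conf).appName("client-mode-sketch").getOrCreate()
```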

Cheers,
Erik


On Wed, Feb 13, 2019 at 1:55 AM Pat Ferrel  wrote:
We have a k8s deployment of several services including Apache Spark. All 
services seem to be operational. Our application connects to the Spark master 
to submit a job using the k8s DNS service for the cluster where the master is 
called spark-api so we use master=spark://spark-api:7077 and we use 
spark.submit.deployMode=cluster. We submit the job through the API not by the 
spark-submit script. 

This will run the "driver" and all "executors" on the cluster and this part 
seems to work but there is a callback to the launching code in our app from 
some Spark process. For some reason it is trying to connect to 
harness-64d97d6d6-4r4d8, which is the pod ID, not the k8s cluster IP or DNS.

How could this pod ID be getting into the system? Spark somehow seems to think 
it is the address of the service that called it. Needless to say any connection 
to the k8s pod ID fails and so does the job.

Any idea how Spark could think the pod ID is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same 
job executed with the above config tries to connect to the spurious pod ID.
--
You received this message because you are subscribed to the Google Groups 
"Kubernetes developer/contributor discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to kubernetes-dev+unsubscr...@googlegroups.com.
To post to this group, send email to kubernetes-...@googlegroups.com.
Visit this group at https://groups.google.com/group/kubernetes-dev.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/kubernetes-dev/36bb6bf8-1cac-428e-8ad7-3d639c90a86b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Spark with Kubernetes connecting to pod id, not address

2019-02-12 Thread Pat Ferrel


From: Pat Ferrel 
Reply: Pat Ferrel 
Date: February 12, 2019 at 5:40:41 PM
To: user@spark.apache.org 
Subject:  Spark with Kubernetes connecting to pod id, not address  

We have a k8s deployment of several services including Apache Spark. All 
services seem to be operational. Our application connects to the Spark master 
to submit a job using the k8s DNS service for the cluster where the master is 
called `spark-api` so we use `master=spark://spark-api:7077` and we use 
`spark.submit.deployMode=cluster`. We submit the job through the API not by the 
spark-submit script. 

This will run the "driver" and all "executors" on the cluster and this part 
seems to work but there is a callback to the launching code in our app from 
some Spark process. For some reason it is trying to connect to 
`harness-64d97d6d6-4r4d8`, which is the **pod ID**, not the k8s cluster IP or 
DNS.

How could this **pod ID** be getting into the system? Spark somehow seems to 
think it is the address of the service that called it. Needless to say any 
connection to the k8s pod ID fails and so does the job.

Any idea how Spark could think the **pod ID** is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same 
job executed with the above config tries to connect to the spurious pod ID.

BTW2 the pod launching the Spark job has the k8s DNS name "harness-api", not 
sure if this matters.

Thanks in advance


Re: [NOTICE] Mandatory migration of git repositories to gitbox.apache.org

2019-01-03 Thread Pat Ferrel
+1


From: Apache Mahout 
Reply: dev@mahout.apache.org 
Date: January 3, 2019 at 11:53:02 AM
To: dev 
Subject:  Re: [NOTICE] Mandatory migration of git repositories to 
gitbox.apache.org  

👍  

On Thu, 3 Jan 2019 13:51:40 -0600, dev wrote:  

Cool, just making sure we needed it.  

On Thu, Jan 3, 2019 at 1:48 PM Apache Mahout mahout.sh...@gmail.com wrote:  

Trevor, yes form the Notice, a consensus is necessary: • Ensure consensus  
on the move (a link to a lists.apache.org thread will suffice for us as  
evidence).  

On Thu, 3 Jan 2019 19:39:25 +, dev wrote:  

+1  

On 1/3/19, 2:31 PM, "Andrew Palumbo" ap@outlook.com wrote:  

I'd like to call a vote on moving to gitbox. Here's my +1  


Re: Multiple Engines on same Server leveraging same EventStore?

2018-12-01 Thread Pat Ferrel
IMO Multi-tenancy is a big missing feature in PIO. It is doable but clumsy.
To share a dataset is easy by using the same “appname” in your template to
draw data from. To create more than one PredictionServer is also possible
as 2 or more servers on different ports—see the `pio deploy` command params
for this. This will give you 3 servers, one for the EventsStore, which IS
multi-tenant, and 2 for PredictionServers. There will be 3 endpoints on the
same “machine”.

"app"s are really datasets in PIO and the EventServer can host several, IDed
by the appname and a key generated for access. It sounds like you want only
one dataset. PredictionServers manage one model per server and so are not
multi-tenant, but you can have more than one.

BTW what template Engine are you using?


From: Shane Johnson  
Reply: user@predictionio.apache.org 

Date: November 30, 2018 at 5:03:55 PM
To: user@predictionio.apache.org 

Subject:  Multiple Engines on same Server leveraging same EventStore?

Hi team. We are experimenting with using the same code base but querying
different events from the same event store. Has anyone trained and deployed
two different models leveraging the same event store and thus creating two
different endpoints for predictions on the same linux box.

Perhaps this is not aligned with the current architecture. What we are
trying to avoid is setting up a whole new set of infrastructure for a
different model that is using the same events.

Can someone remind me the purpose for setting up and defining different
Apps? Perhaps we can accomplish what we are trying to do with setting up
different apps.

Any ideas and experience are greatly appreciated.

*Shane Johnson | LIFT IQ*
*Founder | CEO*

*www.liftiq.com * or *sh...@liftiq.com
*
mobile: (801) 360-3350
LinkedIn   |  Twitter
 |  Facebook



Re: universal recommender version

2018-11-27 Thread Pat Ferrel
There is a tag v0.7.3 and yes it is in master:

https://github.com/actionml/universal-recommender/tree/v0.7.3


From: Marco Goldin 
Reply: user@predictionio.apache.org 
Date: November 20, 2018 at 6:56:39 AM
To: user@predictionio.apache.org , 
gyar...@griddynamics.com 
Subject:  Re: universal recommender version  

Hi George, the most recent stable release is 0.7.3, which is simply in the 
branch master; that's why you don't see a 0.7.3 tag.
Download master from Git and you'll be fine.
If you check the build.sbt in master you'll see specs as:

version := "0.7.3"
scalaVersion := "2.11.11"

that's the one you're looking for. 

Il giorno mar 20 nov 2018 alle ore 15:47 George Yarish 
 ha scritto:
Hi,

Can someone please advise what the most recent release version of the 
universal recommender is, and where its source code is located?

According to the GitHub project https://github.com/actionml/universal-recommender 
branches it is v0.8.0 (this branch looks a bit outdated),
but according to the documentation https://actionml.com/docs/ur_version_log
it is 0.7.3, which can't be found in the GitHub repo. 

Thanks,
George

[jira] [Commented] (PIO-31) Move from spray to akka-http in servers

2018-09-19 Thread Pat Ferrel (JIRA)


[ 
https://issues.apache.org/jira/browse/PIO-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621518#comment-16621518
 ] 

Pat Ferrel commented on PIO-31:
---

+1

> Move from spray to akka-http in servers
> ---
>
> Key: PIO-31
> URL: https://issues.apache.org/jira/browse/PIO-31
> Project: PredictionIO
>  Issue Type: Improvement
>  Components: Core
>Reporter: Marcin Ziemiński
>Priority: Major
>  Labels: gsoc2017, newbie
>
> On account of the death of spray for http and it being reborn as akka-http we 
> should update EventServer and Dashbord. It should be fairly simple, as 
> described in the following guide: 
> http://doc.akka.io/docs/akka/2.4/scala/http/migration-from-spray.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIO-31) Move from spray to akka-http in servers

2018-09-19 Thread Pat Ferrel (JIRA)


[ 
https://issues.apache.org/jira/browse/PIO-31?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621051#comment-16621051
 ] 

Pat Ferrel commented on PIO-31:
---

I assume we are talking about the Event Server and the query server both, and 
dropping Spray completely. +1 to that.

> Move from spray to akka-http in servers
> ---
>
> Key: PIO-31
> URL: https://issues.apache.org/jira/browse/PIO-31
> Project: PredictionIO
>  Issue Type: Improvement
>  Components: Core
>Reporter: Marcin Ziemiński
>Priority: Major
>  Labels: gsoc2017, newbie
>
> On account of the death of spray for http and it being reborn as akka-http we 
> should update EventServer and Dashbord. It should be fairly simple, as 
> described in the following guide: 
> http://doc.akka.io/docs/akka/2.4/scala/http/migration-from-spray.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Logrotate

2018-09-18 Thread Pat Ferrel
You are probably seeing lots of “info” logs so the best way is to reduce
them to at least warning level by changing the PIO logger properties, then
use some external tool for log rotation (Google will help here).

Info level is for when you are setting things up. It helps us give you
advice if something is not working well. Once you are past that only
warnings or errors make sense to watch for.


From: József Hábit  
Reply: user@predictionio.apache.org 

Date: September 10, 2018 at 6:22:27 AM
To: user@predictionio.apache.org 

Subject:  Logrotate

Hello,

what is the proper way to rotate the log files generated by Pio/UR?

Thanks in advance!
Jozsef Habit


Re: PIO train issue

2018-08-29 Thread Pat Ferrel
Assuming you are using the UR…

I don’t know how many times this has been caused by a misspelling of
eventNames in engine.json but assume you have checked that.

The fail-safe way to check is to `pio export` your data and check it
against your engine.json.

BTW `pio status` does not even try to check all services. Run `pio app
list` to see if the right appnames (dataset names) are in the EventServer,
which will check hbase, hdfs, and elasticsearch. Then check to see you have
Spark, Elasticsearch, and HDFS running—if you have set them to run in remote
standalone mode.


From: bala vivek  
Date: August 29, 2018 at 8:43:05 AM
To: actionml-user 
, user@predictionio.apache.org
 
Subject:  PIO train issue

Hi PIO users,

I've been using PIO version 0.10 for a long time. I recently moved the
working setup of PIO to CentOS from Ubuntu and it seems to work fine: when I
check the PIO status, it shows all the services are up and working.
But while doing a PIO train I see a "Data set is empty" error. I have
cross-checked by scanning the HBase tables manually, and the records are
present inside the event table. To cross-verify I tried a curl with the
access key for a particular app, and the response to it is "HTTP 200 OK", so
it's confirmed that the particular app has the data.
But if I run the command pio train manually it's not training the
model. The engine file has no issues as the appname is also given correctly.
It always shows "Data set is empty". This same setup works fine with
Ubuntu 14. I haven't made any config changes to make it run on
CentOS.

Let me know what could be the reason for this issue, as the data is present
in HBase but the PIO engine fails to detect it.

Thanks
Bala
--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CABdDaRqqpGcPb%3DZD-ms6i5OzY8_JdLQ3YbbcapS_dS8TxkGidQ%40mail.gmail.com

.
For more options, visit https://groups.google.com/d/optout.


Re: Distinct recommendation from "random" backfill?

2018-08-28 Thread Pat Ferrel
The random ranking is assigned after every `pio train` so if you have not
trained in-between, they will be the same. Random is not really meant to do
what you are using it for, it is meant to surface items with no data—no
primary events. This will allow some to get real events and be recommended
based on those events the next time you train. It is meant to fill in when you ask for
20 recs but there are only 10 things to be recommended. Proper use of this
with frequent training will cause items with no data to be purchased and to
therefore get data. The reason rankings are assigned at train time is that
this is the only way to get all of the business rules applied to the query
as well as a random ranking. In other words the ranking must be built into
the model with `pio train`

If you want to recommend random items each time you query, create a list of
item ids from your catalog and return some random sample each query
yourself. This should be nearly trivial.
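
For example, a trivial Scala sketch of that application-side approach (the catalog here is made up):

```
import scala.util.Random

// Made-up catalog of item ids; in practice this comes from your own item store.
val catalog: Vector[String] = Vector("6825991", "682599", "8083748", "7942100", "8016271")

// Return `num` distinct random item ids, reshuffled on every call, so each
// query gets a different random list.
def randomRecs(num: Int): Seq[String] = Random.shuffle(catalog).take(num)

println(randomRecs(3))
println(randomRecs(3)) // very likely different from the first call
```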


From: Brian Chiu  
Reply: user@predictionio.apache.org 

Date: August 28, 2018 at 1:51:24 AM
To: u...@predictionio.incubator.apache.org


Subject:  Distinct recommendation from "random" backfill?

Dear pio developers and users:

I have been using predictionIO and Universal Recommender for a while.
In the universal recommender engine.json, there is a configuration field
`rankings`, and one of the options is random. Initially I thought it
would give each item without any related event some random recommended
items, and each of the recommendation lists would be different. However, it
turns out all of the random recommended item lists are the same. For
example, if both item "6825991" and item "682599" have no events
during training, the result will be

```
$ curl -H "Content-Type: application/json" -d '{ "item": "6825991" }'
http://localhost:8000/queries.json
{"itemScores":[{"item":"8083748","score":0.0},{"item":"7942100","score":0.0},{"item":"8016271","score":0.0},{"item":"7731061","score":0.0},{"item":"8002458","score":0.0},{"item":"7763317","score":0.0},{"item":"8141119","score":0.0},{"item":"8080694","score":0.0},{"item":"7994844","score":0.0},{"item":"7951667","score":0.0},{"item":"7948453","score":0.0},{"item":"8148479","score":0.0},{"item":"8113083","score":0.0},{"item":"8041124","score":0.0},{"item":"8004823","score":0.0},{"item":"8126058","score":0.0},{"item":"8093042","score":0.0},{"item":"8064036","score":0.0},{"item":"8022524","score":0.0},{"item":"7977131","score":0.0}]}

$ curl -H "Content-Type: application/json" -d '{ "item": "682599" }'
http://localhost:8000/queries.json
{"itemScores":[{"item":"8083748","score":0.0},{"item":"7942100","score":0.0},{"item":"8016271","score":0.0},{"item":"7731061","score":0.0},{"item":"8002458","score":0.0},{"item":"7763317","score":0.0},{"item":"8141119","score":0.0},{"item":"8080694","score":0.0},{"item":"7994844","score":0.0},{"item":"7951667","score":0.0},{"item":"7948453","score":0.0},{"item":"8148479","score":0.0},{"item":"8113083","score":0.0},{"item":"8041124","score":0.0},{"item":"8004823","score":0.0},{"item":"8126058","score":0.0},{"item":"8093042","score":0.0},{"item":"8064036","score":0.0},{"item":"8022524","score":0.0},{"item":"7977131","score":0.0}]}

```

But on my webpage, whenever users click on these products without
events, they will see exactly the same recommended items, which
looks boring. Is there any way to give each item a distinct random list?
Even if it is generated dynamically, that is OK. If you have any other
alternative, please also tell me.

Thanks all developers!

Best Regards,
Brian


Why are these going to the incubator address?

2018-08-24 Thread Pat Ferrel
Is it necessary that these commits go to the incubator list? Are
notifications set up wrong?


From: git-site-r...@apache.org 

Reply: dev@predictionio.apache.org 

Date: August 24, 2018 at 10:33:34 AM
To: comm...@predictionio.incubator.apache.org


Subject:  [7/7] predictionio-site git commit: Documentation based on <

comm...@predictionio.incubator.apache.org>

apache/predictionio#fc481c9c82989e1b484ea5bfeb540bc96758bed5

Documentation based on
apache/predictionio#fc481c9c82989e1b484ea5bfeb540bc96758bed5


Project: http://git-wip-us.apache.org/repos/asf/predictionio-site/repo
Commit:
http://git-wip-us.apache.org/repos/asf/predictionio-site/commit/107116dc
Tree: http://git-wip-us.apache.org/repos/asf/predictionio-site/tree/107116dc
Diff: http://git-wip-us.apache.org/repos/asf/predictionio-site/diff/107116dc

Branch: refs/heads/asf-site
Commit: 107116dc22c9d6c5467ba0c1506c61b6a9e10e32
Parents: c17b960
Author: jenkins 
Authored: Fri Aug 24 17:33:22 2018 +
Committer: jenkins 
Committed: Fri Aug 24 17:33:22 2018 +

--
datacollection/batchimport/index.html | 4 +-
datacollection/channel/index.html | 6 +-
datacollection/eventapi/index.html | 20 +-
gallery/template-gallery/index.html | 2 +-
gallery/templates.yaml | 14 +
samples/tabs/index.html | 18 +-
sitemap.xml | 260 +--
templates/classification/quickstart/index.html | 30 +--
.../complementarypurchase/quickstart/index.html | 20 +-
.../quickstart/index.html | 60 ++---
.../quickstart/index.html | 60 ++---
templates/leadscoring/quickstart/index.html | 30 +--
templates/productranking/quickstart/index.html | 40 +--
templates/recommendation/quickstart/index.html | 30 +--
templates/similarproduct/quickstart/index.html | 40 +--
templates/vanilla/quickstart/index.html | 10 +-
16 files changed, 329 insertions(+), 315 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/predictionio-site/blob/107116dc/datacollection/batchimport/index.html
--
diff --git a/datacollection/batchimport/index.html
b/datacollection/batchimport/index.html
index b4c83b7..4b739a3 100644
--- a/datacollection/batchimport/index.html
+++ b/datacollection/batchimport/index.html
@@ -7,7 +7,7 @@
{"event":"rate","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"2","properties":{"rating":1.0},"eventTime":"2014-11-21T01:04:14.729Z"}
{"event":"buy","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"7","eventTime":"2014-11-21T01:04:14.735Z"}
{"event":"buy","entityType":"user","entityId":"3","targetEntityType":"item","targetEntityId":"8","eventTime":"2014-11-21T01:04:14.741Z"}
-  Please make sure your import file does not contain any empty
lines. Empty lines will be treated as a null object and will return an
error during import.Use SDK to Prepare Batch Input FileSome of
the Apache PredictionIO SDKs also provides FileExporter client. You may use
them to prepare the JSON file as described above. The FileExporter creates
event in the same way as EventClient except that the events are written to
a JSON file instead of being sent to EventSever. The written JSON file can
then be used by batch import. 
PHP
SDK Python SDK 
Ruby SDK Java SDK 
 (coming soon) 1
+  Please make sure your import file does not contain any empty
lines. Empty lines will be treated as a null object and will return an
error during import.Use SDK to Prepare Batch Input FileSome of
the Apache PredictionIO SDKs also provides FileExporter client. You may use
them to prepare the JSON file as described above. The FileExporter creates
event in the same way as EventClient except that the events are written to
a JSON file instead of being sent to EventSever. The written JSON file can
then be used by batch import. 
PHP
SDK Python SDK 
Ruby SDK Java SDK 
 (coming soon) 1
2
3
4
@@ -58,7 +58,7 @@
# close the FileExporter when finish writing all
events
exporter.close()

- (coming
soon)   
 1 (coming
soon)
+ (coming
soon)   
 1 (coming
soon)
 Import Events
from Input FileImporting events from a file can be done easily
using the command line interface. Assuming that pio be in your
search path, your App ID be 123, and the input file
my_events.json be in your current working directory:1$
pio import --appid 123 --input my_events.json
  After a brief while, the tool
should return to the console without any error. Congratulations! You have
successfully imported your events.CommunityDownloadDocsGitHubSubscribe to User
Mailing ListStackoverflowContributeContributeSource CodeBug
Trackermailto:dev-subscr...@predictionio.apache.org";
target="blank">Subscribe to Development Mailing
ListApache PredictionIO, PredictionIO, Apache, the
Apache feather logo, and the Apache PredictionIO project logo are either
registered trademarks or trademarks of The Apache Software Founda

Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
Oh and no it does not need a new context for every query, only for the
deploy.


From: Pat Ferrel  
Date: August 7, 2018 at 10:00:49 AM
To: Ulavapalle Meghamala 

Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

The answers to your question illustrate why IMHO it is bad to have Spark
required for predictions.

Any of the MLlib ALS recommenders use Spark to predict and so run Spark
during the time they are deployed. They can use one machine or use the
entire cluster. This is one case where using the cluster slows down
predictions since part of the model may be spread across nodes. Spark is
not designed to scale in this manner for real-time queries but I believe
those are your options out of the box for the ALS recommenders.

To be both fast and scalable you would load the model entirely into memory
on one machine for fast queries then spread queries across many identical
machines to scale load. I don’t think any templates do this—it requires a
load balancer at very least, not to mention custom deployment code that
interferes with using the same machines for training.

The UR loads the model into Elasticsearch for serving independently
scalable queries.

I always advise you keep Spark out of serving for the reasons mentioned
above.


From: Ulavapalle Meghamala 

Date: August 7, 2018 at 9:27:46 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

Thanks Pat for getting back.

Are there any PredictionIO models/templates which really use Spark in "pio
deploy" ? (not just loading the Spark Context for loading the 'pio deploy'
driver and then dropping the Spark Context), but a running Spark Context
throughout the Prediction Server life cycle? Or how does PredictionIO
handle this case ? Does it create a new Spark Context every time a
prediction has to be done ?

Also, in the production deployments (where Spark is not really used), how do
you scale Prediction Server ? Do you just deploy same model on multiple
machines and have a LB/HA Proxy to handle requests?

Thanks,
Megha



On Tue, Aug 7, 2018 at 9:35 PM, Pat Ferrel  wrote:

> PIO is designed to use Spark in train and deploy. But the Universal
> Recommender removes the need for Spark to make predictions. This IMO is a
> key to use Spark well—remove it from serving results. PIO creates a Spark
> context to launch the `pio deploy' driver but Spark is never used and the
> context is dropped.
>
> The UR also does not need to be re-deployed after each train. It hot swaps
> the new model into use outside of Spark and so if you never shut down the
>  PredictionServer you never need to re-deploy.
>
> The confusion comes from reading Apache PIO docs which may not do things
> this way—don't read them. Each template defines its own requirements. To
> use the UR, stick with its documentation.
>
> That means Spark is used to “train” only and you never re-deploy. Deploy
> once—train periodically.
>
>
> From: Ulavapalle Meghamala 
> 
> Reply: user@predictionio.apache.org 
> 
> Date: August 7, 2018 at 4:13:39 AM
> To: user@predictionio.apache.org 
> 
> Subject:  PredictionIO spark deployment in Production
>
> Hi,
>
> Are there any templates in PredictionIO where "spark" is used even in "pio
> deploy" ? How are you handling such cases ? Will you create a spark context
> every time you run a prediction ?
>
> I have gone through then documentation here: http://actionml.com/docs/
> single_driver_machine. But, it only talks about "pio train". Please guide
> me to any documentation that is available on the "pio deploy" with spark ?
>
> Thanks,
> Megha
>
>
--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CAOtZQD-KRpqz-Po6%3D%2BL2WhUh7kKa64yGihP44iSNdqb9nFE0Dg%40mail.gmail.com
<https://groups.google.com/d/msgid/actionml-user/CAOtZQD-KRpqz-Po6%3D%2BL2WhUh7kKa64yGihP44iSNdqb9nFE0Dg%40mail.gmail.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
The answers to your question illustrate why IMHO it is bad to have Spark
required for predictions.

Any of the MLlib ALS recommenders use Spark to predict and so run Spark
during the time they are deployed. They can use one machine or use the
entire cluster. This is one case where using the cluster slows down
predictions since part of the model may be spread across nodes. Spark is
not designed to scale in this manner for real-time queries but I believe
those are your options out of the box for the ALS recommenders.

To be both fast and scalable you would load the model entirely into memory
on one machine for fast queries then spread queries across many identical
machines to scale load. I don’t think any templates do this—it requires a
load balancer at very least, not to mention custom deployment code that
interferes with using the same machines for training.

The UR loads the model into Elasticsearch for serving independently
scalable queries.

I always advise you keep Spark out of serving for the reasons mentioned
above.


From: Ulavapalle Meghamala 

Date: August 7, 2018 at 9:27:46 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

Thanks Pat for getting back.

Are there any PredictionIO models/templates which really use Spark in "pio
deploy" ? (not just loading the Spark Context for loading the 'pio deploy'
driver and then dropping the Spark Context), but a running Spark Context
throughout the Prediction Server life cycle? Or how does PredictionIO
handle this case ? Does it create a new Spark Context every time a
prediction has to be done ?

Also, in the production deployments (where Spark is not really used), how do
you scale Prediction Server ? Do you just deploy same model on multiple
machines and have a LB/HA Proxy to handle requests?

Thanks,
Megha



On Tue, Aug 7, 2018 at 9:35 PM, Pat Ferrel  wrote:

> PIO is designed to use Spark in train and deploy. But the Universal
> Recommender removes the need for Spark to make predictions. This IMO is a
> key to use Spark well—remove it from serving results. PIO creates a Spark
> context to launch the `pio deploy' driver but Spark is never used and the
> context is dropped.
>
> The UR also does not need to be re-deployed after each train. It hot swaps
> the new model into use outside of Spark and so if you never shut down the
>  PredictionServer you never need to re-deploy.
>
> The confusion comes from reading Apache PIO docs which may not do things
> this way—don't read them. Each template defines its own requirements. To
> use the UR, stick with its documentation.
>
> That means Spark is used to “train” only and you never re-deploy. Deploy
> once—train periodically.
>
>
> From: Ulavapalle Meghamala 
> 
> Reply: user@predictionio.apache.org 
> 
> Date: August 7, 2018 at 4:13:39 AM
> To: user@predictionio.apache.org 
> 
> Subject:  PredictionIO spark deployment in Production
>
> Hi,
>
> Are there any templates in PredictionIO where "spark" is used even in "pio
> deploy" ? How are you handling such cases ? Will you create a spark context
> every time you run a prediction ?
>
> I have gone through then documentation here: http://actionml.com/docs/
> single_driver_machine. But, it only talks about "pio train". Please guide
> me to any documentation that is available on the "pio deploy" with spark ?
>
> Thanks,
> Megha
>
>


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
PIO is designed to use Spark in train and deploy. But the Universal
Recommender removes the need for Spark to make predictions. This IMO is a
key to use Spark well—remove it from serving results. PIO creates a Spark
context to launch the `pio deploy' driver but Spark is never used and the
context is dropped.

The UR also does not need to be re-deployed after each train. It hot swaps
the new model into use outside of Spark and so if you never shut down the
 PredictionServer you never need to re-deploy.

The confusion comes from reading Apache PIO docs which may not do things
this way—don't read them. Each template defines its own requirements. To
use the UR, stick with its documentation.

That means Spark is used to “train” only and you never re-deploy. Deploy
once—train periodically.


From: Ulavapalle Meghamala 

Reply: user@predictionio.apache.org 

Date: August 7, 2018 at 4:13:39 AM
To: user@predictionio.apache.org 

Subject:  PredictionIO spark deployment in Production

Hi,

Are there any templates in PredictionIO where "spark" is used even in "pio
deploy" ? How are you handling such cases ? Will you create a spark context
every time you run a prediction ?

I have gone through then documentation here:
http://actionml.com/docs/single_driver_machine. But, it only talks about
"pio train". Please guide me to any documentation that is available on the
"pio deploy" with spark ?

Thanks,
Megha


Re: 2 pio servers with 1 event server

2018-08-02 Thread Pat Ferrel
Check to see if you have the same indexName in both engine.json files. This 
will cause the 2 engines to use the same index in Elasticsearch, so the newest 
train will overwrite the previous one. To keep them separate use separate 
indexNames.


From: Sami Serbey 
Reply: user@predictionio.apache.org 
Date: August 2, 2018 at 1:02:48 PM
To: Pat Ferrel , user@predictionio.apache.org 

Subject:  Re: 2 pio servers with 1 event server  

I am using the universal recommender template

Get Outlook for iOS
From: Pat Ferrel 
Sent: Thursday, August 2, 2018 7:59:20 PM
To: user@predictionio.apache.org; Sami Serbey
Subject: Re: 2 pio servers with 1 event server
 
What template?


From: Sami Serbey 
Reply: user@predictionio.apache.org 
Date: August 2, 2018 at 9:08:05 AM
To: user@predictionio.apache.org 
Subject:  2 pio servers with 1 event server

Greetings,

I am trying to run 2 pio servers on different ports where each server has its 
own app. When I deploy the first server, I get the results I want for 
prediction on that server. However, after deploying the second server on a 
different port, the results from the first server get changed. Any idea on how 
I can fix that?

Or is there some kind of procedure I should follow to be able to run 2 
prediction servers from 2 different apps but share the same template?

Regards,
Sami serbey

Re: Straw poll: deprecating Scala 2.10 and Spark 1.x support

2018-08-02 Thread Pat Ferrel
+1


From: takako shimamoto  
Reply: u...@predictionio.apache.org 

Date: August 2, 2018 at 2:55:49 AM
To: dev@predictionio.apache.org 
, u...@predictionio.apache.org
 
Subject:  Straw poll: deprecating Scala 2.10 and Spark 1.x support

Hi all,

We're considering deprecating Scala 2.10 and Spark 1.x as of
the next release. Our intent is that using deprecated versions
can generate warnings, but that it should still work.

Nothing is concrete about actual removal of support at the moment, but
moving forward, use of Scala 2.11 and Spark 2.x will be recommended.
I think it's time to plan to deprecate 2.10 support, especially
with 2.12 coming soon.

This has an impact on some users, so if you see any issues with this,
please let us know as soon as possible.

Regards,
Takako


Re: Straw poll: deprecating Scala 2.10 and Spark 1.x support

2018-08-02 Thread Pat Ferrel
+1


From: takako shimamoto  
Reply: user@predictionio.apache.org 

Date: August 2, 2018 at 2:55:49 AM
To: d...@predictionio.apache.org 
, user@predictionio.apache.org
 
Subject:  Straw poll: deprecating Scala 2.10 and Spark 1.x support

Hi all,

We're considering deprecating Scala 2.10 and Spark 1.x as of
the next release. Our intent is that using deprecated versions
can generate warnings, but that it should still work.

Nothing is concrete about actual removal of support at the moment, but
moving forward, use of Scala 2.11 and Spark 2.x will be recommended.
I think it's time to plan to deprecate 2.10 support, especially
with 2.12 coming soon.

This has an impact on some users, so if you see any issues with this,
please let us know as soon as possible.

Regards,
Takako


Re: 2 pio servers with 1 event server

2018-08-02 Thread Pat Ferrel
What template?


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: August 2, 2018 at 9:08:05 AM
To: user@predictionio.apache.org 

Subject:  2 pio servers with 1 event server

Greetings,

I am trying to run 2 pio servers on different ports where each server has
its own app. When I deploy the first server, I get the results I want for
prediction on that server. However, after deploying the second server on a
different port, the results from the first server get changed. Any idea on
how I can fix that?

Or is there some kind of procedure I should follow to be able to run 2
prediction servers from 2 different apps but share the same template?

Regards,
Sami serbey


Re: Increase heap size for pio deploy

2018-07-26 Thread Pat Ferrel
Depending on the template you are using the driver and executor memory will 
increase as your data increases. Spark keeps data in memory to get the speed 
increase over something like Hadoop MapReduce by using memory instead of temp 
files. This yields orders of magnitude speed increases but does mean with big 
data PIO and Spark (more specifically) is a memory hog—by design. The memory 
requirements will be far larger than you are used to with DBs or other 
services. The good thing about Spark is that the data can be spread over 
members of a cluster so if you need a 100g data structure in-memory you can put 
10g on each executor—or something like this and the data structures may only be 
loosely linked to the sixe of your input.

TLDR; Experiment to find the driver and executor memory required to run train 
and deploy of your template. For instance the Universal Recommender will need a 
lot of train memory but almost no deploy memory because it does not use Spark 
for deploy. Other templates may need more memory for deploy. Unfortunately the 
template and algorithm greatly affect these numbers and there is generally no 
way but experiment to determine them.
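
As a rough illustration only (the numbers below are placeholders, not recommendations), the experiment usually comes down to adjusting a couple of Spark properties until train and deploy fit:

```
import org.apache.spark.SparkConf

// Placeholder values; the right numbers depend entirely on your data and template.
// Driver memory must be set before the driver JVM starts (CLI flag or
// spark-defaults.conf); it is shown here only to name the property involved.
val conf = new SparkConf()
  .set("spark.driver.memory", "10g")   // the property behind `--driver-memory 10g`
  .set("spark.executor.memory", "8g")  // per-executor heap used during training
```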


From: George Yarish 
Reply: user@predictionio.apache.org 
Date: July 26, 2018 at 5:51:44 AM
To: user@predictionio.apache.org 
Subject:  Re: Increase heap size for pio deploy  

ok solved by --driver-memory 10g

Sorry for bothering,
George


On Thu, Jul 26, 2018 at 3:25 PM, George Yarish  wrote:
Hi!

Can someone please advise me how to set up Java heap size properties for the pio 
deploy process?

My current issue is "[ERROR] [LocalFSModels]  Java heap space" during pio 
deploy.
My model takes ~350mb on localfs in model store. 

I was trying something like "JAVA_OPTS=-Xmx4g pio deploy" but it doesn't work for me. 

Thanks,
George



Re: [actionml/universal-recommender] Boosting categories only shows one category type (#55)

2018-07-06 Thread Pat Ferrel
Please read the docs. There is no need to $set users since they are
attached to usage events and can be detected automatically. In fact
"$set"ting them is ignored. There are no properties of users that are not
calculated based on named "indicators", which can be profile-type things.

For this application I'd ask myself: what do you want the user to do? Do you
want them to view a house listing or schedule a visit? Clearly you want
them to rent, but there will only be one rent per user so it is not very
strong for indicating taste.

If you have something like 10 visits per user on average you may have
enough to use as the primary indicator, since visits are intuitively closer to
"rent". Page views, which may be 10x - 100x more frequent than visits, are your
last resort. But if page views are the best "primary" indicator you have,
still use visits and rents as secondary. Users have many motivations for
looking at listings: they may only be looking at higher priced units than
they have any intent of renting, or comparing something they would not rent
to what they would. Therefore page views are somewhat removed from the pure
user intent of every "rent", but they may be the best indicator you have.

Also consider using things like search terms as secondary indicators.

Then send primary and all secondary events with whatever ids correspond to
the event type. User profile data is harder to use and not as useful as
people think, but is still encoded as an indicator with a different
"eventName". Something like "location" could be used and might have an id
like a postal code—something that is large enough to include other users
but small enough to be exclusive also.

The above will give you several “usage events” with one primary.

Business rules—which are used to restrict results—require you to $set
properties for every rental. So anything in the fields part of a query must
correspond to a possible property of items. Those look ok below.

Please use the Google group for questions. Github is for bug reports.


From: Amit Assaraf  
Reply: actionml/universal-recommender


Date: July 6, 2018 at 10:11:10 AM
To: actionml/universal-recommender


Cc: Subscribed 

Subject:  [actionml/universal-recommender] Boosting categories only shows
one category type (#55)

I have an app that uses the Universal Recommender. The app is for
finding a house to rent.
I want to recommend houses to users based on houses they have already viewed or
scheduled a tour on.

I added all the users using the $set event.
I added all (96,676) of the houses in the app like so:

predictionio_client.create_event(
event="$set",
entity_type="item",
entity_id=listing.meta.id,
properties={
  "property_type": ["villa"] # There are many
types of property_types such as "apartment"
}
)

And I add the events of the house view & schedule like so:

predictionio_client.create_event(
event="view",
entity_type="user",
entity_id=request.user.username,
target_entity_type="item",
target_entity_id=listing.meta.id
)

Now I want to get predictions for my users based on the property_types they
like.
So I send a prediction query boosting the property_types they like using
Business Rules like so:

{
'fields': [
{
 'bias': 1.05,
 'values': ['single_family_home', 'private_house',
'villa', 'cottage'],
 'name': 'property_type'
}
 ],
 'num': 15,
 'user': 'amit70'
}

I would then expect to get recommendations of different
types such as private_house or villa or cottage. But for some weird reason,
while having over 95,000 houses of different property types, I only get
recommendations of *ONE* single type (in this case villa), and if I remove
it from the list it just recommends 10 houses of ONE different type.
This is the response of the query:

{
"itemScores": [
{
"item": "56.39233,-4.11707|villa|0",
"score": 9.42542
},
{
"item": "52.3288,1.68312|villa|0",
"score": 9.42542
},
{
"item": "55.898878,-4.617019|villa|0",
"score": 8.531346
},
{
"item": "55.90713,-3.27626|villa|0",
"score": 8.531346
},
.

I can't understand why this is happening. The Elasticsearch query this
translates to is this:
GET /recommender/_search

{
  "from": 0,
  "size": 15,
  "query": {
"bool": {
  "should": [
{
  "terms": {
"schedule": [
  "32.1439352176,34.833260278|private_house|0",
  "31.7848439,35.2047335|apartment_for_sale|0"
]
  }
},
{
  "terms": {
"view": [
  "32.0734919,34.7722675|garden_apartment|0",
  "32.1375986782,34.8415740159|apartment|0",
  "32.0774,

Re: Digging into UR algorithm

2018-07-02 Thread Pat Ferrel
The CCO algorithm tests for correlation with a statistic called the Log
Likelihood Ratio (LLR). This compares the relative frequencies of 4 different
counts: 2 having to do with the entire dataset and 2 having to do with the 2
events being compared for correlation. Popularity is normalized out of this
comparison but does play a small indirect part in having enough data to
make better guesses about correlation.
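
For reference, a minimal Scala sketch of that statistic in the form used by Mahout's LogLikelihood, where k11 counts users with both events, k12 and k21 users with only one of them, and k22 the rest of the data:

```
// Dunning's log-likelihood ratio over a 2x2 contingency table of counts.
object Llr {
  private def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // Unnormalized entropy of a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // Guard against round-off producing a tiny negative value.
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}

println(Llr.logLikelihoodRatio(100, 10, 10, 10000)) // strong co-occurrence: large score
println(Llr.logLikelihoodRatio(1, 100, 100, 10000)) // roughly chance level: near zero
```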

Also remember that the secondary event may have item-ids that are not part
of the primary event. For instance, if you have good search data then one
(of several) secondary events might be (user-id, "searched-for",
search-term). This has proven to be quite useful as a secondary event in at
least one dataset I've seen.


From: Pat Ferrel  
Reply: Pat Ferrel  
Date: July 2, 2018 at 12:18:16 PM
To: user@predictionio.apache.org 
, Sami Serbey 

Cc: actionml-user 

Subject:  Re: Digging into UR algorithm

The only requirement is that someone performed the primary event on A and
the secondary event is correlated to that primary event.

the UR can recommend to a user who has only performed the secondary event
on B as long as that is in the model. Makes no difference what subset of
events the user has performed; recommendations will work even if the user
has no primary events.

So think of the model as being separate from the user history of events.
Recs are made from user history—whatever it is, but the model must have
some correlated data for each event type you want to use from a user's
history, and sometimes, for infrequently seen items, there is no model data for
some event types.

Popularity has very little to do with recommendations except for the fact
that you are more likely to have good correlated events. In fact we do
things to normalize/down weight highly popular things because otherwise
recommendations are worse. You can tell this by doing cross-validation
tests for popular vs collaborative filtering using the CCO algorithm behind
the UR.

If you want popular items you can make a query with no user-id and you will
get the most popular. Also if there are not enough recommendations for a
user’s history data we fill in with popular.

Your questions don’t quite match how the algorithm works so hopefully this
straightens out some things.

BTW community support for the UR is here:
https://groups.google.com/forum/#!forum/actionml-user


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: July 2, 2018 at 9:32:01 AM
To: user@predictionio.apache.org 

Subject:  Digging into UR algorithm

Hi guys,

So I've been playing around with the UR algorithm and I would like to know
2 things if it is possible:

1- Does UR recommend items that are linked to the primary event only? Like if
item A is purchased (primary event) 1 time and item B is liked (secondary
event) 50 times, does UR only recommend item A as the popular one even
though item B has 50x the secondary events? Is there a way to play around this?

2- When I first read about UR I thought that it recommends items based on
the frequency of secondary events to primary events, i.e.: if 50 likes
(secondary event) of item A lead to the purchase of item B and 1 view
(secondary event) of item A leads to the purchase of item C, when someone
views and likes item A will he get recommended items B and C with equal scores,
disregarding the 50 likes vs 1 view? Is that the correct behavior or am I
missing something? Do all secondary events have the same weight of influence
for the recommender?

I hope that you can help me out understanding UR template.

Regards,
Sami Serbey



--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CAOtZQD8CU5fVvZ9C32Cj6YaC1F%2B7oxWF%2Br21ApKnuajOZOFuoA%40mail.gmail.com
<https://groups.google.com/d/msgid/actionml-user/CAOtZQD8CU5fVvZ9C32Cj6YaC1F%2B7oxWF%2Br21ApKnuajOZOFuoA%40mail.gmail.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.


Re: Digging into UR algorithm

2018-07-02 Thread Pat Ferrel
The only requirement is that someone performed the primary event on A and
the secondary event is correlated to that primary event.

the UR can recommend to a user who has only performed the secondary event
on B as long as that is in the model. Makes no difference what subset of
events the user has performed; recommendations will work even if the user
has no primary events.

So think of the model as being separate from the user history of events.
Recs are made from user history—whatever it is, but the model must have
some correlated data for each event type you want to use from a user's
history, and sometimes, for infrequently seen items, there is no model data for
some event types.

Popularity has very little to do with recommendations except for the fact
that you are more likely to have good correlated events. In fact we do
things to normalize/down weight highly popular things because otherwise
recommendations are worse. You can tell this by doing cross-validation
tests for popular vs collaborative filtering using the CCO algorithm behind
the UR.

If you want popular items you can make a query with no user-id and you will
get the most popular. Also if there are not enough recommendations for a
user’s history data we fill in with popular.

Your questions don’t quite match how the algorithm works so hopefully this
straightens out some things.

BTW community support for the UR is here:
https://groups.google.com/forum/#!forum/actionml-user


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: July 2, 2018 at 9:32:01 AM
To: user@predictionio.apache.org 

Subject:  Digging into UR algorithm

Hi guys,

So I've been playing around with the UR algorithm and I would like to know
2 things if it is possible:

1- Does UR recommend items that are linked to the primary event only? Like if
item A is purchased (primary event) 1 time and item B is liked (secondary
event) 50 times, does UR only recommend item A as the popular one even
though item B has 50x the secondary events? Is there a way to play around this?

2- When I first read about UR I thought that it recommends items based on
the frequency of secondary events to primary events, i.e.: if 50 likes
(secondary event) of item A lead to the purchase of item B and 1 view
(secondary event) of item A leads to the purchase of item C, when someone
views and likes item A will he get recommended items B and C with equal scores,
disregarding the 50 likes vs 1 view? Is that the correct behavior or am I
missing something? Do all secondary events have the same weight of influence
for the recommender?

I hope that you can help me out understanding UR template.

Regards,
Sami Serbey


Re: [actionml/universal-recommender] Use properties for recommendation other than categorial? (#54)

2018-07-01 Thread Pat Ferrel
If you can explain your app maybe I can answer. For one thing there is no
way to express numeric proximity, meaning that 33 is close to 34. I think you may
be asking an important question but we need to lay a foundation of more
important things first, so if you could answer my other questions about
conversions or describe your app it would help.

BTW the email addresses above are where to get community support for
PredictionIO and the UR. Join the google group for the UR and the mailing
list for PredictionIO.


From: Amit Assaraf  
Reply: actionml/universal-recommender


Date: July 1, 2018 at 2:13:16 PM
To: actionml/universal-recommender


Cc: Pat Ferrel  , Mention
 
Subject:  Re: [actionml/universal-recommender] Use properties for
recommendation other than categorial? (#54)

@pferrel <https://github.com/pferrel> How will using arrays of strings work?
I can see how it will work for places, but for sizes of houses it seems like
it wouldn't work, because how will it know ['34'] is close to ['31']? It
will just look at them as categories. House price wouldn't work either, as it
varies a lot as well.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/actionml/universal-recommender/issues/54#issuecomment-401633545>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAT8S6fsjjTfMmn3yUhVq_HB1ttExJ9tks5uCTtsgaJpZM4U-cmb>
.


Re: [actionml/universal-recommender] Use properties for recommendation other than categorial? (#54)

2018-07-01 Thread Pat Ferrel
Not sure what you mean by "taking into account".

What is the primary indicator of user behavior? What is your conversion,
the behavior you want to increase? For E-Com it is "buy" but for other apps
it may be watch, read, like, etc. This is the first thing you must record
because it is the essence of collaborative filtering, even for item
similarity. User behavior is the key thing to look at in determining item
similarity. Then you can apply business rules to narrow down an item-based
or item-set-based query using item properties like city & size_in_sqf.

Are views your conversions? For E-Com views do not predict sales/buys very
well.

From: Amit Assaraf  
Reply: actionml/universal-recommender


Date: July 1, 2018 at 1:20:06 PM
To: actionml/universal-recommender


Cc: Subscribed 

Subject:  Re: [actionml/universal-recommender] Use properties for
recommendation other than categorial? (#54)

This can be simplified by ignoring the user viewed part. Let's say I give
the algorithm a list of items that I know the user viewed and I just need
it to give me products similar to the ones I provide by taking into account
city & size_in_sqf. How do I accomplish that?

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
,
or mute the thread

.


Re: [actionml/universal-recommender] Use properties for recommendation other than categorial? (#54)

2018-07-01 Thread Pat Ferrel
This is setting properties for items, which MUST be arrays of strings in
the Universal Recommender. The array may contain only one string, so you
need to change that part. See docs on actionml.com
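
For example (a Scala sketch with made-up bucket boundaries, not from the docs), numeric values such as size_in_sqf have to be turned into category strings before they can go into an array-of-strings property:

```
// Turn a numeric size into a category string so it fits an array-of-strings property.
def sizeBucket(sqf: Int): String = sqf match {
  case s if s < 30  => "size-under-30"
  case s if s < 50  => "size-30-49"
  case s if s < 100 => "size-50-99"
  case _            => "size-100-plus"
}

// Properties in the form the UR expects: every value is an array of strings.
val properties: Map[String, Seq[String]] = Map(
  "city"        -> Seq("new-york"),
  "size_in_sqf" -> Seq(sizeBucket(35)) // "size-30-49"
)
```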


From: Amit Assaraf  
Reply: actionml/universal-recommender


Date: July 1, 2018 at 12:54:34 PM
To: actionml/universal-recommender


Cc: Subscribed 

Subject:  [actionml/universal-recommender] Use properties for
recommendation other than categorial? (#54)

I'd like to add a property to my item like so:

{
"event" : "$set",
"entityType" : "item",
"entityId" : "some-item-id",
"properties" : {
"city": "New York",
"size_in_sqf": 35
},
"eventTime" : "2015-10-05T21:02:49.228Z"
}

then let's say the user has viewed several items and reported a view event.

Now I want the recommender to recommend me items that are similar to the
ones the user viewed by taking into account the properties city &
size_in_sqf.

That means it should show me items that are in New York and the size_in_sqf
is around 35.
How do I do this? I can't find any tutorial other than the official one on
using UR and I really want to accomplish something similar to this.

thanks!

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
, or mute the
thread

.


[jira] [Updated] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-2048:
---
Sprint: 0.14.0 Release

> There are duplicate content pages which need redirects instead
> --
>
> Key: MAHOUT-2048
> URL: https://issues.apache.org/jira/browse/MAHOUT-2048
> Project: Mahout
>  Issue Type: Planned Work
>  Components: website
>Affects Versions: 0.13.0, 0.14.0
>    Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 0.13.0, 0.14.0
>
>
> I have duplicated content in 3 places in the `website/` directory. We need to 
> have one place for the real content and replace the dups with redirects to the 
> actual content. This looks like it may be true for several other pages and 
> honestly I'm not sure if they are all needed but there are many links out in 
> the wild that point to the old path for the CCO recommender pages so we 
> should do this for the ones below at least. Better yet we may want to clean 
> out any other dups unless someone knows why not.
> TLDR;
> Actual content:
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/docs/latest/algorithms/recommenders/cco.md
>  
> Dups to be replaced with redirects to the above content. I vaguely remember 
> all these different site structures so there may be links to them in the wild.
> mahout/website/recommender-overview.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md
> mahout/website/users/recommender/quickstart.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/recommender/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-2048:
---
Affects Version/s: 0.14.0
Fix Version/s: 0.14.0

> There are duplicate content pages which need redirects instead
> --
>
> Key: MAHOUT-2048
> URL: https://issues.apache.org/jira/browse/MAHOUT-2048
> Project: Mahout
>  Issue Type: Planned Work
>  Components: website
>Affects Versions: 0.13.0, 0.14.0
>    Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 0.13.0, 0.14.0
>
>
> I have duplicated content in 3 places in the `website/` directory. We need to 
> have one place for the real content and replace the dups with redirect to the 
> actual content. This looks like it may be true for several other pages and 
> honestly I'm not sure if they are all needed but there are many links out in 
> the wild that point to the old path for the CCO recommender pages so we 
> should do this for the ones below at least. Better yet we may want to clean 
> out any other dups unless someone knows why not.
> TLDR;
> Actual content:
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/docs/latest/algorithms/recommenders/cco.md
>  
> Dups to be replaced with redirects to the above content. I vaguely remember 
> all these different site structures so there may be links to them in the wild.
> mahout/website/recommender-overview.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md
> mahout/website/users/recommender/quickstart.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/recommender/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525607#comment-16525607
 ] 

Pat Ferrel commented on MAHOUT-2048:


found another dup

mahout/website/docs/latest/tutorials/intro-cooccurrence-spark/index.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md

 

I want to update the content of these so getting the redirects would really 
help. I plan to only update what is in 
mahout/website/docs/latest/algorithms/recommenders/
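
For reference, one possible way to do each redirect, assuming the site's Jekyll build 
includes (or can include) the jekyll-redirect-from plugin, is to replace the dup page's 
body with front matter only, e.g. for one of the old recommender pages (the target URL 
here is illustrative and should be checked against the published site):

---
redirect_to: /docs/latest/algorithms/recommenders/index.html
---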

> There are duplicate content pages which need redirects instead
> --
>
> Key: MAHOUT-2048
> URL: https://issues.apache.org/jira/browse/MAHOUT-2048
> Project: Mahout
>  Issue Type: Planned Work
>  Components: website
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Minor
> Fix For: 0.13.0
>
>
> I have duplicated content in 3 places in the `website/` directory. We need to 
> have one place for the real content and replace the dups with redirect to the 
> actual content. This looks like it may be true for several other pages and 
> honestly I'm not sure if they are all needed but there are many links out in 
> the wild that point to the old path for the CCO recommender pages so we 
> should do this for the ones below at least. Better yet we may want to clean 
> out any other dups unless someone knows why not.
> TLDR;
> Actual content:
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/docs/latest/algorithms/recommenders/cco.md
>  
> Dups to be replaced with redirects to the above content. I vaguely remember 
> all these different site structures so there may be links to them in the wild.
> mahout/website/recommender-overview.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md
> mahout/website/users/recommender/quickstart.md => 
> mahout/website/docs/latest/algorithms/recommenders/index.md
> mahout/website/users/recommender/intro-cooccurrence-spark.md => 
> mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2048:
--

 Summary: There are duplicate content pages which need redirects 
instead
 Key: MAHOUT-2048
 URL: https://issues.apache.org/jira/browse/MAHOUT-2048
 Project: Mahout
  Issue Type: Planned Work
  Components: website
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Andrew Musselman
 Fix For: 0.13.0


I have duplicated content in 3 places in the `website/` directory. We need to 
have one place for the real content and replace the dups with redirect to the 
actual content. This looks like it may be true for several other pages and 
honestly I'm not sure if they are all needed but there are many links out in 
the wild that point to the old path for the CCO recommender pages so we should 
do this for the ones below at least. Better yet we may want to clean out any 
other dups unless someone knows why not.



TLDR;

Actual content:

mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/docs/latest/algorithms/recommenders/cco.md

 

Dups to be replaced with redirects to the above content. I vaguely remember all 
these different site structures so there may be links to them in the wild.


mahout/website/recommender-overview.md => 
mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md

mahout/website/users/recommender/quickstart.md => 
mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/users/recommender/intro-cooccurrence-spark.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: a question about a high availability of Elasticsearch cluster

2018-06-22 Thread Pat Ferrel
This should work with any node down. Elasticsearch should elect a new
master.

What version of PIO are you using? PIO and the UR changed the client from
the transport client to the REST client in 0.12.0, which is why you are
using port 9200.

Do all PIO functions work correctly like:

   - pio app list
   - pio app new

with all the configs and missing nodes you describe? What I’m trying to
find out is if the problem is only with queries, which do use ES in a
different way.

What is the es.nodes setting in the engine.json’s sparkConf?
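
For comparison, a sketch of the relevant sparkConf fragment with all three ES nodes
listed might look like this (hostnames are placeholders, not your real config):

"sparkConf": {
  "es.index.auto.create": "true",
  "es.nodes": "node1,node2,node3",
  "es.port": "9200"
}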


From: jih...@braincolla.com  
Date: June 22, 2018 at 12:53:48 AM
To: actionml-user 

Subject:  a question about a high availability of Elasticsearch cluster

Hello Pat,

May I ask a question about the Elasticsearch cluster in PIO and the UR?

I've implemented an Elasticsearch cluster consisting of 3 nodes with the
options below.

**
cluster.name: my-search-cluster
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: [“node 1”, “node 2", “node 3”]

And I wrote the PIO options below.

**
...
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch

# The next line should match the ES cluster.name in ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=my-search-cluster

# For clustered Elasticsearch (use one host/port if not clustered)
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=node1,node2,node3
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200
...

My questions are below.

1. I killed the Elasticsearch process on node 2 or node 3 and PIO keeps
working. But when the Elasticsearch process on node 1 is killed, PIO stops
working. Is that right?

2. I've changed the PIO options as below. I killed the Elasticsearch process on
node 1 or node 3 and PIO keeps working. But when Elasticsearch on node 2
is killed, PIO stops working. Is that right?
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=node2,node1,node3

3. In my opinion, if the first node configured in
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS is killed, PIO stops working. Is
that right? If yes, please let me know why this happens.

Thank you.


Re: UR trending ranking as separate process

2018-06-20 Thread Pat Ferrel
Yes, we support “popular”, “trending”, and “hot” as methods for ranking items. 
The UR queries are backfilled with these items if there are not enough results. 
So if the user has little history and so only gets 5 out of 10 results based 
on that history, we will automatically return the other 5 from the “popular” 
results. This is the default if there is no specific config for this.

If you query with no user or item, we will return only from “popular” or 
whatever brand of ranking you have set up.

To change which type of ranking you want, you can specify the period to use in 
calculating the ranking and which method from “popular”, “trending”, and “hot”. 
These roughly correspond to number of conversions, speed of conversions, and 
acceleration in conversions, if that helps.

Docs here: http://actionml.com/docs/ur_config Search for “rankings" 
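
As a rough sketch only (I'm writing this from memory, so check the config docs above 
for the exact field names and defaults), a rankings section in engine.json might look 
something like:

"rankings": [
  {
    "name": "trendRank",
    "type": "trending",
    "eventNames": ["buy"],
    "duration": "2 days"
  }
]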


From: Sami Serbey 
Reply: user@predictionio.apache.org 
Date: June 20, 2018 at 10:25:53 AM
To: user@predictionio.apache.org , Pat Ferrel 

Cc: user@predictionio.apache.org 
Subject:  Re: UR trending ranking as separate process  

Hi George,

I didn't get your question but I think I am missing something. So you're using 
the Universal Recommender and you're getting a sorted output based on the 
trending items? Is that really a thing in this template? May I please know how 
you can configure the template to get such output? I really hope you can answer 
that. I am also working with the UR template.

Regards,
Sami Serbey

Get Outlook for iOS
From: George Yarish 
Sent: Wednesday, June 20, 2018 7:45:12 PM
To: Pat Ferrel
Cc: user@predictionio.apache.org
Subject: Re: UR trending ranking as separate process
 
Matthew, Pat

Thanks for the answers and concerns. Yes, we want to calculate trending every 30 
minutes for the last X hours, where X might even be a few days. So the realtime 
analogy is correct. 

On Wed, Jun 20, 2018 at 6:50 PM, Pat Ferrel  wrote:
No the trending algorithm is meant to look at something like trends over 2 
days. This is because it looks at 2 buckets of conversion frequencies and if 
you cut them smaller than a day you will have so much bias due to daily 
variations that the trends will be invalid. In other words the ups and downs 
over a day period need to be made irrelevant and taking day long buckets is the 
simplest way to do this. Likewise for “hot” which needs 3 buckets and so takes 
3 days worth of data. 

Maybe what you need is to just count conversions for 30 minutes as a realtime 
thing. For every item, keep conversions for the last 30 minutes, sort them 
periodically by count. This is a Kappa style algorithm doing online learning, 
not really supported by PredictionIO. You will have to experiment with the 
length of time since a too small period will be very noisy, popping back and 
forth between items semi-randomly.


From: George Yarish 
Reply: user@predictionio.apache.org 
Date: June 20, 2018 at 8:34:10 AM
To: user@predictionio.apache.org 
Subject:  UR trending ranking as separate process 

Hi!

Not sure this is the correct place to ask, since my question corresponds to the UR 
specifically, not to pio itself I guess. 

Anyway, we are using UR template for predictionio and we are about to use 
trending ranking for sorting UR output. If I understand it correctly ranking is 
created during training and stored in ES. Our training takes ~ 3 hours and we 
launch it daily by a scheduler, but for trending rankings we want up-to-date 
information every 30 minutes.

That means we want to separate training (scores calculation) and ranking 
calculation and launch them by different schedule.

Is there any easy way to achieve it? Does UR supports something like this?

Thanks,
George



-- 
George Yarish, Java Developer
Grid Dynamics
197101, Rentgena Str., 5A, St.Petersburg, Russia
Cell: +7 950 030-1941
Read Grid Dynamics' Tech Blog



Re: UR trending ranking as separate process

2018-06-20 Thread Pat Ferrel
No the trending algorithm is meant to look at something like trends over 2
days. This is because it looks at 2 buckets of conversion frequencies and
if you cut them smaller than a day you will have so much bias due to daily
variations that the trends will be invalid. In other words the ups and
downs over a day period need to be made irrelevant and taking day long
buckets is the simplest way to do this. Likewise for “hot” which needs 3
buckets and so takes 3 days worth of data.

Maybe what you need is to just count conversions for 30 minutes as a
realtime thing. For every item, keep conversions for the last 30 minutes,
sort them periodically by count. This is a Kappa style algorithm doing
online learning, not really supported by PredictionIO. You will have to
experiment with the length of time since a too small period will be very
noisy, popping back and forth between items semi-randomly.
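
To make the Kappa-style idea concrete, here is a minimal sketch of a sliding-window
conversion counter (my own illustration, not part of PIO or the UR; it assumes
conversions arrive roughly in time order):

import scala.collection.mutable

// Keep (itemId, timestampMillis) conversions for the last `windowMillis`
// and rank items by how many conversions fall inside the window.
class SlidingWindowPopularity(windowMillis: Long = 30 * 60 * 1000L) {
  private val events = mutable.Queue.empty[(String, Long)]

  def record(itemId: String, timestampMillis: Long): Unit =
    events.enqueue((itemId, timestampMillis))

  def topItems(nowMillis: Long, n: Int): Seq[(String, Int)] = {
    // Drop conversions that have fallen out of the window.
    while (events.nonEmpty && events.head._2 < nowMillis - windowMillis)
      events.dequeue()
    // Count what is left and return the n most frequent items.
    events.groupBy(_._1).map { case (id, evs) => (id, evs.size) }
      .toSeq.sortBy(-_._2).take(n)
  }
}

As the caveat above says, with a window this small the top of the list can be very
noisy, so you would still want to experiment with the window length.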


From: George Yarish  
Reply: user@predictionio.apache.org 

Date: June 20, 2018 at 8:34:10 AM
To: user@predictionio.apache.org 

Subject:  UR trending ranking as separate process

Hi!

Not sure this is the correct place to ask, since my question corresponds to the UR
specifically, not to pio itself I guess.

Anyway, we are using UR template for predictionio and we are about to use
trending ranking for sorting UR output. If I understand it correctly
ranking is created during training and stored in ES. Our training takes ~ 3
hours and we launch it daily by a scheduler, but for trending rankings we want
up-to-date information every 30 minutes.

That means we want to separate training (scores calculation) and ranking
calculation and launch them by different schedule.

Is there any easy way to achieve it? Does UR supports something like this?

Thanks,
George


Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-19 Thread Pat Ferrel
Yes, those instructions tell you to run HDFS in pseudo-cluster mode. What
do you see in the HDFS GUI on localhost:50070 ?

Those setup instructions create a pseudo-clustered Spark, and HDFS/HBase.
This runs on a single machine but as the page says, are configured so you
can easily expand to a cluster by replacing config to point to remote HDFS
or Spark clusters.

One fix, if you don’t want to run those services in pseudo-cluster mode is:

1) remove any mention of PGSQL or jdbc, we are not using it. These are not
found on the page you linked to and are not used.
2) on a single machine you can put the dummy/empty model file in LOCALFS so
change the lines
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://localhost:9000/models
to
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE= LOCALFS
PIO_STORAGE_SOURCES_HDFS_PATH=/path/to/models
substituting with a directory where you want to save models

Running them in a pseudo-cluster mode gives you GUIs to see job progress
and browse HDFS for files, among other things. We recommend it for helping
to debug problems when you get to large amounts of data and begin running
out of resources.


From: Anuj Kumar  
Date: June 19, 2018 at 10:35:02 AM
To: p...@occamsmachete.com  
Cc: user@predictionio.apache.org 
, actionml-u...@googlegroups.com
 
Subject:  Re: java.util.NoSuchElementException: head of empty list when
running train

Hi Pat,
  Read it on the below link

http://actionml.com/docs/single_machine

here is the pio-env.sh

SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar

MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

HBASE_CONF_DIR=/usr/local/hbase/conf

PIO_FS_BASEDIR=$HOME/.pio_store

PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines

PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta

PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event

PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc

PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio

PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio

PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/els

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=pio

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs

PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://localhost:9000/models

PIO_STORAGE_SOURCES_HBASE_TYPE=hbase

PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase

Thanks,
Anuj Kumar



On Tue, Jun 19, 2018 at 9:16 PM Pat Ferrel  wrote:

> Can you show me where on the AML site it says to store models in HDFS, it
> should not say that? I think that may be from the PIO site so you should
> ignore it.
>
> Can you share your pio-env? You need to go through the whole workflow from
> pio build, pio train, to pio deploy using a template from the same
> directory and with the same engine.json and pio-env and I suspect something
> is wrong in pio-env.
>
>
> From: Anuj Kumar 
> 
> Date: June 19, 2018 at 1:28:11 AM
> To: p...@occamsmachete.com  
> Cc: user@predictionio.apache.org 
> , actionml-u...@googlegroups.com
>  
> Subject:  Re: java.util.NoSuchElementException: head of empty list when
> running train
>
> Tried with basic engine.json mentioned at UL site examples. Seems to work
> but got stuck at "pio deploy" throwing following error
>
> [ERROR] [OneForOneStrategy] Failed to invert: [B@35c7052
>
>
> before that "pio train" was successful but gave following error. I suspect
> because of this reason "pio deploy" is not working. Please help
>
> [ERROR] [HDFSModels] File /models/pio_modelAWQXIr4APcDlNQi8DwVj could only
> be replicated to 0 nodes instead of minReplication (=1).  There are 0
> datanode(s) running and no node(s) are excluded in this operation.
>
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)
>
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
>
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2565)
>
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
>
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>
> at
> org.

Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-19 Thread Pat Ferrel
Can you show me where on the AML site it says to store models in HDFS? It
should not say that. I think that may be from the PIO site, so you should
ignore it.

Can you share your pio-env? You need to go through the whole workflow from
pio build, pio train, to pio deploy using a template from the same
directory and with the same engine.json and pio-env and I suspect something
is wrong in pio-env.


From: Anuj Kumar  
Date: June 19, 2018 at 1:28:11 AM
To: p...@occamsmachete.com  
Cc: user@predictionio.apache.org 
, actionml-u...@googlegroups.com
 
Subject:  Re: java.util.NoSuchElementException: head of empty list when
running train

Tried with the basic engine.json mentioned in the UR site examples. Seems to work
but got stuck at "pio deploy", which throws the following error

[ERROR] [OneForOneStrategy] Failed to invert: [B@35c7052


before that "pio train" was successful but gave following error. I suspect
because of this reason "pio deploy" is not working. Please help

[ERROR] [HDFSModels] File /models/pio_modelAWQXIr4APcDlNQi8DwVj could only
be replicated to 0 nodes instead of minReplication (=1).  There are 0
datanode(s) running and no node(s) are excluded in this operation.

at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)

at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)

at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2565)

at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)

at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)

at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)

at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)


On Tue, Jun 19, 2018 at 10:45 AM Anuj Kumar 
wrote:

> Sure, here it is.
>
> {
>
>   "comment":" This config file uses default settings for all but the
> required values see README.md for docs",
>
>   "id": "default",
>
>   "description": "Default settings",
>
>   "engineFactory": "com.actionml.RecommendationEngine",
>
>   "datasource": {
>
> "params" : {
>
>   "name": "sample-handmad",
>
>   "appName": "np",
>
>   "eventNames": ["read", "search", "view", "category-pref"],
>
>   "minEventsPerUser": 1,
>
>   "eventWindow": {
>
> "duration": "300 days",
>
> "removeDuplicates": true,
>
> "compressProperties": true
>
>   }
>
> }
>
>   },
>
>   "sparkConf": {
>
> "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
>
> "spark.kryo.registrator":
> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
>
> "spark.kryo.referenceTracking": "false",
>
> "spark.kryoserializer.buffer": "300m",
>
> "spark.executor.memory": "4g",
>
> "spark.executor.cores": "2",
>
> "spark.task.cpus": "2",
>
> "spark.default.parallelism": "16",
>
> "es.index.auto.create": "true"
>
>   },
>
>   "algorithms": [
>
> {
>
>   "comment": "simplest setup where all values are default, popularity
> based backfill, must add eventsNames",
>
>   "name": "ur",
>
>   "params": {
>
> "appName": "np",
>
>     "indexName": "np",
>
> "typeName": "items",
>
> "blacklistEvents": [],
>
> "comment": "must have data for the first event or the model will
> not build, other events are optional",
>
> "indicators": [
>
>

Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-18 Thread Pat Ferrel
This sounds like some missing required config in engine.json. Can you share
the file?


From: Anuj Kumar  
Reply: user@predictionio.apache.org 

Date: June 18, 2018 at 5:05:22 AM
To: user@predictionio.apache.org 

Subject:  java.util.NoSuchElementException: head of empty list when running
train

Getting this while running "pio train". Please help

Exception in thread "main" java.util.NoSuchElementException: head of empty
list

at scala.collection.immutable.Nil$.head(List.scala:420)

at scala.collection.immutable.Nil$.head(List.scala:417)

at
org.apache.mahout.math.cf.SimilarityAnalysis$.crossOccurrenceDownsampled(SimilarityAnalysis.scala:177)

at com.actionml.URAlgorithm.calcAll(URAlgorithm.scala:343)

at com.actionml.URAlgorithm.train(URAlgorithm.scala:295)

at com.actionml.URAlgorithm.train(URAlgorithm.scala:180)

at
org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)

at
org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:690)

at
org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:690)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

at scala.collection.immutable.List.foreach(List.scala:381)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

at scala.collection.immutable.List.map(List.scala:285)

at org.apache.predictionio.controller.Engine$.train(Engine.scala:690)

at org.apache.predictionio.controller.Engine.train(Engine.scala:176)

at
org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)

at
org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)

at
org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


--
-
Best,
Anuj Kumar


Re: Few Queries Regarding the Recommendation Template

2018-06-13 Thread Pat Ferrel
Wow that page should be reworded or removed. They are trying to talk about
ensemble models, which are a valid thing but they badly misapply it there.
The application to multiple data types is just wrong and I know because I
tried exactly what they are suggesting but with cross-validation tests to
measure how much worse things got.

For instance if you use buy and dislike what kind of result are you going
to get if you have 2 models? One set of results will recommend “buy” the
other will tell you what a user is likely to “dislike”. How do you combine
them?

Ensembles are meant to use multiple *algorithms* and do something like
voting on recommendations. But you have to pay close attention to what the
algorithm uses as input and what it recommends. All members of the ensemble
must recommend the same action to the user.

Whoever contributed this statement: The default algorithm described in DASE
<https://predictionio.apache.org/templates/similarproduct/dase/#algorithm> uses
user-to-item view events as training data. However, your application may
have more than one type of events which you want to take into account, such
as buy, rate and like events. One way to incorporate other types of events
to improve the system is to add another algorithm to process these events,
build a separated model and then combine the outputs of multiple algorithms
during Serving.

Is patently wrong. Ensembles must recommend the same action to users, and
unless each algorithm in the ensemble is recommending the same thing (albeit
with slightly different internal logic) you will get gibberish
out. The winner of the Netflix prize did an ensemble with 107 (IIRC)
different algorithms all using exactly the same input data. There is no
principle that says if you feed conflicting data into several ensemble
algorithms that you will get diamonds out.

Furthermore using view events is bad to begin with because the recommender
will recommend what it thinks you want to view. We did this once with a
large dataset from a big E-Com company where we did cross-validation tests
using “buy” alone, “view” alone,  and ensembles of “buy” and “view”. We got
far better results using buy alone than using buy with ~100x as many
“views". The intent of the user and how they find things to view is so
different than when they finally come to buy something that adding view
data got significantly worse results. This is because people have different
reasons to view—maybe a flashy image, maybe a promotion, maybe some
placement bias, etc. This type of browsing “noise” pollutes the data which
can no longer be used to recommend “buy”s. We did several experiments
including comparing several algorithms types with “buy” and “view” events.
“view” always lost to “buy” no matter the algo we used (they were all
unimodal). There may be some exception to this result out there but it will
be accidental, not because it is built into the algorithm. When I say this
worsened results I’m not talking about some tiny fraction of a %, I’m
talking about a decrease of 15-20%

You could argue that “buy”, “like”, and rate will produce similar results
but from experience I can truly say that view and dislike will not.

Since the method described on the site is so sensitive to the user intent
recorded in events I would never use something like that without doing
cross-validation tests and then you are talking about a lot of work. There
is no theoretical or algorithmic correlation detection built into the
ensemble method so you may or may not get good results and I can say
unequivocally that the exact thing they describe will give worse results
(or at least it did in our experiments). You cannot ignore the intent
behind the data you use as input unless this type of correlation detection
is built into the algorithm and with the ensemble method described this
issue is completely ignored.

The UR uses the Correlated Cross-Occurrence algorithm for this exact reason
and was invented to solve the problem we found using “buy” and “view” data
together.  Let’s take a ridiculous extreme and use “dislikes" to recommend
“likes”? Does that even make sense? Check out an experiment with CCO where
we did this exact thing:
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

OK, rant over :-) Thanks for bringing up one of the key issues being
addressed by modern recommenders—multimodality. It is being addressed in
scientific ways, unfortunately the page on PIO’s site gets it wrong.




From: KRISH MEHTA  
Reply: KRISH MEHTA  
Date: June 13, 2018 at 2:19:17 PM
To: Pat Ferrel  
Subject:  Re: Few Queries Regarding the Recommendation Template

I understand, but if I just want the likes, dislikes and views then I can
combine the algorithms, right? As given in the link:
https://predictionio.apache.org/templates/similarproduct/multi-events-multi-algos/
I hope this works.

On Jun 13, 2018, at 1:19 PM, Pat Ferrel  wrote:

I would strongly recommend against using ratings. N

Re: Few Queries Regarding the Recommendation Template

2018-06-13 Thread Pat Ferrel
I would strongly recommend against using ratings. No one uses these as
input to recommenders anymore. Netflix doesn’t even show ratings. The best
input to a recommender is a conversion, buy, watch, listen, etc depending
on the item type. But the recommender you are using only allows one of
these as input. ALS is unimodal. There is no way to combine different
inputs with weighting that is valid with plain matrix factorization. So
ratings (if you choose to ignore my advice) and views cannot be mixed. For
one thing the math requires either implicit or explicit values for input,
but cannot really mix the 2 and for another thing—as I said—it is unimodal.
If there are instructions that say you can mix different data like ratings
and views it is wrong. A unimodal recommender can only find the user’s
intent from one type of signal at a time. If you train on views it will
recommend the user view something and this may be very different than
buying something. I know this because I’ve done experiments on this issue.

The Universal Recommender is the only multimodal recommender that I know of
that works with PIO. Factorization Machines are also multimodal but much
harder to use and there is no PIO template for them anyway.

To use the UR I would suggest using conversions (buy), high ratings = like,
low ratings = dislike, and views (I assume you are talking about detail
page views) as boolean “did view” input. The UR will find correlations
between this multimodal data and make the best recommendations based on
this.  You can also set “dislike” to filter out any recommendation where
the user has already expressed the fact that they dislike the item.
http://actionml.com/docs/ur
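
As a sketch (values are illustrative; the full config is in the doc linked above), the
relevant parts of the UR's engine.json for that setup would be along these lines, with
"buy" as the primary/conversion indicator:

"eventNames": ["buy", "like", "dislike", "view"],
...
"blacklistEvents": ["dislike"]

so "dislike" both contributes as a correlator and filters out items the user has said
they dislike.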


From: KRISH MEHTA  
Reply: user@predictionio.apache.org 

Date: June 13, 2018 at 12:06:16 PM
To: user@predictionio.apache.org 

Subject:  Few Queries Regarding the Recommendation Template

Hi,
I am new to PredictionIO and I have gone through the tutorial provided
regarding the customer buying and rating products. I encountered queries
regarding those.
1. What if I change the rating of the product? Will it update the result in
the database? Like will it use the most recent rating?
2. What if I want to recommend a product with implicit as well as explicit
content? Is there a link which helps me understand this, or can anyone
help me with it? I have gone through the tutorial and it says that for
implicit data it adds up the number of views to decide whether the viewer likes or
dislikes an item. But what if I want to recommend to a user using their likes and
dislikes as well as the number of views? For example, even if the user has
viewed an item 1000’s of times, if they dislike the product then that should
affect the recommendation. Can anyone suggest a simpler way, or do I
have to make major changes in my code?

I hope my questions are genuine and not mundane.

Regards,
Krish


Re: True Negative - ROC Curve

2018-06-12 Thread Pat Ferrel
We do not use these for recommenders. The precision rate is low when the
lift in your KPI like sales is relatively high. This is not like
classification.

We use MAP@k with increasing values of k. This should yield a diminishing
mean average precision chart with increasing k. This tells you 2 things: 1)
you are guessing in the right order, MAP@1 greater than MAP@2 means your
first guess is better than your second. The rate of decrease tells you
how fast the precision drops off with higher k. And 2) the baseline MAP@k
for future comparisons to tuning your engine or in champion/challenger
comparisons before putting into A/B tests.
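
As a tiny worked example of the metric itself (my own illustration, not from any
particular dataset): if a user's held-out conversions land at positions 1 and 3 of the
top-3 recommendations, the average precision at 3 is (1/1 + 2/3) / 2 ≈ 0.83, and MAP@3
is just that value averaged over all test users.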

Also note that RMSE has been pretty much discarded as an offline metric for
recommenders, it only really gives you a metric for ratings, and who cares
about that. No one wants to optimize rating guess anymore, conversions are
all that matters and precision is the way to measure potential conversion
since it actually measures how precise our guess is about what the user
actually converted on in the test set. Ranking is next most important since
you have a limited number of recommendations to show, you want the best
ranked first. MAP@k over a range of k does this but clients often try to
read sales lift in this and there is no absolute relationship. You can
guess at one once you have A/B test results, and you should also compare
non-recommendation results like random recs, or popular recs. If MAP is
lower or close to these, you may not have a good recommender or data.

AUC is not for every task. In this case the only positive is a conversion
in the test data and the only negative is the absence of conversion and the
ROC curve will be nearly useless.


From: Nasos Papageorgiou 

Reply: user@predictionio.apache.org 

Date: June 12, 2018 at 7:17:04 AM
To: user@predictionio.apache.org 

Subject:  True Negative - ROC Curve

Hi all,

I want to use the ROC curve (AUC - Area Under the Curve) for evaluation of
a recommender system in the case of a retailer. Could you please give an
example of a True Negative value?

i.e. True Positive is the number of items on the Recommended List that
appear in the test data set, where the test data set may be 20% of
the full data.

Thank you.






Re: Regarding Real-Time Prediction

2018-06-11 Thread Pat Ferrel
Actually if you are using the Universal Recommender you only need to deploy 
once as long as the engine.json does not change. The hot swap happens as 
@Digambar says and there is literally no downtime. If you are using any of the 
other recommenders you do have to re-deploy after every train but the deploy 
happens very quickly, a ms or 2 as I recall.
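
So for the UR the workflow is roughly (illustrative; adjust to your scheduler):

pio build
pio train      # first training
pio deploy &   # once; the PredictionServer stays up
pio train      # later trainings run on a schedule; the deployed UR hot-swaps the new model

With the other recommenders you would re-run `pio deploy` after each `pio train`.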


From: Digambar Bhat 
Reply: user@predictionio.apache.org 
Date: June 11, 2018 at 9:38:15 AM
To: user@predictionio.apache.org 
Subject:  Re: Regarding Real-Time Prediction  

You don't need to deploy same engine again and again. You just deploy once and 
train whenever you want. Deployed instance will automatically point to newly 
trained model as hot swap happens. 

Regards,
Digambar

On Mon 11 Jun, 2018, 10:02 PM KRISH MEHTA,  wrote:
Hi,
I have just started using PredictionIO and according to the documentation I 
have to always run the Train and Deploy Command to get the prediction. I am 
working on predicting videos for recommendation and I want to know if there is 
any other way possible so that I can predict the results on the Fly with no 
Downtime.

Please help me with the same.

Yours Sincerely,
Krish

Re: UR template minimum event number to recommend

2018-06-04 Thread Pat Ferrel
No but we have 2 ways to handle this situation automatically and you can
tell if recommendations are not from personal user history.


   1. when there is not enough user history to recommend, we fill in the
   lower ranking recommendations with popular, trending, or hot items. Not
   completely irrelevant but certainly not as good as if we had more data for
   them.
   2. You can also mix item and user-based recs. So if you have an item,
   perhaps from the page or screen the user is looking at, you can send both
   user and item in the query. If you want user-based, boost it higher with
   the userBias. Then if the query cannot send back user-based recs it will
   fill in with item-based. This only works in certain situations where you
   have some example item.

As always if you do a user-based query and all scores are 0, you know that
no real recommendations are included and can take some other action.
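
For example (a sketch with made-up ids), a query that prefers personal recommendations
but can fall back to item-based ones might look like:

{
  "user": "u-123",
  "item": "i-456",
  "userBias": 5,
  "num": 10
}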


From: Krajcs Ádám  
Reply: user@predictionio.apache.org 

Date: June 4, 2018 at 5:14:33 AM
To: user@predictionio.apache.org 

Subject:  UR template minimum event number to recommend

Hi,



Is it possible to somehow configure the Universal Recommender to only recommend
items to users with a minimum number of events? For example, a user with 2
view events usually gets irrelevant recommendations, but 5 events would be
enough.



Thanks!



Regards,

Adam Krajcs


Re: Prediction IO on temporary spark cluster on AWS EMR

2018-05-30 Thread Pat Ferrel
Search the archives; we had a discussion about this a few days ago.


From: ANKIT HALDAR  
Reply: user@predictionio.apache.org 

Date: May 30, 2018 at 12:12:52 PM
To: user@predictionio.apache.org 

Subject:  Prediction IO on temporary spark cluster on AWS EMR

Hi,

I am Ankit Haldar, working for a startup in India. I wanted to know about
the best ways we can scale PredictionIO using temporary training spark
cluster. It would be great if you could give me some guidance.

Thanks!
Ankit Haldar


Re: PIO 0.12.1 with HDP Spark on YARN

2018-05-29 Thread Pat Ferrel
Yarn has to be started explicitly. Usually it is part of Hadoop and is
started with Hadoop. Spark only contains the client for Yarn (afaik).



From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 6:45:43 PM
To: user@predictionio.apache.org 

Subject:  Re: PIO 0.12.1 with HDP Spark on YARN

That's the command that I'm using but it gives me the exception that I
listed in the previous email.  I've installed a Spark standalone cluster
and am using that for training for now but would like to use Spark on YARN
eventually.

Are you using HDP? If so, what version of HDP are you using?  I'm using
*HDP-2.6.2.14.*



On Tue, May 29, 2018 at 8:55 PM, suyash kharade 
wrote:

> I use 'pio train -- --master yarn'
> It works for me to train universal recommender
>
> On Tue, May 29, 2018 at 8:31 PM, Miller, Clifford <
> clifford.mil...@phoenix-opsgroup.com> wrote:
>
>> To add more details to this.  When I attempt to execute my training job
>> using the command 'pio train -- --master yarn' I get the exception that
>> I've included below.  Can anyone tell me how to correctly submit the
>> training job or what setting I need to change to make this work.  I've made
>> not custom code changes and am simply using PIO 0.12.1 with the
>> SimilarProduct Recommender.
>>
>>
>>
>> [ERROR] [SparkContext] Error initializing SparkContext.
>> [INFO] [ServerConnector] Stopped Spark@1f992a3a{HTTP/1.1}{0.0.0.0:4040}
>> [WARN] [YarnSchedulerBackend$YarnSchedulerEndpoint] Attempted to request
>> executors before the AM has registered!
>> [WARN] [MetricsSystem] Stopping a MetricsSystem that is not running
>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$se
>> tEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:154)
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$se
>> tEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:152)
>> at scala.collection.IndexedSeqOptimized$class.foreach(
>> IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.
>> scala:186)
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.setEnvFrom
>> InputString(YarnSparkHadoopUtil.scala:152)
>> at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$
>> 6.apply(Client.scala:819)
>> at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$
>> 6.apply(Client.scala:817)
>> at scala.Option.foreach(Option.scala:257)
>> at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.sc
>> ala:817)
>> at org.apache.spark.deploy.yarn.Client.createContainerLaunchCon
>> text(Client.scala:911)
>> at org.apache.spark.deploy.yarn.Client.submitApplication(Client
>> .scala:172)
>> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBacken
>> d.start(YarnClientSchedulerBackend.scala:56)
>> at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSched
>> ulerImpl.scala:156)
>> at org.apache.spark.SparkContext.(SparkContext.scala:509)
>> at org.apache.predictionio.workflow.WorkflowContext$.apply(
>> WorkflowContext.scala:45)
>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(
>> CoreWorkflow.scala:59)
>> at org.apache.predictionio.workflow.CreateWorkflow$.main(Create
>> Workflow.scala:251)
>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateW
>> orkflow.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:751)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:187)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.
>> scala:212)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 126)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>>
>> On Tue, May 29, 2018 at 12:01 AM, Miller, Clifford <
>> clifford.mil...@phoenix-opsgroup.com> wrote:
>>
>>> So updating the version in the RELEASE file to 2.1.1 fixed the version
>>> detection problem but I'm still not able to submit Spark jobs unless they
>>> are strictly local.  How are you submitting to the HDP Spark?
>>>
>>> Thanks,
>>>
>>> --Cliff.
>>>
>>>
>>>
>>> On Mon, May 28, 2018 at 1:12 AM, suyash kharade <
>>> suyash.khar...@gmail.com> wrote:
>>>
 Hi Miller,
 I faced same issue.
 It is giving error as release file has '-' in version
 Insert simple version in release file something like 2.6.

 On Mon, May 28, 2018 at 4:32 AM, Miller, Cliff

Re: Spark cluster error

2018-05-29 Thread Pat Ferrel
BTW the way we worked around this was to scale up the driver machine to
handle the executors too, et voila. All worked, but our normal strategy of
using remote Spark is now somehow broken. We upgraded everything to the
latest stable and may have messed up some config. So not sure where the
problem is, just looking for a clue we haven’t already thought of.


From: Pat Ferrel  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 2:14:23 PM
To: Donald Szeto  ,
user@predictionio.apache.org 

Subject:  Re: Spark cluster error

Yes, the spark-submit --jars is where we started to find the missing class.
The class isn’t found on the remote executor so we looked in the jars
actually downloaded into the executor’s work dir. The PIO assembly jars are
there and do have the classes. This would be in the classpath of the
executor, right? Not sure what you are asking.

Are you asking about the SPARK_CLASSPATH in spark-env.sh? The default
should include the work subdir for the job, I believe. and it can only be
added to so we couldn’t have messed that up if it points first to the
work/job-number dir, right?

I guess the root of my question is how can the jars be downloaded to the
executor’s work dir and still the classes we know are in the jar are not
found?


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 1:27:03 PM
To: user@predictionio.apache.org 

Subject:  Re: Spark cluster error

Sorry, what I meant was the actual spark-submit command that PIO was using.
It should be in the log.

What Spark version was that? I recall classpath issues with certain
versions of Spark.

On Thu, May 24, 2018 at 4:52 PM, Pat Ferrel  wrote:

> Thanks Donald,
>
> We have:
>
>- built pio with hbase 1.4.3, which is what we have deployed
>- verified that the `ProtobufUtil` class is in the pio hbase assembly
>- verified the assembly is passed in --jars to spark-submit
>- verified that the executors receive and store the assemblies in the
>FS work dir on the worker machines
>- verified that hashes match the original assembly so the class is
>being received by every executor
>
> However the executor is unable to find the class.
>
> This seems just short of impossible but clearly possible. How can the
> executor deserialize the code but not find it later?
>
> Not sure what you mean by the classpath going into the cluster? The classDef
> not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
> where it should get it?
>
> Thanks again
> p
>
>
> From: Donald Szeto  
> Reply: user@predictionio.apache.org 
> 
> Date: May 24, 2018 at 2:10:24 PM
> To: user@predictionio.apache.org 
> 
> Subject:  Re: Spark cluster error
>
> 0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
> Looking at Git history it has not changed in a while.
>
> Do you have the exact classpath that has gone into your Spark cluster?
>
> On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel  wrote:
>
>> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
>> Spark cluster? The issue seems to be how to pass the correct code to Spark
>> to connect to HBase:
>>
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
>> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil
>> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convert
>> StringToScan(TableMapReduceUtil.java:521)
>> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(
>> TableInputFormat.java:110)
>> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRD
>> D.scala:170)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
>> (edited)
>>
>> Now that we have these pluggable DBs did I miss something? This works
>> with master=local but not with remote Spark master
>>
>> I’ve passed in the hbase-client in the --jars part of spark-submit, still
>> fails, what am

Re: Spark cluster error

2018-05-29 Thread Pat Ferrel
Yes, the spark-submit --jars is where we started to find the missing class.
The class isn’t found on the remote executor so we looked in the jars
actually downloaded into the executor’s work dir. The PIO assembly jars are
there and do have the classes. This would be in the classpath of the
executor, right? Not sure what you are asking.

Are you asking about the SPARK_CLASSPATH in spark-env.sh? The default
should include the work subdir for the job, I believe. and it can only be
added to so we couldn’t have messed that up if it points first to the
work/job-number dir, right?

I guess the root of my question is how can the jars be downloaded to the
executor’s work dir and still the classes we know are in the jar are not
found?


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 1:27:03 PM
To: user@predictionio.apache.org 

Subject:  Re: Spark cluster error

Sorry, what I meant was the actual spark-submit command that PIO was using.
It should be in the log.

What Spark version was that? I recall classpath issues with certain
versions of Spark.

On Thu, May 24, 2018 at 4:52 PM, Pat Ferrel  wrote:

> Thanks Donald,
>
> We have:
>
>- built pio with hbase 1.4.3, which is what we have deployed
>- verified that the `ProtobufUtil` class is in the pio hbase assembly
>- verified the assembly is passed in --jars to spark-submit
>- verified that the executors receive and store the assemblies in the
>FS work dir on the worker machines
>- verified that hashes match the original assembly so the class is
>being received by every executor
>
> However the executor is unable to find the class.
>
> This seems just short of impossible but clearly possible. How can the
> executor deserialize the code but not find it later?
>
> Not sure what you mean by the classpath going into the cluster? The classDef
> not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
> where it should get it?
>
> Thanks again
> p
>
>
> From: Donald Szeto  
> Reply: user@predictionio.apache.org 
> 
> Date: May 24, 2018 at 2:10:24 PM
> To: user@predictionio.apache.org 
> 
> Subject:  Re: Spark cluster error
>
> 0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
> Looking at Git history it has not changed in a while.
>
> Do you have the exact classpath that has gone into your Spark cluster?
>
> On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel  wrote:
>
>> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
>> Spark cluster? The issue seems to be how to pass the correct code to Spark
>> to connect to HBase:
>>
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
>> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil
>> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convert
>> StringToScan(TableMapReduceUtil.java:521)
>> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(
>> TableInputFormat.java:110)
>> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRD
>> D.scala:170)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
>> (edited)
>>
>> Now that we have these pluggable DBs did I miss something? This works
>> with master=local but not with remote Spark master
>>
>> I’ve passed in the hbase-client in the --jars part of spark-submit, still
>> fails, what am I missing?
>>
>>
>> From: Pat Ferrel  
>> Reply: Pat Ferrel  
>> Date: May 23, 2018 at 8:57:32 AM
>> To: user@predictionio.apache.org 
>> 
>> Subject:  Spark cluster error
>>
>> Same CLI works using local Spark master, but fails using remote master
>> for a cluster due to a missing class def for protobuf used in hbase. We are
>> using the binary dist 0.12.1.  Is this known? Is there a work around?
>>
>> We are now trying a source build in hope the class will be put in the
>> assembly passed to Spark and the reasoning is that the executors don’t
>> contain hbase classes but when you run a local executor it does, due to
>> some local classpath. If the source built assembly does not have these
>> classes, we will have the same problem. Namely how to get protobuf to the
>> executors.
>>
>> Has anyone seen this?
>>
>>
>


Re: pio app new failed in hbase

2018-05-29 Thread Pat Ferrel
No, this is as expected. When you run pseudo-distributed everything
internally is configured as if the services were on separate machines. See
clustered instructions here: http://actionml.com/docs/small_ha_cluster. This
sets up 3 machines running different parts; it is not really the best
physical architecture but does illustrate how a distributed setup would go.

BTW we (ActionML) use containers now to do this setup but it still works.
The smallest distributed cluster that makes sense for the Universal
Recommender is 5 machines. 2 dedicated to Spark, which can be started and
stopped around the `pio train` process. So 3 are permanent; one for PIO
servers (EventServer and PredictionServer), one for HDFS+HBase, one for
Elasticsearch. This allows you to vertically scale by increasing the size
of the service instances in-place (easy with AWS), then horizontally scale
HBase or Elasticsearch, or Spark independently if vertical scaling is not
sufficient. You can also combine the 2 Spark instances as long as you
remember that the `pio train` process creates a Spark Driver on the machine
the process is launched on and so the driver may need to be nearly as
powerful as a Spark Executor. The Spark Driver is an “invisible" and
therefore often overlooked member of the Spark cluster. It is often, but not
always, smaller than the executors, so putting it on the PIO servers machine
is dangerous in terms of scaling unless you know the resources it will need.
Using Yarn can put the Driver on the cluster (off the launching machine) but
is more complex than the default Spark “standalone” config.

The Universal Recommender is the exception here because it does not require
a big non-local Spark for anything but training, so we move the `pio train`
process to a Spark “Driver” machine that is as ephemeral as the Spark
Executor(s). Other templates may require Spark for both train and deploy.
Once the UR’s training is done it will automatically swap in the new model
so the running deployed PredictionServer will automatically start using
it—no re-deploy needed.
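
For example, a UR training run against that kind of cluster is launched with
something like the following; the master URL and memory numbers are
placeholders, size them to your data:

# the Spark Driver runs on the machine where `pio train` is executed
pio train -- --master spark://spark-master:7077 \
  --driver-memory 8g \
  --executor-memory 16g \
  --total-executor-cores 8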


From: Marco Goldin  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 6:38:21 AM
To: user@predictionio.apache.org 

Subject:  Re: pio app new failed in hbase

I was able to solve the issue by deleting the hbase folder in hdfs with "hdfs
dfs -rm -r /hbase" and restarting hbase.
Now app creation in pio is working again.
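
In case it helps someone else, that was roughly the following; note it wipes
all HBase data, which is only acceptable on a test setup, and /hbase assumes
the default hbase.rootdir:

stop-hbase.sh            # stop HBase first
hdfs dfs -rm -r /hbase   # delete the stale HBase root dir in HDFS
start-hbase.sh           # HBase recreates its metadata on restart
pio app new mlolur       # app creation works again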

I still wonder why this problem happened though. I'm running hbase in
pseudo-distributed mode (for testing purposes everything, from spark to
hadoop, is on a single machine); could that be a problem for PredictionIO in
managing the apps?

2018-05-29 13:47 GMT+02:00 Marco Goldin :

> Hi all, i deleted all old apps from prediction (currently running 0.12.0)
> but when i'm creating a new one i get this error from hbase.
> I inspected hbase from shell but there aren't any table inside.
>
>
> ```
>
> pio app new mlolur
>
> [INFO] [HBLEvents] The table pio_event:events_1 doesn't exist yet.
> Creating now...
>
> Exception in thread "main" org.apache.hadoop.hbase.TableExistsException:
> org.apache.hadoop.hbase.TableExistsException: pio_event:events_1
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> prepareCreate(CreateTableProcedure.java:299)
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> executeFromState(CreateTableProcedure.java:106)
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> executeFromState(CreateTableProcedure.java:58)
>
> at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(
> StateMachineProcedure.java:119)
>
> at org.apache.hadoop.hbase.procedure2.Procedure.
> doExecute(Procedure.java:498)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(
> ProcedureExecutor.java:1147)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> execLoop(ProcedureExecutor.java:942)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> execLoop(ProcedureExecutor.java:895)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> access$400(ProcedureExecutor.java:77)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$
> 2.run(ProcedureExecutor.java:497)
>
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
>
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>
> at org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
>
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:95)
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(
> RpcRetryingCaller.java:209)
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(
> RpcRetryingCaller.java:223)
>
> at org.apache.hadoop.hbase.client

Re: PIO not using HBase cluster

2018-05-25 Thread Pat Ferrel
How are you starting the EventServer? You should not use pio-start-all,
which assumes all services are local.

- Configure pio-env.sh with your remote HBase.
- Start the EventServer with `pio eventserver &`, or some method that won't
  kill it when you log off, like `nohup pio eventserver &`.
- This should not start a local HBase, so your remote one should already be
  running.
- The same goes for the remote Elasticsearch and HDFS: they should be in
  pio-env.sh and already started.
- `pio status` should then be fine with the remote HBase.
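
Concretely, and assuming pio-env.sh already points at the remote services,
the sequence is just this (a sketch, not exact for your setup):

nohup pio eventserver &   # survives logout; output goes to nohup.out / pio.log
pio status                # verify the remote Elasticsearch/HBase/HDFS are reachable
pio app list              # sanity check against the remote event store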


From: Miller, Clifford 

Reply: Miller, Clifford 

Date: May 25, 2018 at 10:16:01 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 

Subject:  Re: PIO not using HBase cluster

I'll keep you informed.  However, I'm having issues getting past this.  If
I have hbase installed with the cluster's config files then it still does
not communicate with the cluster.  It does start hbase but on the local PIO
server.  If I ONLY have the hbase config (which worked in version 0.10.0)
then pio-start-all gives the following message.


 pio-start-all
Starting Elasticsearch...
Starting HBase...
/home/centos/PredictionIO-0.12.1/bin/pio-start-all: line 65:
/home/centos/PredictionIO-0.12.1/vendors/hbase/bin/start-hbase.sh: No such
file or directory
Waiting 10 seconds for Storage Repositories to fully initialize...
Starting PredictionIO Event Server...


"pio status" then returns:


 pio status
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.1 is installed at
/home/centos/PredictionIO-0.12.1
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at
/home/centos/PredictionIO-0.12.1/vendors/spark
[INFO] [Management$] Apache Spark 2.1.1 detected (meets minimum requirement
of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[WARN] [DomainSocketFactory] The short-circuit local reads feature cannot
be used because libhadoop cannot be loaded.
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[ERROR] [RecoverableZooKeeper] ZooKeeper exists failed after 1 attempts
[ERROR] [ZooKeeperWatcher] hconnection-0x558756be, quorum=localhost:2181,
baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
[WARN] [ZooKeeperRegistry] Can't retrieve clusterId from Zookeeper
[ERROR] [StorageClient] Cannot connect to ZooKeeper (ZooKeeper ensemble:
localhost). Please make sure that the configuration is pointing at the
correct ZooKeeper ensemble. By default, HBase manages its own ZooKeeper, so
if you have not configured HBase to use an external ZooKeeper, that means
your HBase is not started or configured properly.
[ERROR] [Storage$] Error initializing storage client for source HBASE.
org.apache.hadoop.hbase.ZooKeeperConnectionException: Can't connect to
ZooKeeper
at
org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2358)
at
org.apache.predictionio.data.storage.hbase.StorageClient.<init>(StorageClient.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:252)
at
org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:283)
at
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
at
scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:244)
at
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:315)
at
org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:300)
at
org.apache.predictionio.data.storage.Storage$.getLEvents(Storage.scala:448)
at
org.apache.predictionio.data.storage.Storage$.verifyAllDataObjects(Storage.scala:384)
at
org.apache.predictionio.tools.commands.Management$.status(Management.scala:156)
at org.apache.predictionio.tools.console.Pio$.status(Pio.scala:155)
at
org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:721)
at
org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:656)
at scala.O

Re: PIO not using HBase cluster

2018-05-25 Thread Pat Ferrel
No, you need to have HBase installed, or at least the config installed on
the PIO machine. The servers defined in pio-env.sh will be configured for
cluster operations and will be started separately from PIO. PIO will then not
start hbase, it will only communicate with it. But PIO still needs the config
for the client code that is in the pio assembly jar.

Some services were not cleanly separated between client, master, and slave,
so a complete installation is easiest, though you can figure out the minimum
with experimentation; I think it is just the conf directory.
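
A minimal sketch of what that looks like; host names and paths here are
placeholders, and the conf-only copy is the "minimum" mentioned above:

# copy just the client config from the cluster to the PIO machine
scp -r hbase-master:/etc/hbase/conf  $PIO_HOME/vendors/hbase/conf
scp -r namenode:/etc/hadoop/conf     $PIO_HOME/vendors/hadoop/conf

# then point pio-env.sh at the copies
HADOOP_CONF_DIR=$PIO_HOME/vendors/hadoop/conf
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase/conf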

BTW we have a similar setup and are having trouble with the Spark training
phase getting a `classDefNotFound: org.apache.hadoop.hbase.ProtobufUtil` so
can you let us know how it goes?



From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 25, 2018 at 9:43:46 AM
To: user@predictionio.apache.org 

Subject:  PIO not using HBase cluster

I'm attempting to use a remote cluster with PIO 0.12.1.  When I run
pio-start-all it starts the hbase locally and does not use the remote
cluster as configured.  I've copied the HBase and Hadoop conf files from
the cluster and put them into the locally configured directories.  I set
this up in the past using a similar configuration but was using PIO
0.10.0.  When doing this with this version I could start pio with only the
hbase and hadoop conf present.  This does not seem to be the case any
longer.

If I only put the cluster configs then it complains that it cannot find
start-hbase.sh.  If I put a hbase installation with cluster configs then it
will start a local hbase and not use the remote cluster.

Below is my PIO configuration



#!/usr/bin/env bash
#
# Safe config that will work if you expand your cluster later
SPARK_HOME=$PIO_HOME/vendors/spark
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch
HADOOP_CONF_DIR=$PIO_HOME/vendors/hadoop/conf
HBASE_CONF_DIR==$PIO_HOME/vendors/hbase/conf


# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

# Need to use HDFS here instead of LOCALFS to enable deploying to
# machines without the local model
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# What store to use for what data
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch
# The next line should match the ES cluster.name in ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=dsp_es_cluster

# For clustered Elasticsearch (use one host/port if not clustered)
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=ip-10-0-1-136.us-gov-west-1.compute.internal,ip-10-0-1-126.us-gov-west-1.compute.internal,ip-10-0-1-126.us-gov-west-1.compute.internal
#PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,9300,9300
#PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO 0.12.0+ uses the REST client for ES 5+ and this defaults to
# port 9200, change if appropriate but do not use the Transport Client port
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://ip-10-0-1-138.us-gov-west-1.compute.internal:8020/models

# HBase Source config
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase

# Hbase clustered config (use one host/port if not clustered)
PIO_STORAGE_SOURCES_HBASE_HOSTS=ip-10-0-1-138.us-gov-west-1.compute.internal,ip-10-0-1-209.us-gov-west-1.compute.internal,ip-10-0-1-79.us-gov-west-1.compute.internal
~


Re: Spark2 with YARN

2018-05-24 Thread Pat Ferrel
I’m having a java.lang.NoClassDefFoundError in a different context and
different class. Have you tried this without Yarn? Sorry I can’t find the
rest of this thread.


From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 24, 2018 at 4:16:58 PM
To: user@predictionio.apache.org 

Subject:  Spark2 with YARN

I've setup a cluster using Hortonworks HDP with Ambari all running in AWS.
I then created a separate EC2 instance and installed PIO 0.12.1, hadoop,
elasticsearch, hbase, and spark2.  I copied the configurations from the HDP
cluster and then pio-start-all.  The pio-start-all completes successfully
and running "pio status" also shows success.  I'm following the "Text
Classification Engine Tutorial".  I've imported the data.  I'm using the
following command to train: "pio train -- --master yarn".  After running
the command I get the following exception.  Does anyone have any ideas of
what I may have missed during my setup?

Thanks in advance.

#
Exception follows:

Exception in thread "main" java.lang.NoClassDefFoundError:
com/sun/jersey/api/client/config/ClientConfig
at
org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
at
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:152)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at
org.apache.predictionio.workflow.WorkflowContext$.apply(WorkflowContext.scala:45)
at
org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:59)
at
org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)
at
org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more

##


Re: Spark cluster error

2018-05-24 Thread Pat Ferrel
Thanks Donald,

We have:

   - built pio with hbase 1.4.3, which is what we have deployed
   - verified that the `ProtobufUtil` class is in the pio hbase assembly
   - verified the assembly is passed in --jars to spark-submit
   - verified that the executors receive and store the assemblies in the FS
   work dir on the worker machines
   - verified that hashes match the original assembly so the class is being
   received by every executor
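
For reference, the assembly check was just something like this (the jar name
is whatever the build produced, shown here only as an illustration):

# is the class really inside the assembly handed to --jars?
jar tf pio-data-hbase-assembly-0.12.1.jar | grep ProtobufUtil

# same file on the driver and in the executor work dir?
md5sum pio-data-hbase-assembly-0.12.1.jar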

However the executor is unable to find the class.

This seems just short of impossible but clearly possible. How can the
executor deserialize the code but not find it later?

Not sure what you mean by the classpath going into the cluster? The class
that is not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
where it should get it?

Thanks again
p


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: May 24, 2018 at 2:10:24 PM
To: user@predictionio.apache.org 

Subject:  Re: Spark cluster error

0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
Looking at Git history it has not changed in a while.

Do you have the exact classpath that has gone into your Spark cluster?

On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel  wrote:

> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
> Spark cluster? The issue seems to be how to pass the correct code to Spark
> to connect to HBase:
>
> [ERROR] [TransportRequestHandler] Error while invoking
> RpcHandler#receive() for one-way message.
> [ERROR] [TransportRequestHandler] Error while invoking
> RpcHandler#receive() for one-way message.
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.hadoop.hbase.protobuf.ProtobufUtil
> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.
> convertStringToScan(TableMapReduceUtil.java:521)
> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.
> setConf(TableInputFormat.java:110)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:170)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
> (edited)
>
> Now that we have these pluggable DBs did I miss something? This works with
> master=local but not with remote Spark master
>
> I’ve passed in the hbase-client in the --jars part of spark-submit, still
> fails, what am I missing?
>
>
> From: Pat Ferrel  
> Reply: Pat Ferrel  
> Date: May 23, 2018 at 8:57:32 AM
> To: user@predictionio.apache.org 
> 
> Subject:  Spark cluster error
>
> Same CLI works using local Spark master, but fails using remote master for
> a cluster due to a missing class def for protobuf used in hbase. We are
> using the binary dist 0.12.1.  Is this known? Is there a work around?
>
> We are now trying a source build in hope the class will be put in the
> assembly passed to Spark and the reasoning is that the executors don’t
> contain hbase classes but when you run a local executor it does, due to
> some local classpath. If the source built assembly does not have these
> classes, we will have the same problem. Namely how to get protobuf to the
> executors.
>
> Has anyone seen this?
>
>


Re: Spark cluster error

2018-05-23 Thread Pat Ferrel
A source build did not fix the problem, has anyone run PIO 0.12.1 on a
Spark cluster? The issue seems to be how to pass the correct code to Spark
to connect to HBase:

[ERROR] [TransportRequestHandler] Error while invoking RpcHandler#receive()
for one-way message.
[ERROR] [TransportRequestHandler] Error while invoking RpcHandler#receive()
for one-way message.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure:
Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.protobuf.ProtobufUtil
at
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertStringToScan(TableMapReduceUtil.java:521)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:110)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:170)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
(edited)

Now that we have these pluggable DBs did I miss something? This works with
master=local but not with remote Spark master

I’ve passed in the hbase-client in the --jars part of spark-submit, still
fails, what am I missing?
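
For the record, the --jars attempt looked like this; the jar paths, versions,
and master URL are placeholders, adjust to your install:

pio train -- --master spark://spark-master:7077 \
  --jars /opt/hbase/lib/hbase-client-1.4.3.jar,/opt/hbase/lib/hbase-common-1.4.3.jar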


From: Pat Ferrel  
Reply: Pat Ferrel  
Date: May 23, 2018 at 8:57:32 AM
To: user@predictionio.apache.org 

Subject:  Spark cluster error

Same CLI works using local Spark master, but fails using remote master for
a cluster due to a missing class def for protobuf used in hbase. We are
using the binary dist 0.12.1.  Is this known? Is there a work around?

We are now trying a source build in hope the class will be put in the
assembly passed to Spark and the reasoning is that the executors don’t
contain hbase classes but when you run a local executor it does, due to
some local classpath. If the source built assembly does not have these
classes, we will have the same problem. Namely how to get protobuf to the
executors.

Has anyone seen this?


Spark cluster error

2018-05-23 Thread Pat Ferrel
Same CLI works using local Spark master, but fails using remote master for
a cluster due to a missing class def for protobuf used in hbase. We are
using the binary dist 0.12.1.  Is this known? Is there a work around?

We are now trying a source build in hope the class will be put in the
assembly passed to Spark and the reasoning is that the executors don’t
contain hbase classes but when you run a local executor it does, due to
some local classpath. If the source built assembly does not have these
classes, we will have the same problem. Namely how to get protobuf to the
executors.

Has anyone seen this?


RE: Problem with training in yarn cluster

2018-05-23 Thread Pat Ferrel
he.predictionio.data.storage.Storage$.getPDataObject(Storage.scala:307)

at 
org.apache.predictionio.data.storage.Storage$.getPEvents(Storage.scala:454)

at 
org.apache.predictionio.data.store.PEventStore$.eventsDb$lzycompute(PEventStore.scala:37)

at 
org.apache.predictionio.data.store.PEventStore$.eventsDb(PEventStore.scala:37)

at 
org.apache.predictionio.data.store.PEventStore$.find(PEventStore.scala:73)

at com.actionml.DataSource.readTraining(DataSource.scala:76)

at com.actionml.DataSource.readTraining(DataSource.scala:48)

at 
org.apache.predictionio.controller.PDataSource.readTrainingBase(PDataSource.scala:40)

at org.apache.predictionio.controller.Engine$.train(Engine.scala:642)

at org.apache.predictionio.controller.Engine.train(Engine.scala:176)

at 
org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)

at 
org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)

at 
org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)

Caused by: com.google.protobuf.ServiceException:
java.net.UnknownHostException: unknown host: hbase-master

at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1678)

at 
org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)

at 
org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:42561)

at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(HConnectionManager.java:1682)

at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(HConnectionManager.java:1591)

at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$StubMaker.makeStub(HConnectionManager.java:1617)

... 36 more

Caused by: java.net.UnknownHostException: unknown host: hbase-master

at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.<init>(RpcClient.java:385)

at 
org.apache.hadoop.hbase.ipc.RpcClient.createConnection(RpcClient.java:351)

at 
org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1530)

at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)

at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)

... 41 more







*From: *Ambuj Sharma 
*Sent: *23 May 2018 08:59
*To: *user@predictionio.apache.org
*Cc: *Wojciech Kowalski 
*Subject: *Re: Problem with training in yarn cluster



Hi Wojciech,

I also faced many problems while setting up yarn with PredictionIO. This may
be a case where yarn is trying to find the pio.log file on the hdfs cluster.
You can try "--master yarn --deploy-mode client". You need to pass this
configuration with pio train,

e.g., pio train -- --master yarn --deploy-mode client








Thanks and Regards

Ambuj Sharma

Sunrise may late, But Morning is sure.....

Team ML

Betaout



On Wed, May 23, 2018 at 4:53 AM, Pat Ferrel  wrote:

Actually you might search the archives for “yarn” because I don’t recall
how the setup works off hand.



Archives here:
https://lists.apache.org/list.html?user@predictionio.apache.org



Also check the Spark Yarn requirements and remember that `pio train … --
various Spark params` allows you to pass arbitrary Spark params exactly as
you would to spark-submit on the pio command line. The double dash
separates PIO and Spark params.
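
For example, something like this passes YARN and memory settings straight
through to spark-submit (values are placeholders):

# everything after the bare -- goes to spark-submit unchanged
pio train -- --master yarn --deploy-mode client \
  --driver-memory 4g --executor-memory 8g --num-executors 4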




From: Pat Ferrel  
Reply: user@predictionio.apache.org 

Date: May 22, 2018 at 4:07:38 PM
To: user@predictionio.apache.org 
, Wojciech Kowalski 



Subject:  RE: Problem with training in yarn cluster



What is the command line for `pio train …`? Specifically, are you using
yarn-cluster mode? This causes the driver code, which is a PIO process, to
be executed on an executor. Special setup is required for this.




From: Wojciech Kowalski  
Reply: user@predictionio.apache.org 

Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org 

Subject:  RE: Problem with training in yarn cluster



Hello,



Actually I have another error in logs that is actu
