Forced push to master

2017-08-18 Thread Rahul Iyer
Hi dev-team,

I force-pushed to the master branch to fix an incorrect author assignment I
made on the commit titled "MLP: Add multiple enhancements". This practice is
frowned upon and I try to avoid it when I can. In this case, however,
I wanted to ensure the right person got credit for the commit.

The result of this push is that every local copy of master will have
diverged from the remote and will require updating.

If your master does not contain any local changes different from the remote
then you can perform a reset. For example, if your remote is named upstream:

$ git checkout master
$ git fetch upstream
$ git reset --hard upstream/master # Destroys your work on master

If the local master branch contains commits that need to be saved, then use
rebase:

$ git rebase --onto upstream/master 6f6f804 master

Further, existing PRs would now have a bunch of commits that are completely
unrelated to the PR. These PRs will have to be rebased on top of the
updated master branch to remove the erroneous commits.
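
As a rough sketch of what the `git rebase --onto` form above does: it replays
only the commits made after the old upstream tip onto the rewritten history.
The script below builds a throwaway repository to show this. It assumes that
6f6f804 in the command above is the old tip of master from before the
force-push (the message does not say so explicitly); every branch name, file
name and commit message in the script is made up for illustration.

```shell
# Throwaway demo of `git rebase --onto <new-base> <old-tip> <branch>`.
# All branch names, file names and commit messages are hypothetical.
demo_rebase_onto() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    git config user.email "demo@example.com"
    git config user.name "Demo"
    git config commit.gpgsign false

    # Shared history, then the upstream commit that later gets rewritten.
    echo base > base.txt
    git add base.txt
    git commit -q -m "base"
    base=$(git rev-parse HEAD)
    echo mlp > mlp.txt
    git add mlp.txt
    git commit -q -m "upstream commit, wrong author"
    old_tip=$(git rev-parse HEAD)

    # Local work that was built on top of the old upstream tip.
    git checkout -q -b my-work
    echo work > work.txt
    git add work.txt
    git commit -q -m "my local work"

    # Stand-in for the force-pushed upstream: same change, rewritten commit.
    git checkout -q -b rewritten "$base"
    echo mlp > mlp.txt
    git add mlp.txt
    git commit -q -m "upstream commit, right author"

    # Replay only the commits after old_tip onto the rewritten history.
    git rebase -q --onto rewritten "$old_tip" my-work
    git --no-pager log --format=%s my-work

    # Clean up the throwaway repository.
    cd /
    rm -rf "$repo"
)

demo_rebase_onto
```

The log printed at the end should list the local commit on top of the
rewritten upstream commit, with the wrongly attributed commit no longer in
the branch. For a PR branch the same form applies, with the PR branch name in
place of master.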

I apologize for the inconvenience this causes.

Best,
Rahul


Re: Jenkins madlib-master-build failed

2017-08-11 Thread Rahul Iyer
I have access to the service and built the MADlib projects on Jenkins.

I believe admin access to Jenkins allows editing *any* project. Roman
wanted us to be careful with such privileges, hence access was provided to
just one committer. With the move to TLP, maybe we could add more committers
with admin access.

- iR

On Fri, Aug 11, 2017 at 9:58 AM, Ed Espino  wrote:

> An observant badminton birdie whispered in the wind "I couldn't find a way
> to re-trigger Jenkins master, is it because I don't have a Jenkins
> account?"
>
> It just so happens that I assist with Apache Jenkins support for the Apache
> HAWQ (incubating) project. I requested access from the mentor (The great,
> powerful and kind Roman). It is he who granted me access to the Apache
> Jenkins service. It is through that privilege that I was able to trigger a
> MADlib master build to get the project back to a green state. I'm not sure
> how many team members on the Apache MADlib project have access to this
> service, but I suggest there are at least a few to assist with its
> maintenance.
>
> Who on the team currently has access to the Apache Jenkins service?
>
> -=e
>
> On Thu, Aug 10, 2017 at 4:15 PM, Ed Espino  wrote:
>
> > FYI: The manually triggered Jenkins master build passed:
> > https://builds.apache.org/view/M-R/view/MADlib/job/madlib-master-build/80/
> >
> > -=e
> >
> > On Thu, Aug 10, 2017 at 4:14 PM, Ed Espino  wrote:
> >
> >> Not sure what caused the MADlib master build to fail (git clone issue?).
> >> I have re-triggered it and it is beyond the previous failure point.
> >>
> >> -=e
> >>
> >> Here is the failure for future reference
> >> (https://builds.apache.org/view/M-R/view/MADlib/job/madlib-master-build/79/console):
> >>
> >> Checking out Revision 67b69eb8a5eec1ff5d4b947eabb90970d66b2ac5
> >> (refs/remotes/origin/master)
> >> Commit message: "MADLIB-1133. TLP graduation - remove references to
> >> "incubating"."
> >>  > git config core.sparsecheckout # timeout=10
> >>  > git checkout -f 67b69eb8a5eec1ff5d4b947eabb90970d66b2ac5
> >>  > git rev-list 0dc2df94358bb2ec3fd85865a6d53ae7cbde0226 # timeout=10
> >> Extended Email Publisher is currently disabled in project settings
> >> FATAL: Unable to produce a script file
> >> java.io.IOException: Permission denied
> >> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> >> at java.io.File.createTempFile(File.java:2024)
> >> at hudson.FilePath$17.invoke(FilePath.java:1373)
> >> at hudson.FilePath$17.invoke(FilePath.java:1363)
> >> at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2739)
> >> at hudson.remoting.UserRequest.perform(UserRequest.java:153)
> >> at hudson.remoting.UserRequest.perform(UserRequest.java:50)
> >> at hudson.remoting.Request$2.run(Request.java:336)
> >> at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:748)
> >> Caused: java.io.IOException: Failed to create a temporary directory in
> >> /tmp
> >> at hudson.FilePath$17.invoke(FilePath.java:1375)
> >> at hudson.FilePath$17.invoke(FilePath.java:1363)
> >> at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2739)
> >> at hudson.remoting.UserRequest.perform(UserRequest.java:153)
> >> at hudson.remoting.UserRequest.perform(UserRequest.java:50)
> >> at hudson.remoting.Request$2.run(Request.java:336)
> >> at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:748)
> >> at ..remote call to H21(Native Method)
> >> at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
> >> at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
> >> at hudson.remoting.Channel.call(Channel.java:830)
> >> at hudson.FilePath.act(FilePath.java:986)
> >> Caused: java.io.IOException: remote file operation failed:
> >> /home/jenkins/jenkins-slave/workspace/madlib-master-build at
> >> hudson.remoting.Channel@4b715ff3:H21
> >> at hudson.FilePath.act(FilePath.java:993)
> >> at hudson.FilePath.act(FilePath.java:975)
> >> at hudson.FilePath.createTextTempFile(FilePath.java:1363)
> >> Caused: java.io.IOException: Failed to create a temp file on
> >> /home/jenkins/jenkins-slave/workspace/madlib-master-build
> >> at hudson.FilePath.createTextTempFile(FilePath.java:1386)
> >> at hudson.tasks.CommandInterpreter.createScriptFile(CommandInterpreter.java:162)
> >> at 

Re: [VOTE]: MADlib repo(s) migration

2017-08-09 Thread Rahul Iyer
+0 for either option.

I feel the Github workflow doesn't add much and prefer to not use the
Github merge button. Keeping the history clean when possible requires
merging on a local machine, at which point either Github or ASF is just a
change in remote URL.



On Wed, Aug 9, 2017 at 2:47 PM, Orhan Kislal  wrote:

> 1
>
> Orhan Kislal
>
> On Wed, Aug 9, 2017 at 2:32 PM, Nandish Jayaram 
> wrote:
>
> > Hi All,
> >
> > With MADlib's graduation to TLP, it's time to migrate its github
> > repos from `*incubator-madlib*` to `*madlib*`. We will have to open
> > an Apache Infrastructure ticket to request this move for the following
> > repos (along with other stuff like wiki, jenkins etc):
> > https://git1-us-west.apache.org/repos/asf?p=incubator-madlib.git
> >  (Read/Write)
> > https://github.com/apache/incubator-madlib (Github mirror- read only)
> > https://git1-us-west.apache.org/repos/asf?p=incubator-madlib-site.git
> > https://github.com/apache/incubator-madlib-site (GitHub mirror)
> >
> > There are two ways to go about this, and the Infra ticket has to be
> > raised accordingly.
> > 1) Just maintain the current set-up, but have the repos renamed from
> > incubator-madlib to madlib.
> > 2) Use Gitbox to enable github repo as a R/W repo and not just read-only.
> > Check this email (
> > https://mail-archives.apache.org/mod_mbox/incubator-madlib-
> > dev/201708.mbox/%3cCA+ULb+vP0ViWH4Nc=4eaXvbT0KOmeFtQzp4eAa3p0fKPP7c
> > 8...@mail.gmail.com%3e)
> > for further information.
> >
> > Please vote your preference and we can decide to move accordingly.
> >
> > NJ
> >
>


Re: GCC 5, 6 and 7 are not supported by MADlib

2017-08-08 Thread Rahul Iyer
Thanks, Ed.

Current work is aimed towards completing the next release (1.12). We'll
pick up the JIRAs for this issue (MADLIB-1025, MADLIB-1145) after the
release.

On Tue, Aug 8, 2017 at 8:28 PM, Ed Espino  wrote:

> Sharing with the dev community: I've been working with several different
> Linux (Debian 8.9 & 9.1, Linux Mint 18.2, Fedora 26, Ubuntu 16.04 & 14.04)
> distros trying to see how MADlib builds and runs install-check against
> Postgres 9.6. I have had good success with GCC 4.x and zero success with
> GCC 5.x, 6.x and 7.x versions. There are distros in this list where finding
> a readily available GCC 4.x version isn't straightforward and I
> essentially had to revert to older distro versions. I can assist with
> validation once the other GCC versions (5, 6 & 7) are supported.
>
> -=e
>
> --
> *Ed Espino*
>


Re: MADlib Jenkins project question (madlib-master-build & madlib-pr-build)

2017-08-02 Thread Rahul Iyer
> Question: is the value for the sub-directory arbitrary?  I'd like to
> suggest we set this value to match the repository name (minus .git suffix).
> This will allow us to reference the ${GIT_URL} environment variable
> available to the running shell process.

The sub-directory name was chosen arbitrarily and can/should be replaced
by the repo name.
Thanks for updating the project and corresponding script files.
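
A minimal sketch of the suggestion above, assuming the checkout sub-directory
should simply be the last path component of ${GIT_URL} with the .git suffix
removed. repo_dir_from_url is a hypothetical helper, not something in the
existing Jenkins scripts, and the fallback URL is only there so the snippet
runs outside Jenkins.

```shell
# repo_dir_from_url is a hypothetical helper: it derives a directory name
# from a repository URL by taking the last path component and dropping
# a trailing ".git".
repo_dir_from_url() {
    basename "$1" .git
}

# Inside a Jenkins job GIT_URL would already be set; the default here is
# an illustrative value only.
GIT_URL="${GIT_URL:-https://github.com/apache/incubator-madlib.git}"
repo_dir_from_url "$GIT_URL"
```

With the sub-directory named this way, build scripts can compute the checkout
location from the environment instead of hard-coding it.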

Cheers,
iR


Re: Installation issue - OSError: [Errno 2] No such file or directory: '/usr/local/madlib/Versions/1.10.0/ports/postgres'

2017-05-09 Thread Rahul Iyer
+dev for the problem with RPM

Hi Atsushi,

Thanks for bringing this to our notice!

We might have to remove the 1.10 binary from the Apache dist to keep
others from hitting this problem. We're in the process of releasing 1.11 and
will redirect to that binary once it goes through the voting process.

@Louis, could you please clear the `/usr/local/madlib` folder and try again
with pgxn (or compiling from source as suggested by Markus)?




On Mon, May 8, 2017 at 11:53 PM, Neki, Atsushi <neki.atsu...@jp.fujitsu.com>
wrote:

> Hi Louis, Rahul,
>
> It seems that installation from the RPM binary doesn’t work for 1.10.0.
> The RPM doesn’t have anything but hawq under the ports directory.
>
> $ rpm -qlpi ./apache-madlib-1.10.0-incubating-bin-Linux.rpm | grep ports
> /usr/local/madlib/Versions/1.10.0/ports/hawq
>   (snip)
>
> The 1.9.1 RPM binary does not have this problem:
>
> /usr/local/madlib/Versions/1.9.1/ports/greenplum
> /usr/local/madlib/Versions/1.9.1/ports/hawq
> /usr/local/madlib/Versions/1.9.1/ports/postgres
>
> Unfortunately, I have no idea about the problem with pgxn.
>
> Regards,
> Atsushi Neki
>
> *From:* Markus Paaso [mailto:markus.pa...@gmail.com]
> *Sent:* Saturday, May 6, 2017 2:02 PM
> *To:* u...@madlib.incubator.apache.org
> *Subject:* Re: Installation issue - OSError: [Errno 2] No such file or
> directory: '/usr/local/madlib/Versions/1.10.0/ports/postgres'
>
> Hi Louis,
>
> I have installed madlib on Ubuntu 16.04 using following commands:
>
> PSQL_HOST="127.0.0.1"
> PSQL_DB="testing"
> PSQL_USER="testuser"
> PSQL_PASS=""
>
> psql -h $PSQL_HOST template1 -c "CREATE ROLE $PSQL_USER PASSWORD '$PSQL_PASS'"
> createdb -h $PSQL_HOST $PSQL_DB -O $PSQL_USER
>
> sudo apt install -y cmake m4
> wget https://github.com/apache/incubator-madlib/archive/rel/v1.10.0.tar.gz
> tar -xzf v1.10.0.tar.gz
> cd incubator-madlib-rel-v1.10.0
> ./configure
> cd build
> make
> sudo make install
>
> MADLIB_USER="mad"
> MADLIB_PASS="$(openssl rand -base64 32)"
> psql -h $PSQL_HOST $PSQL_DB -c "CREATE USER $MADLIB_USER SUPERUSER PASSWORD '$MADLIB_PASS'"
> PGPASSWORD="$MADLIB_PASS" /usr/local/madlib/bin/madpack -p postgres -c $MADLIB_USER@$PSQL_HOST/$PSQL_DB install
>
> psql -h $PSQL_HOST $PSQL_DB -c "GRANT ALL PRIVILEGES ON SCHEMA madlib TO $PSQL_USER"
>
> Best Regards,
> Markus Paaso
>
> 2017-05-05 20:11 GMT+03:00 Louis Leblanc <louisleblan...@gmail.com>:
>
> Thanks Rahul,
>
> I tried both solutions (with pgxn and with the RPM package).
>
> I didn't make any change to `/usr/local/madlib/Versions` after the
> installation?
>
> Thanks.
>
> 2017-05-05 10:54 GMT-06:00 Rahul Iyer <ri...@apache.org>:
>
> Hi Louis,
>
> Just to clarify: did you use the pgxn to install or the RPM package
> downloaded from Apache dist?
>
> And was there any change made to `/usr/local/madlib/Versions` after the
> installation?
>
> I'm going to try to reproduce the issue on an Ubuntu VM, so would
> appreciate your exact steps.
>
> - Rahul
>
> On Thu, May 4, 2017 at 8:41 AM, Louis Leblanc <louisleblan...@gmail.com>
> wrote:
>
> Thanks Rahul,
>
> - I used the process described here ==>
>   https://cwiki.apache.org/confluence/display/MADLIB/Installation+Guide#InstallationGuide-PGXNInstallingfromPGXN(PostgreSQL)
> - I used the version apache-madlib-1.10.0-incubating-bin-Linux.rpm
>   <https://dist.apache.org/repos/dist/release/incubator/madlib/1.10.0-incubating/apache-madlib-1.10.0-incubating-bin-Linux.rpm>
> - Content of the folder `/usr/local/madlib/Versions` ==> 1.10.0
>
> Louis
>
> 2017-05-03 17:31 GMT-06:00 Rahul Iyer <ri...@apache.org>:
>
> Hi Louis,
>
> Please help us understand the problem further.
>
> - What was the process you used to install MADlib?
> - Which version of MADlib did you install?
> - Please print the contents of `/usr/local/madlib/Versions`
>
> - Rahul
>
>
> On Wed, May 3, 2017 at 4:17 PM, Louis Leblanc <louisleblan...@gmail.com>
> wrote:
> > Hello,
> >
> > I'm experiencing issues installing madlib on Ubuntu 16.04 with Postgresql
> >

Re: [VOTE] MADlib v1.11-rc3

2017-05-05 Thread Rahul Iyer
+1

- Checksums, PGP signatures, RAT check are good.
- Source tar ball extracts to a directory named
`apache-madlib-1.11-incubating-src`
- GIT_REVISION set to `rc/1.11-rc3` as discussed in previous RC vote.

On Fri, May 5, 2017 at 10:16 AM, Orhan Kislal  wrote:

> +1
>
> On Fri, May 5, 2017 at 10:15 AM, Joseph Hellerstein <
> hellerst...@berkeley.edu> wrote:
>
> > dmg install smooth with OSX Postgres.app. Looks clean.
> >
> > +1
> >
> > On Fri, May 5, 2017 at 9:34 AM, Frank McQuillan 
> > wrote:
> >
> > > I just want to comment on a couple items raised in the RC1 and RC2
> votes
> > > that pertain to RC3:
> > >
> > > (1)
> > > “I happened to open the file "CMakeLists.txt" in the root directory
> > > and noticed it does not have the standard ASF header. I know there
> > > were IP issues resolved globally for the project recently. I
> > > noticed many of them are excluded in the pom.xml file. Regardless
> > > of the IP issues, shouldn't these files contain the ASF header?”
> > >
> > > Since this file existed before MADlib’s move to ASF, it does not need
> an
> > > ASF header as per the guidance from ASF on this topic
> > > https://issues.apache.org/jira/browse/LEGAL-293
> > >
> > > (2)
> > > “The DMG(apache-madlib-1.11-incubating-bin-Darwin.dmg) contains a
> > > pkg file named "madlib-1.11-Darwin.pkg". Shouldn't it be called
> > > "apache-madlib-1.11-incubating-Darwin.pkg"?
> > >
> > > Similarly, the DMG base folder name is madlib-1.11.Darwin.“
> > >
> > > As per guidance from Roman our mentor, it is not necessary to rename
> all
> > > packages and files.  Also, this may affect some functional tests that
> > look
> > > for certain file names.
> > >
> > > (3)
> > > “There are still three outstanding Jira issues in an "Unresolved" state
> > > with a fix version of v1.11.  Are they going to be resolved soon? They
> > can
> > > be seen with the following url:
> > >
> > > https://issues.apache.org/jira/browse/MADLIB/fixforversion/12339592/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> > > ”
> > >
> > > Regarding the JIRAs that are not closed, the actual work has been done
> so
> > > there is nothing material pending.  But I did not close them because I
> > > wanted Roman to do that, since he was the one overseeing them.
> > >
> > > (4)
> > > Convenience binaries are being voted on, as Rashmi’s email calls out.
> > >
> > > (5)
> > > I tried out the RC3 dmg and found that install, reinstall, upgrade work
> > > fine with the soft link on my OS X box on PG 9.6
> > >
> > > So...
> > >
> > >
> > > +1
> > >
> > >
> > >
> > >
> > > On Thu, May 4, 2017 at 6:10 PM, Rashmi Raghu 
> wrote:
> > >
> > > > Hello MADlib community,
> > > >
> > > > We have created a MADlib 1.11 RC-3, with the artifacts below (source
> > and
> > > > convenience binaries) up for a vote.
> > > >
> > > > Note that voting for the RC-2 release has been cancelled due to the
> > need
> > > > for minor corrections based on community feedback. Sorry for the
> > > > inconvenience.
> > > >
> > > > RC-3 replaces RC-2 with the following minor changes:
> > > > * Ensure product naming is consistently 'Apache MADlib (incubating)'
> > > > * Git revision tag changed to rc/1.11-rc3
> > > >
> > > > This will be the 5th release for Apache MADlib (incubating).
> > > >
> > > > The main goals of this release are:
> > > > * new module (PageRank for graph analytics with grouping support
> > > included)
> > > > * improvements to existing modules (add grouping support to Single
> > Source
> > > > Shortest Path, reduce memory footprint of DT and RF, include NULL
> > > features
> > > > in training DT, add support for array and svec output for Pivot
> module,
> > > > utility to unnest 2-D arrays into rows of 1-D arrays)
> > > > * platform updates (GPDB 5)
> > > > * updates for Apache Top Level Project readiness and build process on
> > > > Apache infrastructure
> > > > * bug fixes
> > > > * doc improvements
> > > >
> > > > For more information including release notes, please see:
> > > > https://cwiki.apache.org/confluence/display/MADLIB/MADlib+1.11
> > > >
> > > > *** Please download, review and vote by Tue May 09, 2017 @ 6pm PDT
> ***
> > > >
> > > > We're voting upon the source and convenience binaries below:
> > > >
> > > > Source Repository (tag):  rc/1.11-rc3
> > > > https://github.com/apache/incubator-madlib/tree/rc/1.11-rc3
> > > >
> > > > Source Files and convenience Binaries:
> > > > https://dist.apache.org/repos/dist/dev/incubator/madlib/1.11-incubating-rc3/
> > > >
> > > > Commit:
> > > > https://github.com/apache/incubator-madlib/commit/8e2778a3921aa99f009962756881ce4bea5eee16
> > > >
> > > > KEYS file containing PGP Keys we use to sign the release:
> > > > https://dist.apache.org/repos/dist/dev/incubator/madlib/KEYS
> > > >
> > > > To help in tallying the vote, PMC members please be sure to indicate
> > > > "(binding)" 

Re: [DISCUSS] Graduation

2017-05-04 Thread Rahul Iyer
Hi Roman,

Many thanks for your excellent mentorship!

Your #2 and #3 proposals sound good to me and I look forward to the
discussion on private@.

- Rahul


On Fri, Apr 28, 2017 at 10:47 AM, Roman Shaposhnik  wrote:
> Hi!
>
> with the fifth (v1.11) release in the final stages of being cut,
> I think now would be a good time to officially start our graduation
> discussion. With my mentor hat on, I feel that the project is
> mature and self-reliant enough to qualify as a TLP.
>
> Process-wise graduation consists of drafting a board resolution,
> getting it approved by the IPMC and finally submitting it to the ASF
> board's consideration. At the very minimum your resolution will contain:
> 1. A name of the project (I assume that'll be MADlib)
> 2. A list of proposed PMC members
> 3. A proposed PMC chair
> A good example of a resolution can be found here:
> https://cwiki.apache.org/confluence/display/FINERACT/Graduation+Resolution
>
> In fact, Frank and I took the liberty to use that as the basis for our own:
>  https://cwiki.apache.org/confluence/display/MADLIB/Graduation+Resolution
> Please read it carefully and let us know what you think.
>
> On #2 my suggestion would be to have an opt-in system. Basically
> we will kick off a thread on private@madlib asking current PPMC
> members if they are willing to continue on the PMC.
>
> On #3 I typically recommend podlings I mentor to set up a rotating chair
> policy. This is in no way an ASF requirement, so feel free to ignore it,
> but it has worked well before. The chair will be expected to be up for
> rotation every year. It will be more than OK for the same person to
> self-nominate once the year is up -- but at the same time it'll be up to the
> same person to actually kick off a thread asking if anybody else is
> interested in serving as a chair for the next year. Of course, if there are
> multiple candidates there will have to be a vote.
>
> Speaking of self-nomination -- the same thread that we're going to kick
> off as part of solving for #2 will ask for folks to self-nominate as an
> initial chair to be listed on the resolution.
>
> Unless somebody objects strongly to my #2 and #3 proposals I'm going
> to kick off this thread on private@.
>
> With that in mind, let's make the rest of the discussion on dev@ about
> collecting the data points to present to the IPMC as part of us asking them
> to vote YES on our graduation. Let's collect all these data points in the
> same wiki page:
> https://cwiki.apache.org/confluence/display/MADLIB/Graduation+Resolution
> Or if you feel that a discussion may be needed -- just reply to this thread.
>
> Thanks,
> Roman.


Re: [VOTE] MADlib v1.11-rc2

2017-05-03 Thread Rahul Iyer
Re: incorrect git revision in the files

The revision string is obtained using 'git describe' and the value of
`rel/v1.10.0-30-g0ff829a` indicates that the commit is 30 commits above the
v1.10.0 commit, with the commit SHA starting with 0ff829a. The difficulty
with ensuring it contains the `rel/v1.11` tag is that we don't yet have a
v1.11 release. The release tag can only be finalized after it has been
successfully voted upon. Since the release tags on apache are immutable, we
can't push them out before voting.
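
To make the format concrete, here is a small sketch that splits a
'git describe' string of the form <tag>-<count>-g<sha> into its parts.
parse_describe is a hypothetical helper written for this explanation, not
part of git or the MADlib build. It assumes the string has the full
three-part form; when HEAD sits exactly on a tag, 'git describe' prints only
the tag and this helper does not apply.

```shell
# parse_describe is a hypothetical helper: it splits a `git describe`
# string of the form <tag>-<count>-g<sha> using POSIX parameter expansion.
parse_describe() {
    desc=$1
    sha=${desc##*-g}      # text after the last "-g": abbreviated commit SHA
    rest=${desc%-g*}      # everything before the "-g<sha>" suffix
    count=${rest##*-}     # number of commits since the tag
    tag=${rest%-*}        # nearest reachable tag
    printf 'tag=%s commits_since_tag=%s sha=%s\n' "$tag" "$count" "$sha"
}

parse_describe "rel/v1.10.0-30-g0ff829a"
```

For the string in question this reports the tag rel/v1.10.0, 30 commits since
that tag, and the abbreviated SHA 0ff829a, which is why the v1.10.0 tag shows
up in a 1.11 release candidate.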

The DMGs are built on the release manager's local machine, so we can have
local tags to get the right string.
The RPMs, however, are built on a Jenkins/other CI server, which only contains
the remote tags. The best we could do is have `rc/v1.11-rc2` instead of the
current string.

- Rahul


On May 3, 2017 9:32 AM, "Frank McQuillan"  wrote:

Ed,


Thanks for your review,  all comments big and small certainly encouraged
and welcome.

Regarding the JIRAs that are not closed, the actual work has been done so
there is nothing material pending.  But I did not close them because I
wanted @rvs to do that, since he was the one overseeing them.  I will ask
him to close them at his earliest convenience.

Frank

On Wed, May 3, 2017 at 8:58 AM, Ed Espino  wrote:

> Sorry about the piecemeal observations. I'm currently in Beijing and don't
> have a lot of extra large time chunks to review the release in one sitting.
>
> 1) There are still three outstanding Jira issues in an "Unresolved" state
> with a fix version of v1.11.  Are they going to be resolved soon? They can
> be seen with the following url:
>
> https://issues.apache.org/jira/browse/MADLIB/fixforversion/12339592/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>
> 2) As it relates to the convenience binary release, I noticed an
> inconsistent MADLIB_GIT_REVISION value (rel/v1.10.0) spread throughout
> several SQLCommon.m4 files. Shouldn't the reference be to v1.11 instead of
> v1.10?
>
> 
> MAC (notice rel/v1.10.0-30-g0ff829a value):
> 
>
> ✔ /usr/local/madlib/Versions
> 23:42 $ grep -n -i -r MADLIB_GIT_REVISION *
> 1.11/ports/greenplum/modules/utilities/utilities.sql_in:122:'git revision: __MADLIB_GIT_REVISION__, '
> 1.11/ports/hawq/modules/utilities/utilities.sql_in:122:'git revision: __MADLIB_GIT_REVISION__, '
> 1.11/ports/postgres/9.4/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-30-g0ff829a')
> 1.11/ports/postgres/9.5/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-30-g0ff829a')
> 1.11/ports/postgres/9.6/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-30-g0ff829a')
> 1.11/ports/postgres/modules/utilities/utilities.sql_in:122:'git revision: __MADLIB_GIT_REVISION__, '
>
> 
> Linux (notice rel/v1.10.0-31-gd54be2b value):
> 
>
> [root@ip-172-31-9-242 Versions]# rpm -qa | grep madlib
> madlib-1.11-1.x86_64
> [root@ip-172-31-9-242 Versions]# pwd
> /usr/local/madlib/Versions
> [root@ip-172-31-9-242 Versions]# grep -n -i -r MADLIB_GIT_REVISION *
> 1.11/ports/greenplum/4.2/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-31-gd54be2b')
> 1.11/ports/greenplum/4.3/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-31-gd54be2b')
> 1.11/ports/greenplum/4.3ORCA/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-31-gd54be2b')
> 1.11/ports/greenplum/modules/utilities/utilities.sql_in:122:'git revision: __MADLIB_GIT_REVISION__, '
> 1.11/ports/hawq/2/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-31-gd54be2b')
> 1.11/ports/hawq/modules/utilities/utilities.sql_in:122:'git revision: __MADLIB_GIT_REVISION__, '
> 1.11/ports/postgres/9.5/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-31-gd54be2b')
> 1.11/ports/postgres/9.6/madpack/SQLCommon.m4:20:m4_define(`__MADLIB_GIT_REVISION__', `rel/v1.10.0-31-gd54be2b')
> 1.11/ports/postgres/modules/utilities/utilities.sql_in:122:'git revision: __MADLIB_GIT_REVISION__, '
> [root@ip-172-31-9-242 Versions]#
>
> On Wed, May 3, 2017 at 12:15 PM, Ed Espino  wrote:
>
> > I have taken a quick look at the DMG and a Linux RPM binary artifacts
> > (sorry haven't had time to build and/or test the binaries yet). But this
> > info might be of some benefit to the team sooner than later.
> >
> > Regards,
> > -=e
> > --
> > *Ed Espino*
>
> >
> > ==
> > PGP signature (source and convenience binaries): good
> > ==
> > Hashes 

Re: [VOTE] MADlib v1.11-rc1

2017-05-01 Thread Rahul Iyer
+1

On May 1, 2017 3:55 PM, "Rashmi Raghu"  wrote:

> Hello MADlib community,
>
> We have created a MADlib 1.11 RC-1, with the artifacts below up for a vote.
>
> This will be the 5th release for Apache MADlib (incubating).
>
> The main goals of this release are:
> * new module (PageRank for graph analytics with grouping support included)
> * improvements to existing modules (add grouping support to Single Source
> Shortest Path, reduce memory footprint of DT and RF, include NULL features
> in training DT, add support for array and svec output for Pivot module,
> utility to unnest 2-D arrays into rows of 1-D arrays)
> * platform updates (GPDB 5)
> * updates for Apache Top Level Project readiness and build process on
> Apache infrastructure
> * bug fixes
> * doc improvements
>
> For more information including release notes, please see:
> https://cwiki.apache.org/confluence/display/MADLIB/MADlib+1.11
>
> *** Please download, review and vote by Thu May 04, 2017 @ 6pm PDT ***
>
> We're voting upon the source (tag):  rc/1.11-rc1
> https://github.com/apache/incubator-madlib/tree/rc/1.11-rc1
>
> Source Files:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/1.11-incubating-rc1/
>
> Commit to be voted upon:
> https://github.com/apache/incubator-madlib/commit/0ff829a7060d08f284e8468ebf35c31b6e231d58
>
> KEYS file containing PGP Keys we use to sign the release:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/KEYS
>
> To help in tallying the vote, PMC members please be sure to indicate
> "(binding)" with the vote.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
>
> Regards,
> Rashmi Raghu
>
> --
> Rashmi Raghu, Ph.D.
> Pivotal Data Science
>


Re: understanding hello_world module example

2017-04-12 Thread Rahul Iyer
> Why is inOtherState.numRows being typecast to uint16_t here? Shouldn't
> it be uint64_t?
>
> I'd like to know if this is some magic I need to understand or simply a bug.

That's a bug. Thanks for pointing it out.
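
To illustrate why the narrower cast matters: a 16-bit unsigned type keeps only
the low 16 bits of a value, so any row count above 65535 would silently wrap
around. The sketch below mimics that effect with shell arithmetic;
truncate_u16 is a hypothetical helper for illustration, and the row counts
are made-up values, not taken from the hello_world module.

```shell
# truncate_u16 is a hypothetical helper that mimics a cast to uint16_t
# by keeping only the low 16 bits of a non-negative integer.
truncate_u16() {
    echo $(( $1 % 65536 ))
}

truncate_u16 1000     # a small row count survives the cast unchanged
truncate_u16 70000    # a larger row count wraps around to 4464
```

A uint64_t cast avoids this wrap-around for any realistic table size.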


Re: Apache Jenkins MADlib projects

2017-03-14 Thread Rahul Iyer
Thanks, Ed.

The master and PR integration would be quite useful for MADlib and are on
the cards. We're in the process of wrapping up our Docker work; once that goes
in, we can finalize these other projects.
It would be easier for us to start with the HAWQ projects as references -
could you please post their links?

Best,
iR

On Tue, Mar 14, 2017 at 8:15 AM, Ed Espino  wrote:

> I see Apache Jenkins build service testing in madlib-test-build is being
> worked on. This is pretty cool for the dev community. Is there a set of
> projects and GitHub *master* branch and *Pull Request* (PR) integration
> points being worked on?
>
> For what it is worth, here are some integration points we have for the HAWQ
> project that may be of use to MADlib:
>
>    - For each Pull Request (PR), perform the following checks (these go
>      along with the default conflict check performed automatically by
>      github):
>       - Perform build (compilation) and Apache Release Audit Tool (RAT)
>         check
>    - For each master branch submission:
>       - Perform build (compilation)
>       - Perform Apache Release Audit Tool (RAT) check
>       - Add "Embeddable Build Status Icon" to the project's README.md:
>         https://builds.apache.org/job/madlib-test-build/badge/
>
> Cheers,
> -=e
>
> --
> *Ed Espino*
>


Re: [VOTE] MADlib v1.10-rc2

2017-03-03 Thread Rahul Iyer
+1

On Fri, Mar 3, 2017 at 11:17 AM, Frank McQuillan 
wrote:

> Hello MADlib community,
>
> I am sending this email on behalf of the release manager Satoshi Nagayasu <
> sn...@uptime.jp> .
>
> We have created a MADlib 1.10 RC-2, with the artifacts below up for a vote.
>
> From project mentor Roman Shaposhnik we heard the ultimate resolution on
> the IP issue:
>* we don't do anything with existing (BSD) files even if we edit them
>* every new file we create gets an ASF license header
>* more details:
>
> https://issues.apache.org/jira/browse/LEGAL-293?focusedCommentId=15881595;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15881595
>
> RC-2 replaces RC-1 with the following changes:
>
> * Multiple: Update license headers per Apache guidance
> https://github.com/apache/incubator-madlib/commit/a3863b6c2407eb28ba007f6288d167bf88674e6d
>
> * Build: Fix module sort order for PGXN installation
> https://github.com/apache/incubator-madlib/commit/fa80240f72a6551c2ee567d471afa499fd1d1efe
>
> * Update the copyright year.
> https://github.com/apache/incubator-madlib/commit/0b8415e7eec5c9ebb83fbf22923c69a99b0056ef
>
> * Build: Add error for missing server includedir
> https://github.com/apache/incubator-madlib/commit/b3495c50bf491139ac245a21d97963e81892c610
>
> * Encode categorical: Add distributed_by in Postgresql w/ no-op
> https://github.com/apache/incubator-madlib/commit/7055dceb3fbde35bae602ac80d4b70486f015748
>
> * Renamed the top level source directory as suggested:
> apache-madlib-src-1.10-incubating
>
> This will be the 4th release for Apache MADlib (incubating).
>
> The main goals of this release are:
> * new modules (single source shortest path for graph analytics, encode
> categorical variables, K-nearest neighbors)
> * improvements to existing modules (add grouping support to elastic
> net and PCA, add cross validation to elastic net, array input for
> K-means, verbose output option for DT and RF, limit itemset size in
> association rules, various madpack installer improvements)
> * platform updates (PostgreSQL 9.6)
> * bug fixes
> * doc improvements
>
> For more information including release notes, please see:
> https://cwiki.apache.org/confluence/display/MADLIB/MADlib+1.10
>
> *** Please download, review and vote by Mon Mar 6, 2017 @ 6pm Pacific Time
> USA ***
>
> We're voting upon the source (tag):  rc/1.10.0-rc2
> https://github.com/apache/incubator-madlib/tree/rc/1.10.0-rc2
>
> Source Files:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/1.10.0-incubating-rc2/
>
> Commit to be voted upon:
> https://github.com/apache/incubator-madlib/commit/a3863b6c2407eb28ba007f6288d167bf88674e6d
>
> KEYS file containing PGP Keys we use to sign the release:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/KEYS
>
> To help in tallying the vote, can PMC members please be sure to
> indicate "(binding)" with their vote.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Regards,
> Frank McQuillan
>


Re: [VOTE] MADlib v1.10-rc1

2017-02-27 Thread Rahul Iyer
I have attached two files:

new_files_after_apache.txt: New files added since September 15, 2015 (grant
date) till date
files_w_apache_header.txt: Files that contain the Apache header right now.

Comparing the two lists leaves open questions about the files below.

Extra headers:
- sort-module.py has an Apache header but was created before the grant (it was
recently edited and the header was added then). *I'll fix this*.
- create_indicators.* have headers but were renamed from
data_preparation.*. *What is the legal guidance on this*?

No header:
- class_diagram.mp looks like a text file with no header, even though it
was added just after the grant. I'm not aware of the purpose of this file.
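
A comparison like the one described above can be sketched as a pair of set differences. This is only an illustration: the function name is hypothetical, and it assumes each attached list has already been read into memory with one path per entry.

```python
def compare_header_lists(new_files, files_with_header):
    """Compare two file lists and report mismatches in both directions.

    new_files: paths added after the grant date (expected to carry the header).
    files_with_header: paths that currently contain the Apache header.
    """
    new_set = set(new_files)
    header_set = set(files_with_header)
    return {
        # Added after the grant but carrying no header.
        "no_header": sorted(new_set - header_set),
        # Carrying a header but absent from the list of new files.
        "extra_header": sorted(header_set - new_set),
    }
```

Files in "extra_header" are not necessarily wrong; renamed files, for example, would show up there and still need a manual look.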



On Mon, Feb 27, 2017 at 4:42 PM, Frank McQuillan 
wrote:

> OK, so we need to go back and do the comparison from the original code
> grant in the fall of 2015 to the  current 1.10 release candidate.
>
> On Mon, Feb 27, 2017 at 4:19 PM, Roman Shaposhnik 
> wrote:
>
> > Frank, I'm not sure I understand the question. The criteria needs to hold
> > for anything that came in via the initial code ingest compared to how the
> > master of your project looks now.
> >
> > Thanks,
> > Roman.
> >
> > On Mon, Feb 27, 2017 at 4:10 PM, Frank McQuillan 
> > wrote:
> > > Roman,
> > >
> > > Does this apply retro-actively back to initial grant of the code to
> > ASF?  Or
> > > just from the last release 1.9.1?
> > >
> > > Frank
> > >
> > > On Sun, Feb 26, 2017 at 11:23 PM, Roman Shaposhnik <
> ro...@shaposhnik.org
> > >
> > > wrote:
> > >>
> > >> Here's the ultimate resolution on the IP issue:
> > >>* we don't do anything with existing (BSD) files even if we edit
> them
> > >>* every new file we create gets an ASF license header
> > >>
> > >> More details:
> > >>
> > >> https://issues.apache.org/jira/browse/LEGAL-293?
> > focusedCommentId=15881595=com.atlassian.jira.
> > plugin.system.issuetabpanels:comment-tabpanel#comment-15881595
> > >>
> > >> Thanks,
> > >> Roman.
> > >>
> > >> On Tue, Feb 21, 2017 at 5:54 PM, Frank McQuillan <
> fmcquil...@pivotal.io
> > >
> > >> wrote:
> > >> > Thanks Roman for working on this.
> > >> >
> > >> > If you feel a final answer will be ready next week, then yes by all
> > > >> > means I
> > >> > would suggest to the community that we wait and re-spin an RC2 with
> > the
> > >> > license headers issue resolved.  Seems less overhead and effort
> than a
> > > >> > quick follow on release right after 1.10.  Also, there is some momentum
> > >> > going
> > >> > with the legal discussion, so let's take advantage of that.
> > >> >
> > >> > Satoshi (release manager), are you OK pausing the RC2 until we hear
> > back
> > >> > from Roman next week?
> > >> >
> > >> > Thank you,
> > >> > Frank
> > >> >
> > >> >
> > >> > On Tue, Feb 21, 2017 at 4:45 PM, Roman Shaposhnik <
> > ro...@shaposhnik.org>
> > >> > wrote:
> > >> >
> > >> >> On Tue, Feb 21, 2017 at 2:55 PM, Frank McQuillan
> > >> >> 
> > >> >> wrote:
> > >> >> > Agree with Rahul re putting up an RC2 with the suggested changes
> > from
> > >> >> Roman,
> > >> >> > including incorporating Ed's comments on copyright year and top
> > level
> > >> >> folder
> > >> >> > naming.  These are really items but let's respond to the RC1
> > >> >> > reviewers
> > >> >> the
> > >> >> > best way we can.
> > >> >>
> > >> >> +1 to a respin.
> > >> >>
> > >> >> > Regarding the ASF legal issue being discussed, MADLib community
> is
> > >> >> > more
> > >> >> than
> > >> >> > happy to respond to any guidance from the fine folks at the ASF
> > >> >> > around
> > > >> >> > headers with appropriate licensing verbiage.  We just need to know
> > >> >> > what
> > >> >> that
> > >> >> > guidance is.
> > >> >>
> > >> >> Well, if you're ok respinning next week I hope to get you a final
> > >> >> answer by then.
> > >> >> Might as well kill two birds with the same RC. Or we can quickly
> do a
> > >> >> follow up
> > >> >> release once the licensing headers dust settles. Up to you guys.
> > >> >>
> > >> >> Thanks,
> > >> >> Roman.
> > >> >>
> > >
> > >
> >
>
pom.xml
examples/hello_world/iterative/simple_logistic.cpp
examples/hello_world/iterative/simple_logistic.hpp
examples/hello_world/iterative/simple_logistic.py_in
examples/hello_world/non-iterative/avg_var.cpp
examples/hello_world/iterative/simple_logistic.sql_in
examples/hello_world/non-iterative/avg_var.hpp
examples/hello_world/non-iterative/avg_var.sql_in
src/madpack/changelist_1.8_1.10.yaml
src/madpack/changelist_1.9_1.10.yaml
src/madpack/sort-module.py
doc/design/modules/graph.tex
src/modules/utilities/path.cpp
src/modules/utilities/path.hpp
methods/stemmer/src/pg_gp/porter_stemmer.c
src/modules/utilities/utilities.hpp
methods/stemmer/src/pg_gp/porter_stemmer.sql_in
src/ports/greenplum/5/CMakeLists.txt
src/ports/greenplum/cmake/FindGreenplum_5.cmake
src/ports/hawq/2/CMakeLists.txt
src/ports/hawq/cmake/FindHAWQ_2.cmake
src/ports/postgres/9.5/CMakeLists.txt

Re: [VOTE] MADlib v1.10-rc1

2017-02-16 Thread Rahul Iyer
+1

On Wed, Feb 15, 2017 at 7:27 PM, Satoshi Nagayasu  wrote:

> Hello MADlib community,
>
> We have created a MADlib 1.10 RC-1, with the artifacts below up for a vote.
>
> This will be the 4th release for Apache MADlib (incubating).
>
> The main goals of this release are:
> * new modules (single source shortest path for graph analytics, encode
> categorical variables, K-nearest neighbors)
> * improvements to existing modules (add grouping support to elastic
> net and PCA, add cross validation to elastic net, array input for
> K-means, verbose output option for DT and RF, limit itemset size in
> association rules, various madpack installer improvements)
> * platform updates (PostgreSQL 9.6)
> * bug fixes
> * doc improvements
>
> For more information including release notes, please see:
> https://cwiki.apache.org/confluence/display/MADLIB/MADlib+1.10
>
> *** Please download, review and vote by Sat Feb 18, 2017 @ 6pm PST ***
>
> We're voting upon the source (tag):  rc/1.10.0-rc1
> https://github.com/apache/incubator-madlib/tree/rc/1.10.0-rc1
>
> Source Files:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/1.
> 10.0-incubating-rc1/
>
> Commit to be voted upon:
> https://github.com/apache/incubator-madlib/commit/
> ea17530bfe22a1fde173d7fa83508cbcd9924c20
>
> KEYS file containing PGP Keys we use to sign the release:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/KEYS
>
> To help in tallying the vote, can PMC members please be sure to
> indicate "(binding)" with their vote.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> --
> Satoshi Nagayasu 
>


Re: Status of on-going PRs

2017-01-31 Thread Rahul Iyer
Hi Satoshi,

Thanks for compiling this list. Please find my comments inline.

On Tue, Jan 31, 2017 at 3:04 AM, Satoshi Nagayasu  wrote:

> Hi all,
>
> As release manager for 1.10, I just did a quick review and created a status
> list of the on-going PRs.
>
> https://github.com/apache/incubator-madlib/pulls
>
> If you have comments, please let me know. I will update the status.
>
> Status of the PRs
> -
> Use relative path for installation in GPDB/HAWQ #94
>   -> Need to be tested with GPDB/HAWQ.
>
> Build: Use only major version for GPDB 5, HAWQ 2 #91
>   -> Need review?
>
Testing is complete for both PRs. Both still require a review.

> Allow encode_categorical_variables() to use the svec type. #93
>   -> Need more work by the developer (me).
>
This would be better merged within the 1.10 release.
Adding it to the next version would require special handling during upgrade,
since the argument type changes (requiring a drop/replace during upgrade).


>  K-means: support for array input #89
>   -> Need more review, or ready for committer?
>
This looks ready to merge.

>
> JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc #81
>   -> Need more work by the developer.
>
> HAWQ2.1: Changes the cmake to assume any HAWQ 2.X system is 2.0 and #79
>   -> Need review, or ready for committer?
>
This is superseded by #91 and will be closed with it.


> Include boost::format in MathToolkit_impl.hpp. #76
>   -> Already merged. The PR can be closed.
>
I forgot to close this via the commit message, and it can now only be closed
manually by the contributor. If it is not closed soon, I'll close it with a
future commit.

> SVM: Implement c++ functions for training multi-class svm in mini-batch #75
>   -> The doc needs to be updated?
>
This requires substantially more work and discussion, as the scope of the
work is not defined. We will have to release without it.



>
> Regards,
> --
> Satoshi Nagayasu 
>


Re: full Python API for MADlib

2016-12-23 Thread Rahul Iyer
Hi Fatima,

Thanks for starting this work.

> When you get back to work please let me know if you are OK with me forking
> this code base, or if you are thinking to make any important changes.
>
I think forking the Pymadlib code and making it work for your use case would
be a great first step. I do have thoughts on what a Python API for MADlib
should achieve and how it could be designed, but that's a bigger
discussion we can have after we've made initial progress.

An important consideration in building the API is consistency, especially with
external APIs. For that purpose, following scikit-learn's structure would be
helpful. We can go over specifics after initial progress.


> Am also watching the videos so I get more familiar with available
> algorithms.
> In my next available time slot will install it all so am ready to start.
>

The videos and getting Pymadlib to work are the right goals to start with.
Feel free to let this forum know about progress and questions.

Best,
Rahul

> 2016-12-22 17:49 GMT-03:00 Fatima Castiglione Maldonado 发 <
> castiglionemaldon...@gmail.com>:
>
>> Great, thanks. We will check it once I read a bit more and then we can
>> talk.
>>
>> 2016-12-22 17:28 GMT-03:00 Frank McQuillan :
>>
>>> This is an early attempt at a Python interface for MADlib
>>> https://github.com/pivotalsoftware/pymadlib
>>> but I would say it is preliminary in nature and we may want to take a
>>> different approach.
>>>
>>> @riyer can provide more details on this.
>>>
>>> Frank
>>>
>>>
>>> On Thu, Dec 22, 2016 at 12:02 PM, Fatima Castiglione Maldonado 发 <
>>> castiglionemaldon...@gmail.com> wrote:
>>>
>>>> I am reading documentation:
>>>>
>>>> Apache MADlib (Incubating)
>>>> Created by Gavin, last modified by Frank McQuillan on Sep 13, 2016
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
>>>>
>>>> and watching videos:
>>>>
>>>> Pivotal Open Source Hub - Watch!
>>>> https://www.youtube.com/channel/UCLQV6NlSaIZBGym1mEczuqQ
>>>>
>>>> This should take me a few days.
>>>> If there is other relevant material please let me know.
>>>>
>>>> Best,
>>>> Fatima
>>>>
>>>> 2016-12-20 23:00 GMT-03:00 Fatima Castiglione Maldonado 发 <
>>>> castiglionemaldon...@gmail.com>:

> Hi Frank,
>
> thanks for your kind answer. Yeah, just yesterday night I saw the
> user's survey results, plus today I read some "top ten python APIs" doc 
> and
> MADlib was not there, so two plus two four.
>
> I would like to use the python API in a POC. My main expertise is
> coding, so would probably need some help with architecture design.
>
> Best,
> Fatima
>
>
>
>
>
>
> 2016-12-20 22:54 GMT-03:00 Frank McQuillan :
>
>> Hi Fatima,
>>
>> Thank you for your email and offer to participate.
>>
>> A Python API for MADlib is something that a lot of people have been
>> asking for.  In the recent MADlib survey, it was one of the top requests.
>>
>> There is already an R interface
>> https://cran.r-project.org/web/packages/PivotalR/
>> which could be a model for how to approach the Python interface.
>>
>> There have been some prototyping done, but the feature is a new one
>> for Python API.
>>
>> But before we get into details, would you be able to say a little bit
>> about your planned use for the Python API.  Also, what aspect would you 
>> be
>> interested in working on? - architecture, core software development,
>> testing, etc.
>>
>> Regards,
>> Frank
>>
>> On Tue, Dec 20, 2016 at 1:28 PM, Fatima Castiglione Maldonado 发 <
>> castiglionemaldon...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> does anybody know how to participate in the "full Python API for
>>> MADlib" effort? I would like to give a helping hand.
>>>
>>> Best,
>>> Fatima
>>>
>>>
>>>
>>> --
>>> =
>>> Fátima Castiglione Maldonado
>>> Singer, Designer, Creative, Artificial Intelligence
>>> Cantante, Diseñadora, Creativa, Inteligencia Artificial
>>> castiglionemaldon...@gmail.com
>>>
>>>  
>>>,'_   |
>>>  __|__|__|__
>>> <_  )_.--._
>>>   `---,--.-'  ,-'  `-.
>>>  ||  |  ,'`.
>>> ,'|  |,'`.
>>> |  _,-'  |__ /\
>>>   _,'-'`.   `---.___|_ \
>>>   .--'  -.  | _   `-. - |
>>>   |___|  |  |  \  ,- \  |
>>>   |___|  |===((|) | |
>>>   |

Re: [GitHub] incubator-madlib issue #80: KNN Added

2016-12-17 Thread Rahul Iyer
Hi Auon,

It looks like the PR was closed from your end. If you didn't close it on
github, then it could have been closed if that branch was deleted.
In any case, it's good that you have all the changes you need to make. If
you need, you can still access the PR with the comments on GitHub.

Best,
Rahul

On Fri, Dec 16, 2016 at 10:52 PM, Kazmi,Auon H  wrote:

> Hi NJ,
>
> I guess I was able to play around with branching and other stuff but my PR
> got deleted from madlib's repo. But that's okay as I have got the comments
> you made, in e-mails. I will work on them from tomorrow.
>
>
> Thanks for your help!
>
>
> Thanks,
>
> Auon
>
> 
> From: Kazmi,Auon H 
> Sent: Friday, December 16, 2016 11:09:11 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: [GitHub] incubator-madlib issue #80: KNN Added
>
> Hi NJ,
>
> Thanks for your detailed reply!
>
> I will try to do the said things.
>
>
>
> Thanks,
>
> Auon
>
> 
> From: Nandish Jayaram 
> Sent: Friday, December 16, 2016 8:32:52 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: [GitHub] incubator-madlib issue #80: KNN Added
>
> Hi Auon,
>
> Hope your exams went well.
>
> You can do whatever ends up being a better git-learning experience for you.
> Since you just started contributing to MADlib, the easier way to get going
> might be to do what you mentioned. But a better, though a longer way, would
> be to just mess around with branches as a learning experience. For instance
> (be warned, this might not be the best approach and it might sound
> daunting), you can do the following:
> - Create a new local branch (say the branch name is temp-features/knn)
> while on your current master branch (which already has the knn code changes
> in it).
> useful commands: git checkout -b temp-features/knn
> - Go back to your master branch and reset it to the commit SHA before you
> made changes for knn (look at git log command to find the appropriate
> commit SHA).
> useful commands: git log, git reset --hard <commit-SHA> (be careful while
> using the --hard flag in general).
> - You essentially want to reach a state where the new branch temp-features/knn
> has the code changes you have made so far for the knn feature, and your
> master branch must be in sync with apache/incubator-madlib's master branch.
> You ideally want your local master to be in sync with your repo master,
> which in turn must be in sync with origin master (apache/incubator-madlib).
> - You might also want to push your master (with --force option) to your
> remote repo, to undo the changes you have made to your repo master branch
> with the previous PR.
> useful commands: git push --force 
> - Now create a new branch off your master (say branch name features/knn).
> Rebase this new branch with the temp-features/knn branch. You will get the
> knn related changes back on this branch now.
> useful commands: git checkout -b features/knn, git rebase temp-features/knn
> - Address the comments on this PR, and then push the features/knn branch to
> your repo and open a new PR on the branch. Read about git rebase (and try
> using it) before pushing the branch.
> useful commands: (on master branch), git pull --ff-only, (on features/knn
> branch) git rebase -i master
>
> The useful commands I have mentioned might not do the needful for each
> step. They are just pointers for you. There might be a much simpler
> way to accomplish all this, and I have no idea if this way would actually
> work correctly. :) But you can (almost) always recover from any mistake you
> make on git.
>
> NJ
>
> On Fri, Dec 16, 2016 at 2:57 PM, Kazmi,Auon H  wrote:
>
> > HI NJ,
> >
> > Thanks for your input!
> >
> > Sorry, I was busy with my semester-end exams.
> >
> > I am reading on Git. Should I repeat the process of checking out madlib
> > repo and then again making changes in a local branch?
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > 
> > From: njayaram2 
> > Sent: Thursday, December 15, 2016 6:24:08 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: [GitHub] incubator-madlib issue #80: KNN Added
> >
> > Github user njayaram2 commented on the issue:
> >
> > https://github.com/apache/incubator-madlib/pull/80
> >
> > This is a great start!
> > I will provide some github-specific feedback here, and more
> > knn-specific
> > comments in the code.
> > Git can be daunting to use at first, but it's great once you get a
> > hang of it.
> > I would recommend you go through the following wonderful book if you
> > have not already done so:
> > https://git-scm.com/book/en/v2
> >
> > When you work on a feature/bug, it is best if you create a branch
> > locally
> > and make all changes for that feature there. You can then push that
> > branch
> > into your github repo 

Re: Spatial model in MADlib (GWR)

2016-09-13 Thread Rahul Iyer
Hi Chengliang

There's some information on debugging in our old wiki page. There's no
example there, but the process is simple once you have the server process ID.

- Rahul

On Tue, Sep 13, 2016 at 6:41 AM, Wang ChenLiang  wrote:

> Hi Frank,
>
> I was being on a business trip for several months and began to work on
> MADlib again in the past few days. But I have a trouble with debugging
> MADlib with GDB. Could you kindly give me a detailed example for
> debugging MADlib with CodeBlocks or GDB?
>
> Many Thanks !
>
>
> On 03/15/2016 12:31 AM, Frank McQuillan wrote:
> > OK.  Please don't hesitate to ask if you have any questions.
> >
> > Frank
> >
> > On Mon, Mar 14, 2016 at 4:17 AM, chenliang wang 
> wrote:
> >
> >> Hi, Frank
> >>
> >> Recently,I am just looking at the detail of development guide and trying
> >> to complete the serial algorithm. And I plan to implement GWR dividing
> >> the loop into pieces of chunks executed in several nodes. However, I am
> >> not sure if there are some specials details need to be designed for
> >> distributed models in GPDB because I haven't developed model in MPP
> >> architecture. I hope this distributed manner would be implemented
> easily.
> >>
> >> Best,
> >> Chenliang Wang
> >>
> >> On 03/10/2016 08:33 AM, Frank McQuillan wrote:
> >>> Hi ChenLiang Wang,
> >>>
> >>> I am checking to see how things are going regarding the GWR model for
> >>> MADlib that you proposed.  Not sure which phase you are at, but a
> >> suggested
> >>> next step might be how you plan to implement the GWR algorithm in a
> >>> distributed manner.  That is, how will it run in parallel?
> >>>
> >>> (Starting as a new thread since the previous thread fragmented.)
> >>>
> >>> Regards,
> >>> Frank
> >>>
> >>
> >
>


Re: [VOTE] MADlib v1.9.1-rc2

2016-09-06 Thread Rahul Iyer
+1

On Tue, Sep 6, 2016 at 9:52 AM, Feng, Xixuan (Aaron) 
wrote:

> +1
>
> Thank you guys for all the hard work!
>
>


Re: [VOTE] MADlib v1.9.1-rc1

2016-09-02 Thread Rahul Iyer
Note to anyone creating future archives on OS X: set COPYFILE_DISABLE=1 when
running tar to avoid including the '._' files in the archive.

$ COPYFILE_DISABLE=1 tar czf apache-madlib-1.9.1-incubating-source.tar.gz \
    apache-madlib-1.9.1-incubating-source
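
If the archive is ever built from a Python script instead, a similar result can be sketched with the standard tarfile module by filtering out entries whose names start with '._'. Note that this is a different technique from COPYFILE_DISABLE: it drops entries by name rather than changing what tar collects. The function names below are hypothetical.

```python
import os
import tarfile


def skip_appledouble(info):
    """Return None for '._*' entries so tarfile leaves them out."""
    if os.path.basename(info.name).startswith("._"):
        return None
    return info


def make_source_tarball(src_dir, out_path):
    """Write src_dir into a gzip-compressed tarball, minus '._*' files."""
    top_level = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=top_level, filter=skip_appledouble)
```

Listing the result with "tar tzf" before uploading is still a sensible check.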

On Fri, Sep 2, 2016 at 9:28 AM, Frank McQuillan <fmcquil...@pivotal.io>
wrote:

> Thanks.  I will re-send the [VOTE] request.
>
> On Fri, Sep 2, 2016 at 9:25 AM, Rahul Iyer <ri...@pivotal.io> wrote:
>
> > New RC uploaded with source files at
> > https://dist.apache.org/repos/dist/dev/incubator/madlib/1.9.
> > 1-incubating-rc2/
> >
> > Everything else remains the same.
> >
> > On Fri, Sep 2, 2016 at 8:59 AM, Frank McQuillan <fmcquil...@pivotal.io>
> > wrote:
> >
> > > I think that is the safest approach, to create a new RC.
> > >
> > > Let us cancel the vote on RC-1 and when RC-2 is posted, I will call
> for a
> > > new vote
> > >
> > > Thank you Satoshi for catching this.
> > >
> > > Frank
> >
>


Re: [VOTE] MADlib v1.9.1-rc1

2016-09-02 Thread Rahul Iyer
cubating).
> >>
> >> The main goals of this release are:
> >> * new modules (1-class SVM for novelty detection, prediction metrics,
> >> sessionization, pivoting)
> >> * improvements to existing modules (class weights in SVM, overlapping
> >> patterns in path)
> >> * performance improvements (path)
> >> * platform updates (PostgreSQL 9.5 and 9.6)
> >> * bug fixes
> >> * doc improvements
> >>
> >> For more information including release notes, please see:
> >> https://cwiki.apache.org/confluence/display/MADLIB/MADlib+1.9.1
> >>
> >> *** Please download, review and vote by Tues Sep 6, 2016 @ 6pm PST ***
> >>
> >> We're voting upon the source (tag):  rc/1.9.1-rc1
> >>
> >> Source Files:
> >> https://dist.apache.org/repos/dist/dev/incubator/madlib/1.9.
> 1-incubating-rc1/
> >>
> >> Commit to be voted upon:
> >> https://git-wip-us.apache.org/repos/asf?p=incubator-madlib.
> git;a=commit;h=e1c99c1538dc124c9b323ba76382ba2af05c6892
> >>
> >> KEYS file containing PGP Keys we use to sign the release:
> >> https://dist.apache.org/repos/dist/dev/incubator/madlib/KEYS
> >>
> >> To help in tallying the vote, can PMC members please be sure to indicate
> >> "(binding)" with their vote.
> >>
> >> [ ] +1  approve
> >> [ ] +0  no opinion
> >> [ ] -1  disapprove (and reason why)
> >>
> >> Thank you,
> >> Frank McQuillan
> >
> >
> >
> > --
> > Satoshi Nagayasu <sn...@uptime.jp>
>
>
>
> --
> Satoshi Nagayasu <sn...@uptime.jp>
>



-- 

-
Rahul Iyer
Principal software engineer | Predictive Analytics

*Pivotal**A new platform for a new era*


Re: [VOTE] MADlib v1.9.1-rc1

2016-09-01 Thread Rahul Iyer
+1

On Thu, Sep 1, 2016 at 12:17 PM, Frank McQuillan <fmcquil...@pivotal.io>
wrote:

> Hello MADlib community,
>
> We have created a MADlib 1.9.1 release candidate, with the artifacts below
> up for a vote.
>
> This will be the 3rd release for Apache MADlib (incubating).
>
> The main goals of this release are:
> * new modules (1-class SVM for novelty detection, prediction metrics,
> sessionization, pivoting)
> * improvements to existing modules (class weights in SVM, overlapping
> patterns in path)
> * performance improvements (path)
> * platform updates (PostgreSQL 9.5 and 9.6)
> * bug fixes
> * doc improvements
>
> For more information including release notes, please see:
> https://cwiki.apache.org/confluence/display/MADLIB/MADlib+1.9.1
>
> *** Please download, review and vote by Tues Sep 6, 2016 @ 6pm PST ***
>
> We're voting upon the source (tag):  rc/1.9.1-rc1
>
> Source Files:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/1.9.
> 1-incubating-rc1/
>
> Commit to be voted upon:
> https://git-wip-us.apache.org/repos/asf?p=incubator-madlib.git;a=commit;h=
> e1c99c1538dc124c9b323ba76382ba2af05c6892
>
> KEYS file containing PGP Keys we use to sign the release:
> https://dist.apache.org/repos/dist/dev/incubator/madlib/KEYS
>
> To help in tallying the vote, can PMC members please be sure to indicate
> "(binding)" with their vote.
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Thank you,
> Frank McQuillan
>





Re: create_indicator_variables() with svec?

2016-08-09 Thread Rahul Iyer
Thanks, Satoshi.

I created a feature request (MADLIB-1013
<https://issues.apache.org/jira/browse/MADLIB-1013>). This will probably go
into 1.9.2, since we've already started the release process for 1.9.1.

On Mon, Aug 8, 2016 at 7:21 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:

> Hi Rahul,
>
> 2016-08-09 2:05 GMT+09:00 Rahul Iyer <ri...@pivotal.io>:
> > Array output for *create_indicator_variables* would be quite helpful when
> > number of categories is large and the svec representation would be ideal
> > for it. There might be similar implications for *pivoting*, but we can
> keep
> > that as future discussion.
>
> Sounds great.
>
> > I'm curious about how you're using the indicator variables - svec is not
> > widely supported in MADlib (yet) and might not give much benefit after
> the
> > encoding is complete.
>
> I'm trying to implement some recommendation or similarity search stuff
> for several media items (movies, books, documents, else) with its metadata.
> It has several categorical variables, such as authors, publishers,
> actors/actresses, genres, else. Some of them have many categories.
>
> BTW, I'm a starter of data-mining and machine-learning, not having much
> experience.
>
> Of course, I can reduce number of those categories, but playing with raw
> data would be more fun. :)
>
> Regards,
> --
> Satoshi Nagayasu <sn...@uptime.jp>
>





Re: Contributing GMM and Perceptron to MADLib

2016-03-28 Thread Rahul Iyer
I can assign this to you, but you need to have an account on
https://issues.apache.org.
If you already have an account, then please send your ID; I wasn't able to
find you using just your name.

On Mon, Mar 28, 2016 at 3:31 PM, Aditya Nain <adityana...@gmail.com> wrote:

> Hi Rahul,
>
> Thanks for the reply!
>
> I am working on implementing Gaussian Mixture Model assuming that the
> co-variance matrix is same for all the Gaussians.
> The JIRA which deals with GMM is MADLIB-410:
> https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>
> Can this be assigned to me, or how do I get it assigned to me?
>
> Thanks,
> Aditya
>
> On Mon, Mar 21, 2016 at 3:41 PM, Rahul Iyer <ri...@pivotal.io> wrote:
>
> > Hi Aditya,
> >
> > Welcome to the MADlib community!
> >
> > Gaussian mixture models are extremely useful and we would heartily
> welcome
> > a contribution for it. The SQLEM paper might be oversimplifying the
> > capabilities of the database (e.g. assuming there is no array type is
> > unnecessary for Postgresql). You could speed things (both dev time and
> > execution time) by writing some of the functions in C++. K-means is an
> > example of how clustering is implemented.
> > IMO, assuming the same covariance matrix is reasonable. We could extend
> the
> > capabilities after the initial implementation is complete.
> >
> > There was some work started a long time ago that built perceptrons using
> > the convex framework (link <https://github.com/iyerr3/madlib/tree/mlp>).
> > There are still some bugs in that code since the trained network isn't
> > converging. You could start there or build a new module - either ways an
> > MLP module is frequently demanded by the data science community.
> >
> > I would suggest starting with Gaussian mixtures and then moving to
> > perceptrons if GMM work is completed.
> >
> > Feel free to ask questions on this forum. Looking forward to
> collaborating
> > with you.
> >
> > Best,
> > Rahul
> >
> > On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain <adityana...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > My name is Aditya Nain, and I am a graduate student at University of
> > > Florida.
> > > I have been learning MADLib for a while and want to contribute to
> MADLib.
> > > I went through some of the open stories in JIRA and started working on
> > > MADLIB-410  :
> > >
> > >
> >
> https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
> > >
> > > which is about implementing Gaussian Mixture Model using Expectation
> > > Maximization (EM) algorithm.
> > >
> > > I came across the following paper while searching for distributed EM
> > > algorithm which can be implemented in MADLib.
> > >
> > > Carlos Ordonez, Paul Cereghini "SQLEM: fast clustering in SQL using the
> > EM
> > > algorithm" ACM SIGMOD Record, Volume 29 Issue 2, June 2000 Pages
> 559-570.
> > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
> > >
> > > I thought of implementing the approach discussed in the paper, but the
> > > paper makes an assumption that the covariance matrix is the same for
> all
> > > the clusters ( i.e covariance matrix is same for all the Gaussian
> > > distributions). So, I wanted to know the opinion of the community if
> it's
> > > fine to go with the assumption made in the paper and implement it in
> > > MADLib.
> > >
> > > Also, currently MADLib doesn't have an implementation of a perceptron,
> > nor
> > > did I find any open story related to it in JIRA. I came across the
> > > following paper, which talks about a distributed algorithm for
> > perceptron :
> > >
> > > Ryan McDonald, Keith Hall, Gideon Mann "Distributed training strategies
> > for
> > > the structured perceptron"
> > > http://dl.acm.org/citation.cfm?id=1858068
> > >
> > > Would it be useful to have a distributed implementation of perceptron in
> > > MADlib?
> > >
> > > Thanks,
> > > Aditya
> > >
> >
>


Re: Contributing GMM and Perceptron to MADLib

2016-03-21 Thread Rahul Iyer
Hi Aditya,

Welcome to the MADlib community!

Gaussian mixture models are extremely useful and we would heartily welcome
a contribution for them. The SQLEM paper might be oversimplifying the
capabilities of the database (e.g. assuming there is no array type is
unnecessary for PostgreSQL). You could speed things up (both dev time and
execution time) by writing some of the functions in C++. K-means is an
example of how clustering is implemented.
IMO, assuming the same covariance matrix is reasonable. We could extend the
capabilities after the initial implementation is complete.
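
To make the shared-covariance assumption concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture in which all components share a single variance. It is plain Python for illustration only: it is neither the SQLEM formulation nor MADlib code, the function name is hypothetical, and it does not handle components that end up with no points.

```python
import math


def em_tied_variance(xs, means, n_iter=50):
    """EM for a 1-D Gaussian mixture whose components share one variance."""
    n, k = len(xs), len(means)
    means = list(means)
    weights = [1.0 / k] * k
    # Start from the overall sample variance.
    center = sum(xs) / n
    var = sum((x - center) ** 2 for x in xs) / n
    for _ in range(n_iter):
        # E-step: responsibilities. Because the variance is shared, the
        # Gaussian normalizing constant is identical for every component
        # and cancels out.
        resp = []
        for x in xs:
            logs = [math.log(weights[j]) - (x - means[j]) ** 2 / (2 * var)
                    for j in range(k)]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: per-component weights and means, one pooled variance.
        nk = [sum(r[j] for r in resp) for j in range(k)]
        weights = [nk[j] / n for j in range(k)]
        means = [sum(r[j] * x for r, x in zip(resp, xs)) / nk[j]
                 for j in range(k)]
        var = sum(r[j] * (x - means[j]) ** 2
                  for r, x in zip(resp, xs) for j in range(k)) / n
    return weights, means, var
```

The only difference from the usual per-component update is the last line of the M-step, which pools the weighted squared deviations across all components and divides by the total number of points.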

Some work was started a long time ago that built perceptrons using
the convex framework (link <https://github.com/iyerr3/madlib/tree/mlp>).
There are still some bugs in that code, since the trained network isn't
converging. You could start there or build a new module. Either way, an
MLP module is frequently requested by the data science community.

I would suggest starting with Gaussian mixtures and then moving to
perceptrons if GMM work is completed.

Feel free to ask questions on this forum. Looking forward to collaborating
with you.

Best,
Rahul

On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain  wrote:

> Hi,
>
> My name is Aditya Nain, and I am a graduate student at University of
> Florida.
> I have been learning MADLib for a while and want to contribute to MADLib.
> I went through some of the open stories in JIRA and started working on
> MADLIB-410  :
>
> https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>
> which is about implementing Gaussian Mixture Model using Expectation
> Maximization (EM) algorithm.
>
> I came across the following paper while searching for distributed EM
> algorithm which can be implemented in MADLib.
>
> Carlos Ordonez, Paul Cereghini "SQLEM: fast clustering in SQL using the EM
> algorithm" ACM SIGMOD Record, Volume 29 Issue 2, June 2000 Pages 559-570.
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
>
> I thought of implementing the approach discussed in the paper, but the
> paper makes an assumption that the covariance matrix is the same for all
> the clusters ( i.e covariance matrix is same for all the Gaussian
> distributions). So, I wanted to know the opinion of the community if it's
> fine to go with the assumption made in the paper and implement it in
> MADLib.
>
> Also, MADlib currently doesn't have an implementation of a perceptron, nor
> did I find any open story related to it in JIRA. I came across the
> following paper, which describes a distributed algorithm for perceptrons:
>
> Ryan McDonald, Keith Hall, Gideon Mann "Distributed training strategies for
> the structured perceptron"
> http://dl.acm.org/citation.cfm?id=1858068
>
> Would it be useful to have a distributed implementation of a perceptron in
> MADlib?
>
> Thanks,
> Aditya
>


Re: Generic per-element array ops?

2016-02-26 Thread Rahul Iyer
I think there would be a lot of benefit for MADlib if such operators are in
core. Questions like the one raised in the PR would be better answered by
the Postgres community.

We have built generic functions with the same idea (see [1]). We define the
following operations with arrays:

function(array,array)->array (calls: General_2Array_to_Array)
function(array,scalar)->array (calls: General_Array_to_Array)
function(array,array)->scalar (calls: General_2Array_to_Element)
function(array,scalar)->scalar (calls: General_Array_to_Element)
function(array,scalar)->struct (calls: General_Array_to_Struct)


We also define about 20 floating point element operations. Each
element-wise array operation then boils down to just calling the specific
General* function.
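
To illustrate the pattern, here is a rough Python sketch (not the actual C
code in [1]). All names below are hypothetical stand-ins chosen for this
example; they only loosely mirror the General_* helpers listed above.

```python
from typing import Callable, List, Sequence

ElementOp = Callable[[float, float], float]

# Element-level operations: each takes two floats and returns one float.
def elem_add(a: float, b: float) -> float:
    return a + b

def elem_mult(a: float, b: float) -> float:
    return a * b

# Generic drivers: one per input/output shape.
def two_arrays_to_array(op: ElementOp, left: Sequence[float],
                        right: Sequence[float]) -> List[float]:
    """function(array, array) -> array"""
    if len(left) != len(right):
        raise ValueError("arrays must have the same length")
    return [op(a, b) for a, b in zip(left, right)]

def array_scalar_to_array(op: ElementOp, arr: Sequence[float],
                          scalar: float) -> List[float]:
    """function(array, scalar) -> array"""
    return [op(a, scalar) for a in arr]

def two_arrays_to_scalar(op: ElementOp, fold: ElementOp,
                         left: Sequence[float], right: Sequence[float],
                         init: float = 0.0) -> float:
    """function(array, array) -> scalar: combine pairs, then fold."""
    if len(left) != len(right):
        raise ValueError("arrays must have the same length")
    acc = init
    for a, b in zip(left, right):
        acc = fold(acc, op(a, b))
    return acc

# Each user-facing operation is then a one-line call into a driver.
def array_add(left, right):
    return two_arrays_to_array(elem_add, left, right)

def array_scalar_mult(arr, scalar):
    return array_scalar_to_array(elem_mult, arr, scalar)

def array_dot(left, right):
    return two_arrays_to_scalar(elem_mult, elem_add, left, right)
```

A generic array_op('+', a, b) as Jim proposes would amount to looking up
the element operation by name and handing it to the matching driver.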

[1]
https://github.com/apache/incubator-madlib/blob/master/methods/array_ops/src/pg_gp/array_ops.c

Best,
Rahul

On Fri, Feb 26, 2016 at 11:18 AM Jim Nasby  wrote:

> Looking at [1] reminded me of something I've felt is missing from
> Postgres arrays: the ability to perform arbitrary operations on arrays
> on an element-by-element basis. You can sort-of simulate that with
> unnest, but it's awkward and slow.
>
> Instead of functions for specific per-element operations (ie:
> array_add()), would a more generic function (ie: array_op('+', array1,
> array2)) benefit MADlib? I suspect the Postgres community would accept
> such a function in core.
>
> [1]
>
> https://github.com/apache/incubator-madlib/pull/22/files#diff-ed598467a50f51272f2a5ad73c503a34L702
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
>


Re: build trying to install dependencies automatically

2016-02-16 Thread Rahul Iyer
I haven't seen this before. All dependencies should go to the same prefix
folder as the madlib build.
- Are you seeing something being installed to "/bin"? Where are you
building madlib and what's the prefix you're using?
- Could you please let us know the platform details and any other
information to help reproduce?

On Mon, Feb 15, 2016 at 3:53 PM, Ivan Novick  wrote:

> Hi there,
>
> It seems the build is trying to download and install some software.
>
> However, it's failing while trying to install into the bin directory owned
> by root.
>
> So the question is: how do I install the dependencies as root separately
> from building madlib, which I want to build and install not as root?
>
> Does this make sense or match with others' experiences?
>
> Ivan
>


Re: beta docs for path functions

2016-02-15 Thread Rahul Iyer
If you run 'make doc', the beta documentation will be available at
'doc/user/html/path_8sql__in.html'.

On Mon, Feb 15, 2016 at 6:21 PM, Ivan Novick  wrote:

> Hi all,
>
> I have built the latest madlib code and installed it in GPDB dev build.
>
> Do we have any beta docs or README for the new PATH functionality being
> built?
>
> Cheers,
> Ivan
>


Re: Bayesian Analysis using MADlib (Gibbs Sampling for Probit Regression)

2016-01-15 Thread Rahul Iyer
Thanks for your comments, Caleb.

@Gautam: as I mentioned in the community call today, we have an
aggregate function, crossprod(float8[], float8[]), that can be used to
perform the X'X and X'Y operation.
- for X'X, the row_vec column would be both vector inputs
- for X'Y, the row_vec column of X would be the first input and the Y
value as an array would be the 2nd input (crossprod needs to treat the
Y as a 1x1 vector).
You would, however, have to be careful of the X'X output - it's the
matrix flattened into an array, so you would have to reshape it.

As Caleb said, we would benefit from inspecting the distribution of the
two input matrices in matrix_mult and switching between the currently
implemented inner product and this crossprod aggregate (outer
product).
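
A rough Python sketch of the idea, for illustration only: this is not
MADlib's implementation, the function names are hypothetical, and
row-major flattening is an assumption of the sketch.

```python
def crossprod(x, y):
    """Outer product of vectors x and y, flattened row-major into a list."""
    return [xi * yj for xi in x for yj in y]

def crossprod_agg(rows):
    """Sum the flattened outer products over all (x, y) row pairs.

    Mimics an aggregate: each row contributes independently, so the
    result is built in a single pass over the data.
    """
    total = None
    for x, y in rows:
        contrib = crossprod(x, y)
        if total is None:
            total = contrib
        else:
            total = [t + c for t, c in zip(total, contrib)]
    return total

def reshape(flat, n_rows, n_cols):
    """Turn the flattened aggregate back into a list-of-rows matrix."""
    return [flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]

# Each tuple is (row_vec of X, Y value) from the same source row.
table = [([1.0, 2.0], 5.0), ([3.0, 4.0], 6.0)]

# X'X: the row vector is used for both inputs.
xtx = reshape(crossprod_agg((x, x) for x, _ in table), 2, 2)

# X'Y: the Y value is wrapped in a one-element array.
xty = crossprod_agg((x, [y]) for x, y in table)
```

Both results come from one scan each over the same rows, with no join and
no redistribution of data.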

On Fri, Jan 15, 2016 at 2:52 PM, Caleb Welton  wrote:
> Sorry I missed the community call this morning.  I heard that this was
> discussed in more detail, but haven't seen the minutes of the call posted
> yet.  Here are a couple more thoughts on this:
>
> The matrix-operation-based implementation offered by Gautam is an intuitive
> and logical way of describing the algorithm. If we had an efficient way of
> expressing algorithms like this, it would greatly simplify the process of
> adding new algorithms and lower the barrier to entry for contributions to
> the project. That would be a good thing, so I wanted to spend a bit more
> thought on what this would take and why this solution is not efficient
> today.
>
> Primarily the existing implementation we have for calculating X_T_X in
> MADlib is significantly more efficient than the implementation within
> madlib.matrix_mult(), but the implementation in madlib.matrix_mult() is
> much more general purpose.  The existing implementation is hard coded to
> handle the fact that both X and t(X) are operating on the same matrix and
> that this specific calculation is such that each row of the matrix becomes
> the column in the transpose that it is multiplied with meaning that if we
> have all the data for the row then the contributions from that row can be
> calculated without any additional redistribution of data.  Further since
> they are the same table we don't have to join the two tables together to
> get that data and we can complete the entire operation with a single scan
> of one table.  We do not seem to have the optimization for this very
> special case enabled in madlib.matrix_mult() resulting in the
> implementation of the multiplication being substantially slower.
>
> Similarly for X_T_Y in our typical cases X and Y are both in the same
> initial input table and in some ways we can think of "XY" as a single
> matrix that we have simply sliced vertically to produce X and Y as separate
> matrices, this means that despite X and Y being different matrices from the
> mathematical expression of the model we can still use the same in-place
> logic that we used for X_T_X.  As expressed in the current
> madlib.matrix_mult() api there is no easy way for matrix_mult to recognize
> this relationship and so we end up forced to go the inefficient route even
> if we added the special case optimization when the left and right sides of
> the multiplication are transpositions of the same matrix.
>
> One path forward that would help make this type of implementation viable
> would be to add some of these optimizations and possible API
> enhancements to the matrix_mult code to make the implementation more
> efficient. Going this route, we could probably get from a 30X performance
> hit down to only a 2X performance hit, the remainder coming from having to
> make separate scans for X_T_X and X_T_Y rather than being able to combine
> both calculations in a single scan of the data.  Reducing that last 2X
> would take more effort and a greater level of sophistication in our
> optimization routines.  The general case would likely require some amount
> of code generation.
>
> Regards,
>   Caleb
>
> On Thu, Jan 14, 2016 at 5:32 PM, Caleb Welton  wrote:
>
>> Great seeing the prototype work here, I'm sure that there is something
>> that we can find from this work that we can bring into MADlib.
>>
>> However... It is a very different implementation from the existing
>> algorithms, calling into the madlib matrix functions directly rather than
>> having the majority of the work done within the abstraction layer.
>> Unfortunately this leads to a very inefficient implementation.
>>
>> As demonstration of this I ran this test case:
>>
>> Dataset: 1 dependent variable, 4 independent variables + intercept,
>> 10,000,00 observations
>>
>> Run using Postgres 9.4 on a Macbook Pro:
>>
>> Creating the X matrix from source table: 13.9s
>> Creating the Y matrix from source table: 9.1s
>> Computing X_T_X via matrix_mult: 169.2s
>> Computing X_T_Y via matrix_mult: 114.8s
>>
>> Calling madlib.linregr_train directly (implicitly calculates all of the
>> above as well as inverting the X_T_X matrix and calculating some other

Re: MADlib 1.8 Random Forest error (array_of_bigint)

2015-11-30 Thread Rahul Iyer
Hi Tetsuo,

I don't think it's the 'id' that is causing this issue, rather the array of
features. Decision tree combines the continuous and categorical features in
two separate arrays - one of those (most probably the continuous feature)
is empty for a particular tuple. I can't comment more without looking at
the dataset.

Within the array operations module, we're returning the message as
"array_of_bigint" for a float array. That's a minor messaging bug; I'll fix
that as part of the next commit.

Best,
Rahul

On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi 
wrote:

> Hi,
>
> I am currently getting an error with the MADlib Random Forest function in
> MADlib 1.8.0.  Below is the code I tried.
>
> DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
> SELECT madlib.forest_train('test_rf_data', -- input table name
>     'rf_output',     -- output table name
>     'id',            -- id column
>     'duration',      -- dependent variable
>     '*',             -- list of features
>     NULL,            -- exclude columns
>     'linkid',        -- grouping column
>     2::integer,      -- # of trees
>     5::integer,      -- # of random features
>     TRUE::boolean,   -- importance
>     1,               -- # of permutations
>     5,               -- max_tree_depth
>     10,              -- min_split
>     3,               -- min_bucket
>     10               -- number of splits per continuous variable
> );
>
> When I tried this with all linkids (the grouping column has 362 linkids),
> I got the error shown in "error_random_forest.txt", attached here. The
> error message says I have an invalid array length but does not give any
> details about which features in the data have this issue.
>
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
>
> I guessed this is an error for the bigint columns, but the only bigint
> column is the "id" column. I once had an error because some features had
> identical values in all records, but that is not the case this time
> because I changed the sample size for each linkid to 1000 or above.
> It seems something is zero, given that the DETAIL says "0 given", but I
> have no idea what in the data this is referring to.
>
>
> The schema of the input table is as below;
> CREATE TABLE input_table (
>     id bigint,
>     linkid varchar(32),
>     duration double precision,
>     sat_flg int,
>     sun_flg int,
>     holiday_flg int,
>     semi_holiday_flg int,
>     renkyu_flg int,
>     ave_temp numeric,
>     ave_wind numeric,
>     precip numeric,
>     radiation numeric,
>     ave_speed numeric,
>     travel_time numeric
> );
>
> Can anybody please let me know what the possible cause of this error is?
> The MADlib linear regression worked without any problems.
>
> I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>
>
> Thank you,
>
> Tetsuo
>