Re: MLlib mission and goals

2017-01-23 Thread Stephen Boesch
Along the lines of #1: Spark Packages seemed to get off to a good start about
two years ago, but now no more than a handful are in general use, e.g. the
Databricks CSV package. Browsing the available packages, the majority are
incomplete, empty, unmaintained, or unclear.

Any ideas on how to resurrect Spark Packages in a way that there will be
sufficient adoption for it to be meaningful?

2017-01-23 17:03 GMT-08:00 Joseph Bradley :

> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
>
> Copying from the previous thread:
>
> *Seth:*
> """
> I would love to hear some discussion on the higher level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended for developers to easily
> implement their own custom code (e.g. custom optimization libraries), or do
> we want to restrict things to out-of-the-box algorithms? Should we focus on
> more flexible, general abstractions like distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
>
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
>
>
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
>
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for DataFrame-based API
> * completing and improving testing for model persistence
> * Python, R parity
>
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
>
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML-on-Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this.  Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
>
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
>
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
>
> Thanks,
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>


MLlib mission and goals

2017-01-23 Thread Joseph Bradley
This thread is split off from the "Feedback on MLlib roadmap process
proposal" thread for discussing the high-level mission and goals for
MLlib.  I hope this thread will collect feedback and ideas, not necessarily
lead to huge decisions.

Copying from the previous thread:

*Seth:*
"""
I would love to hear some discussion on the higher level goal of Spark
MLlib (if this derails the original discussion, please let me know and we
can discuss in another thread). The roadmap does contain specific items
that help to convey some of this (ML parity with MLlib, model persistence,
etc...), but I'm interested in what the "mission" of Spark MLlib is. We
often see PRs for brand new algorithms which are sometimes rejected and
sometimes not. Do we aim to keep implementing more and more algorithms? Or
is our focus really, now that we have a reasonable library of algorithms,
to simply make the existing ones faster/better/more robust? Should we aim
to make interfaces that are easily extended for developers to easily
implement their own custom code (e.g. custom optimization libraries), or do
we want to restrict things to out-of-the-box algorithms? Should we focus on
more flexible, general abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this
discussion may have happened, but I think it would be useful to either
revisit it or restate it here for some of the newer developers.
"""

*Mingjie:*
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past *trajectory*:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and
making the library more robust.

I agree with Seth that a few *immediate goals* are very clear:
* feature parity for DataFrame-based API
* completing and improving testing for model persistence
* Python, R parity
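
To make the model persistence item concrete, here is a minimal sketch of the
save/load round trip we want to be rock-solid in the DataFrame-based API (the
tiny dataset and the path are made up, purely for illustration):

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("persistence-sketch").getOrCreate()

// A tiny made-up training set with the conventional "label" / "features" columns.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)

// Save and reload through the DataFrame-based persistence API; the path is arbitrary.
model.write.overwrite().save("/tmp/lr-model-sketch")
val restored = LogisticRegressionModel.load("/tmp/lr-model-sketch")
println(restored.coefficients)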

*In the future*, it's harder to say, but if I had to pick my top 2 items,
I'd list:

*(1) Making MLlib more extensible*
It will not be feasible to support a huge number of algorithms, so allowing
users to customize their ML-on-Spark workflows will be critical.  This is
IMO the most important thing we could do for MLlib.
Part of this could be building a healthy community of Spark Packages, and
we will need to make it easier for users to write their own algorithms and
packages to facilitate this.  Part of this could be allowing users to
customize existing algorithms with custom loss functions, etc.
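
As a rough illustration of the extension point that already exists (just a
sketch with an invented column name, not a proposal for a new API), a
user-defined pipeline stage only needs to extend Transformer:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lower
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// A trivial custom stage that lower-cases a hard-coded "text" column.
class LowerCaseTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("lowercase"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("text_lower", lower(dataset("text")))

  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField("text_lower", StringType, nullable = true))

  override def copy(extra: ParamMap): LowerCaseTransformer = defaultCopy(extra)
}

Making stages like this easy to persist, test, and expose in Python/R is where
package authors still hit friction, which is part of what this item is about.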

*(2) Consistent improvements to core algorithms*
A less exciting but still very important item will be constantly improving
the core set of algorithms in MLlib. This could mean speed, scaling,
robustness, and usability for the few algorithms which cover 90% of use
cases.

There are plenty of other possibilities, and it will be great to hear the
community's thoughts!

Thanks,
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] 


Re: Feedback on MLlib roadmap process proposal

2017-01-23 Thread Joseph Bradley
Hi Seth,

The proposal is geared towards exactly the issue you're describing:
providing more visibility into the capacity and intentions of committers.
If there are things you'd add or change to improve it further, it would
be great to hear your ideas!  The past roadmap JIRA has some more background
discussion which is worth looking at too.

Let's break off the MLlib mission discussion into another thread.  I'll
start one now.

Thanks,
Joseph

On Thu, Jan 19, 2017 at 1:51 PM, Felix Cheung 
wrote:

> Hi Seth
>
> Re: "The most important thing we can do, given that MLlib currently has a
> very limited committer review bandwidth, is to make clear which issues, if
> worked on, will definitely get reviewed."
>
> We are adopting a shepherd model, as described in Joseph's JIRA: once
> assigned, the shepherd will see an issue through with the contributor
> to make sure it lands in the target release.
>
> I'm sure Joseph can explain it better than I do ;)
>
>
> _
> From: Mingjie Tang 
> Sent: Thursday, January 19, 2017 10:30 AM
> Subject: Re: Feedback on MLlib roadmap process proposal
> To: Seth Hendrickson 
> Cc: Joseph Bradley , 
>
>
>
> +1 general abstractions like distributed linear algebra.
>
> On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson <
> seth.hendrickso...@gmail.com> wrote:
>
>> I think the proposal laid out in SPARK-18813 is well done, and I do think
>> it is going to improve the process going forward. I also really like the
>> idea of getting the community to vote on JIRAs to give some of them
>> priority - provided that we listen to those votes, of course. The biggest
>> problem I see is that we do have several active contributors and those who
>> want to help implement these changes, but PRs are reviewed rather
>> sporadically and I imagine it is very difficult for contributors to
>> understand why some get reviewed and some do not. The most important thing
>> we can do, given that MLlib currently has a very limited committer review
>> bandwidth, is to make clear which issues, if worked on, will definitely get
>> reviewed. A hard thing to do in open source, no doubt, but even if we have
>> to limit the scope of such issues to a very small subset, it's a gain for
>> all I think.
>>
>> On a related note, I would love to hear some discussion on the higher
>> level goal of Spark MLlib (if this derails the original discussion, please
>> let me know and we can discuss in another thread). The roadmap does contain
>> specific items that help to convey some of this (ML parity with MLlib,
>> model persistence, etc...), but I'm interested in what the "mission" of
>> Spark MLlib is. We often see PRs for brand new algorithms which are
>> sometimes rejected and sometimes not. Do we aim to keep implementing more
>> and more algorithms? Or is our focus really, now that we have a reasonable
>> library of algorithms, to simply make the existing ones faster/better/more
>> robust? Should we aim to make interfaces that are easily extended for
>> developers to easily implement their own custom code (e.g. custom
>> optimization libraries), or do we want to restrict things to out-of-the-box
>> algorithms? Should we focus on more flexible, general abstractions like
>> distributed linear algebra?
>>
>> I was not involved in the project in the early days of MLlib when this
>> discussion may have happened, but I think it would be useful to either
>> revisit it or restate it here for some of the newer developers.
>>
>> On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
>> wrote:
>>
>>> Hi all,
>>>
>>> This is a general call for thoughts about the process for the MLlib
>>> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>>>
>>> Summary:
>>> * This process is about committers indicating intention to shepherd and
>>> review.
>>> * The goal is to improve visibility and communication.
>>> * This is fairly orthogonal to the SIP discussion since this proposal is
>>> more about setting release targets than about proposing future plans.
>>>
>>> Thanks!
>>> Joseph
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] 
>>>
>>
>>
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] 


Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-23 Thread Julien Le Dem
Thank you Cheng!

On Mon, Jan 23, 2017 at 12:02 PM, Cheng Lian  wrote:

> Sorry for being late. I'm building a Spark branch based on the most recent
> master to test out 1.8.2-rc1 and will post my result here ASAP.
>
> Cheng
>
> On 1/23/17 11:43 AM, Julien Le Dem wrote:
>
> Hi Spark dev,
> Here is the voting thread for the Parquet 1.8.2 release.
> Cheng, or someone else: we would appreciate it if you could verify it as
> well and reply to the thread.
>
> On Mon, Jan 23, 2017 at 11:40 AM, Julien Le Dem  wrote:
>
>> +1
>> Followed: https://cwiki.apache.org/confluence/display/PARQUET/How+To+
>> Verify+A+Release
>> checked sums, ran the build and tests.
>> We would appreciate it if someone from the Spark project (Cheng?) could
>> verify the release as well.
>> CC'ing spark
>>
>>
>> On Mon, Jan 23, 2017 at 10:15 AM, Ryan Blue 
>> wrote:
>>
>>> +1
>>>
>>> On Mon, Jan 23, 2017 at 10:15 AM, Daniel Weeks
>>>  
>>> wrote:
>>>
>>> > +1 checked sums, built, tested
>>> >
>>> > On Mon, Jan 23, 2017 at 9:58 AM, Ryan Blue 
>>> 
>>> > wrote:
>>> >
>>> > > Gabor, that md5 matches what I get. Are you sure you used the right
>>> file?
>>> > > It isn’t the same format that md5sum produces, but if you check the
>>> > > octets the hash matches.
>>> > >
>>> > > [blue@work Downloads]$ md5sum apache-parquet-1.8.2.tar.gz
>>> > > b3743995bee616118c28f324598684ba  apache-parquet-1.8.2.tar.gz
>>> > >
>>> > > rb
>>> > > ​
>>> > >
>>> > > On Thu, Jan 19, 2017 at 8:06 AM, Gabor Szadovszky <
>>> > > gabor.szadovs...@cloudera.com> wrote:
>>> > >
>>> > > > Hi Ryan,
>>> > > >
>>> > > > I’ve downloaded the tar and checked the signature and the
>>> checksums.
>>> > SHA
>>> > > > and ASC are fine. MD5 is not and the content does not seem to be a
>>> > common
>>> > > > MD5 either:
>>> > > > apache-parquet-1.8.2.tar.gz: B3 74 39 95 BE E6 16 11  8C 28 F3 24
>>> 59 86
>>> > > 84
>>> > > > BA
>>> > > >
>>> > > > The artifacts on Nexus are good with all the related signatures and
>>> > > > checksums. The source zip properly contains the files from the repo
>>> > with
>>> > > > the tag apache-parquet-1.8.2.
>>> > > >
>>> > > > Regards,
>>> > > > Gabor
>>> > > >
>>> > > > > On 19 Jan 2017, at 04:09, Ryan Blue  wrote:
>>> > > > >
>>> > > > > Hi everyone,
>>> > > > >
>>> > > > > I propose the following RC to be released as official Apache
>>> Parquet
>>> > > > 1.8.2
>>> > > > > release.
>>> > > > >
>>> > > > > The commit id is c6522788629e590a53eb79874b95f6c3ff11f16c
>>> > > > > * This corresponds to the tag: apache-parquet-1.8.2
>>> > > > > * https://github.com/apache/parquet-mr/tree/c6522788
>>> > > > > *
>>> > > > > https://git-wip-us.apache.org/repos/asf/projects/repo?p=
>>> > > > parquet-mr.git=commit=c6522788
>>> > > > >
>>> > > > > The release tarball, signature, and checksums are here:
>>> > > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-
>>> > > > parquet-1.8.2-rc1
>>> > > > >
>>> > > > > You can find the KEYS file here:
>>> > > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
>>> > > > >
>>> > > > > Binary artifacts are staged in Nexus here:
>>> > > > > *
>>> > > > > https://repository.apache.org/content/groups/staging/org/
>>> > > > apache/parquet/parquet/1.8.2/
>>> > > > >
>>> > > > > This is a patch release with backports from the master branch.
>>> For a
>>> > > > > detailed summary, see the spreadsheet here:
>>> > > > >
>>> > > > > *
>>> > > > > https://docs.google.com/spreadsheets/d/1NAuY3c77Egs6REu-
>>> > > > UVkQqPswpVYVgZTTnY3bM0SPVRs/edit#gid=0
>>> > > > >
>>> > > > > Please download, verify, and test.
>>> > > > >
>>> > > > > Please vote by the end of Monday, 18 January.
>>> > > > >
>>> > > > > [ ] +1 Release this as Apache Parquet 1.8.2
>>> > > > > [ ] +0
>>> > > > > [ ] -1 Do not release this because...
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > --
>>> > > > > Ryan Blue
>>> > > >
>>> > > >
>>> > >
>>> > >
>>> > > --
>>> > > Ryan Blue
>>> > > Software Engineer
>>> > > Netflix
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>>
>> --
>> Julien
>>
>
>
>
> --
> Julien
>
>
>


-- 
Julien


Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-23 Thread Cheng Lian
Sorry for being late. I'm building a Spark branch based on the most recent
master to test out 1.8.2-rc1 and will post my result here ASAP.


Cheng



Re: [VOTE] Release Apache Parquet 1.8.2 RC1

2017-01-23 Thread Julien Le Dem
Hi Spark dev,
Here is the voting thread for the Parquet 1.8.2 release.
Cheng, or someone else: we would appreciate it if you could verify it as well
and reply to the thread.

On Mon, Jan 23, 2017 at 11:40 AM, Julien Le Dem  wrote:

> +1
> Followed: https://cwiki.apache.org/confluence/display/PARQUET/
> How+To+Verify+A+Release
> checked sums, ran the build and tests.
> We would appreciate it if someone from the Spark project (Cheng?) could
> verify the release as well.
> CC'ing spark
>
>
> On Mon, Jan 23, 2017 at 10:15 AM, Ryan Blue 
> wrote:
>
>> +1
>>
>> On Mon, Jan 23, 2017 at 10:15 AM, Daniel Weeks > >
>> wrote:
>>
>> > +1 checked sums, built, tested
>> >
>> > On Mon, Jan 23, 2017 at 9:58 AM, Ryan Blue 
>> > wrote:
>> >
>> > > Gabor, that md5 matches what I get. Are you sure you used the right
>> file?
>> > > It isn’t the same format that md5sum produces, but if you check the
>> > > octets the hash matches.
>> > >
>> > > [blue@work Downloads]$ md5sum apache-parquet-1.8.2.tar.gz
>> > > b3743995bee616118c28f324598684ba  apache-parquet-1.8.2.tar.gz
>> > >
>> > > rb
>> > > ​
>> > >
>> > > On Thu, Jan 19, 2017 at 8:06 AM, Gabor Szadovszky <
>> > > gabor.szadovs...@cloudera.com> wrote:
>> > >
>> > > > Hi Ryan,
>> > > >
>> > > > I’ve downloaded the tar and checked the signature and the checksums.
>> > SHA
>> > > > and ASC are fine. MD5 is not and the content does not seem to be a
>> > common
>> > > > MD5 either:
>> > > > apache-parquet-1.8.2.tar.gz: B3 74 39 95 BE E6 16 11  8C 28 F3 24
>> 59 86
>> > > 84
>> > > > BA
>> > > >
>> > > > The artifacts on Nexus are good with all the related signatures and
>> > > > checksums. The source zip properly contains the files from the repo
>> > with
>> > > > the tag apache-parquet-1.8.2.
>> > > >
>> > > > Regards,
>> > > > Gabor
>> > > >
>> > > > > On 19 Jan 2017, at 04:09, Ryan Blue  wrote:
>> > > > >
>> > > > > Hi everyone,
>> > > > >
>> > > > > I propose the following RC to be released as official Apache
>> Parquet
>> > > > 1.8.2
>> > > > > release.
>> > > > >
>> > > > > The commit id is c6522788629e590a53eb79874b95f6c3ff11f16c
>> > > > > * This corresponds to the tag: apache-parquet-1.8.2
>> > > > > * https://github.com/apache/parquet-mr/tree/c6522788
>> > > > > *
>> > > > > https://git-wip-us.apache.org/repos/asf/projects/repo?p=
>> > > > parquet-mr.git=commit=c6522788
>> > > > >
>> > > > > The release tarball, signature, and checksums are here:
>> > > > > * https://dist.apache.org/repos/dist/dev/parquet/apache-
>> > > > parquet-1.8.2-rc1
>> > > > >
>> > > > > You can find the KEYS file here:
>> > > > > * https://dist.apache.org/repos/dist/dev/parquet/KEYS
>> > > > >
>> > > > > Binary artifacts are staged in Nexus here:
>> > > > > *
>> > > > > https://repository.apache.org/content/groups/staging/org/
>> > > > apache/parquet/parquet/1.8.2/
>> > > > >
>> > > > > This is a patch release with backports from the master branch.
>> For a
>> > > > > detailed summary, see the spreadsheet here:
>> > > > >
>> > > > > *
>> > > > > https://docs.google.com/spreadsheets/d/1NAuY3c77Egs6REu-
>> > > > UVkQqPswpVYVgZTTnY3bM0SPVRs/edit#gid=0
>> > > > >
>> > > > > Please download, verify, and test.
>> > > > >
>> > > > > Please vote by the end of Monday, 18 January.
>> > > > >
>> > > > > [ ] +1 Release this as Apache Parquet 1.8.2
>> > > > > [ ] +0
>> > > > > [ ] -1 Do not release this because...
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Ryan Blue
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > Ryan Blue
>> > > Software Engineer
>> > > Netflix
>> > >
>> >
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
>
> --
> Julien
>



-- 
Julien


Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-23 Thread Michael Allman
Hi Stan,

What OS/version are you using?

Michael

> On Jan 22, 2017, at 11:36 PM, StanZhai  wrote:
> 
> I'm using Parallel GC.
> rxin wrote:
>> Are you using G1 GC? G1 sometimes uses a lot more memory than the size
>> allocated.
>>
>> On Sun, Jan 22, 2017 at 12:58 AM StanZhai wrote:
>> 
>>> Hi all,
>>> 
>>> 
>>> 
>>> We just upgraded our Spark from 1.6.2 to 2.1.0.
>>> 
>>> 
>>> 
>>> Our Spark application is started by spark-submit with a config of
>>> `--executor-memory 35G` in standalone mode, but the actual memory use goes
>>> up to 65G after a full GC (jmap -histo:live $pid), as follows:
>>> 
>>> 
>>> 
>>> test@c6 ~ $ ps aux | grep CoarseGrainedExecutorBackend
>>> 
>>> test  181941  181 34.7 94665384 68836752 ?   Sl   09:25 711:21
>>> 
>>> /home/test/service/jdk/bin/java -cp
>>> 
>>> 
>>> /home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/spark/conf/:/home/test/service/spark/jars/*:/home/test/service/hadoop/etc/hadoop/
>>> 
>>> -Xmx35840M -Dspark.driver.port=47781 -XX:+PrintGCDetails
>>> 
>>> -XX:+PrintGCDateStamps -Xloggc:./gc.log -verbose:gc
>>> 
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
>>> spark://CoarseGrainedScheduler@.xxx:47781 --executor-id 1
>>> --hostname test-192 --cores 36 --app-id app-20170122092509-0017
>>> --worker-url spark://Worker@test-192:33890
>>> 
>>> 
>>> 
>>> Our Spark jobs are all SQL.
>>> 
>>> 
>>> 
>>> The excess memory looks like off-heap memory, but the default value of
>>> 
>>> `spark.memory.offHeap.enabled` is `false`.
>>> 
>>> 
>>> 
>>> We didn't have this problem in Spark 1.6.x; what causes it in Spark
>>> 2.1.0?
>>> 
>>> 
>>> 
>>> Any help is greatly appreciated!
>>> 
>>> 
>>> 
>>> Best,
>>> 
>>> Stan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 


Re: A question about creating persistent table when in-memory catalog is used

2017-01-23 Thread Xiao Li
Reynold mentioned the direction we are heading in. You can see that many of
the PRs the community has submitted are aimed at this target. To achieve it,
there is a lot of work we need to do.

For example, for some serdes the Hive metastore will infer the schema when
it is not provided, but our InMemoryCatalog does not have such a capability.
Thus, we need to see how to resolve this.

Hopefully this answers your question. BTW, the issue you mentioned at the
beginning has been resolved; please fetch the latest master. You are no longer
able to create such a Hive serde table without Hive support.
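
To make that concrete, here is a rough sketch of the difference (the table
names and the local session config are made up for illustration):

import org.apache.spark.sql.SparkSession

// Sketch only: a local session pinned to the in-memory catalog.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("in-memory-catalog-sketch")
  .config("spark.sql.catalogImplementation", "in-memory")
  .getOrCreate()

// A regular data source table still works: it can be created and written to;
// its metadata just isn't persisted across sessions.
spark.sql("CREATE TABLE t2 (id INT, name STRING) USING parquet")
spark.sql("INSERT INTO t2 VALUES (1, 'name1')")
spark.sql("SELECT * FROM t2").show()

// A Hive serde table (CREATE TABLE without a USING clause) is the case that
// master now rejects when Hive support is not available:
// spark.sql("CREATE TABLE t1 (id INT, name STRING)")   // expected to fail

The data source path works because only the table metadata lives in the
(non-persistent) in-memory catalog, while the Hive serde path additionally
depends on Hive support, which is exactly the coupling Reynold mentioned we
want to remove.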

Thanks,

Xiao Li


2017-01-23 0:01 GMT-08:00 Shuai Lin :

> Cool, thanks for the info.
>
> I think this is something we are going to change to completely decouple
>> the Hive support and catalog.
>
>
> Is there a ticket for this? I did a search in JIRA and only found
> "SPARK-16275: Implement all the Hive fallback functions", which seems to be
> related to it.
>
>
> On Mon, Jan 23, 2017 at 3:21 AM, Xiao Li  wrote:
>
>> Agree. : )
>>
>> 2017-01-22 11:20 GMT-08:00 Reynold Xin :
>>
>>> To be clear, there are two separate "Hive"s we are talking about here. One
>>> is the catalog, and the other is the Hive serde and UDF support. We want to
>>> get to a point where the choice of catalog does not impact the functionality
>>> in Spark other than where the catalog is stored.
>>>
>>>
>>> On Sun, Jan 22, 2017 at 11:18 AM Xiao Li  wrote:
>>>
 We have a pending PR to block users from creating Hive serde tables when
 using InMemoryCatalog. See: https://github.com/apache/spark/pull/16587
 I believe it answers your question.

 BTW, we can still create regular data source tables and insert data into
 them. The major difference is whether the metadata is stored persistently
 or not.

 Thanks,

 Xiao Li

 2017-01-22 11:14 GMT-08:00 Reynold Xin :

 I think this is something we are going to change to completely decouple
 the Hive support and catalog.


 On Sun, Jan 22, 2017 at 4:51 AM Shuai Lin 
 wrote:

 Hi all,

 Currently when the in-memory catalog is used, e.g. through `--conf
 spark.sql.catalogImplementation=in-memory`, we can create a persistent
 table, but inserting into this table would fail with the error message "Hive
 support is required to insert into the following tables..".

 sql("create table t1 (id int, name string, dept string)") // OK
 sql("insert into t1 values (1, 'name1', 'dept1')")  // ERROR


 This doesn't make sense to me, because this table would always be
 empty if we can't insert into it, and thus would be of no use. But I wonder if
 there are other good reasons for the current logic. If not, I would propose
 raising an error when creating the table in the first place.

 Thanks!

 Regards,
 Shuai Lin (@lins05)








>>
>