Re: [build system] short jenkins downtime tomorrow morning, 11-13-2015 @ 7am PST

2015-11-13 Thread shane knapp
this is still ongoing.  the update is running 'chown -R jenkins' on
the jenkins root directory, which is a hair under 3T.

this might take a while...  :\

shane

On Fri, Nov 13, 2015 at 6:36 AM, shane knapp  wrote:
> this is happening now.
>
> On Thu, Nov 12, 2015 at 12:14 PM, shane knapp  wrote:
>> i will admit that it does seem like a bad idea to poke jenkins on
>> friday the 13th, but there's a release that fixes a lot of security
>> issues:
>>
>> https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-11-11
>>
>> i'll set jenkins to stop kicking off any new builds around 5am PST,
>> and will upgrade and restart jenkins around 7am PST.  barring anything
>> horrible happening, we should be back up and building by 730am.
>>
>> ...and this time, i promise not to touch any of the plugins.  :)
>>
>> shane




Re: Seems jenkins is down (or very slow)?

2015-11-13 Thread shane knapp
were you hitting any particular URL when you noticed this, or was it
generally slow?

On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai  wrote:
> Hi Guys,
>
> Seems Jenkins is down or very slow? Does anyone else experience it or just
> me?
>
> Thanks,
>
> Yin




Re: [build system] short jenkins downtime tomorrow morning, 11-13-2015 @ 7am PST

2015-11-13 Thread shane knapp
phew.  this is finally done...  jenkins is up and building.

On Fri, Nov 13, 2015 at 7:16 AM, shane knapp  wrote:
> this is still ongoing.  the update is running 'chown -R jenkins' on
> the jenkins root directory, which is a hair under 3T.
>
> this might take a while...  :\
>
> shane
>
> On Fri, Nov 13, 2015 at 6:36 AM, shane knapp  wrote:
>> this is happening now.
>>
>> On Thu, Nov 12, 2015 at 12:14 PM, shane knapp  wrote:
>>> i will admit that it does seem like a bad idea to poke jenkins on
>>> friday the 13th, but there's a release that fixes a lot of security
>>> issues:
>>>
>>> https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-11-11
>>>
>>> i'll set jenkins to stop kicking off any new builds around 5am PST,
>>> and will upgrade and restart jenkins around 7am PST.  barring anything
>>> horrible happening, we should be back up and building by 730am.
>>>
>>> ...and this time, i promise not to touch any of the plugins.  :)
>>>
>>> shane




Re: A proposal for Spark 2.0

2015-11-13 Thread Kostas Sakellis
We have veered off the topic of Spark 2.0 a little bit here - yes, we can
talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to
propose we have one more 1.x release after Spark 1.6. This will allow us to
stabilize a few of the new features that were added in 1.6:

1) the experimental Datasets API
2) the new unified memory manager.
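
For reference, a minimal sketch of what the experimental Datasets API in item 1
looks like, assuming the Spark 1.6 preview API; the case class and values below
are illustrative, not from this thread:

    // Minimal Dataset sketch against the Spark 1.6 preview API (illustrative).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ds-sketch").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // A Dataset offers RDD-style typed lambdas on top of the same engine
        // that backs DataFrames.
        val people = Seq(Person("Ann", 31), Person("Bo", 25)).toDS()
        val adults = people.filter(_.age >= 30).map(_.name)
        adults.show()

        sc.stop()
      }
    }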

I understand our goal for Spark 2.0 is to offer an easy transition but
there will be users that won't be able to seamlessly upgrade given what we
have discussed as in scope for 2.0. For these users, having a 1.x release
with these new features/APIs stabilized will be very beneficial. This might
make Spark 1.7 a lighter release but that is not necessarily a bad thing.

Any thoughts on this timeline?

Kostas Sakellis



On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao  wrote:

> Agreed, more features/APIs/optimizations need to be added in DF/DS.
>
>
>
> I mean, we need to think about what kind of RDD APIs we have to provide to
> developers; maybe the fundamental API is enough, like the ShuffledRDD
> etc.  But PairRDDFunctions is probably not in this category, as we can do the
> same thing easily with DF/DS, with even better performance.
>
>
>
> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
> *Sent:* Friday, November 13, 2015 11:23 AM
> *To:* Stephen Boesch
>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Hmmm... to me, that seems like precisely the kind of thing that argues for
> retaining the RDD API but not as the first thing presented to new Spark
> developers: "Here's how to use groupBy with DataFrames Until the
> optimizer is more fully developed, that won't always get you the best
> performance that can be obtained.  In these particular circumstances, ...,
> you may want to use the low-level RDD API while setting
> preservesPartitioning to true.  Like this"
>
>
>
> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch  wrote:
>
> My understanding is that the RDDs presently have more support for
> complete control of partitioning, which is a key consideration at scale.
> While partitioning control is still piecemeal in DF/DS, it would seem
> premature to make RDDs a second-tier approach to Spark dev.
>
>
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions.  AFAIK Spark SQL
> still does not allow that knowledge to be applied to the optimizer - so
> a full shuffle will be performed. However, in the native RDD we can use
> preservesPartitioning=true.
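
To make the contrast concrete, a minimal sketch under the assumptions above
(illustrative class, column, and key names; not code from this thread): the
DataFrame route leaves the shuffle decision to the optimizer, while the RDD
route declares preservesPartitioning so a later reduceByKey can reuse the
existing partitioner:

    // Editorial sketch contrasting the two routes discussed above (illustrative).
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PartitioningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)

        // DataFrame route: groupBy/sum goes through Catalyst; the optimizer
        // decides the physical plan, including any shuffle.
        val df = sqlContext.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 3))).toDF("k", "v")
        df.groupBy("k").sum("v").show()

        // RDD route: partition explicitly, then tell Spark the keys are untouched.
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
          .partitionBy(new HashPartitioner(4))
        val scaled = pairs.mapPartitions(
          iter => iter.map { case (k, v) => (k, v * 10) },
          preservesPartitioning = true)
        // Because the partitioner is retained, this reduceByKey does not reshuffle.
        println(scaled.reduceByKey(_ + _).collect().toSeq)

        sc.stop()
      }
    }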
>
>
>
> 2015-11-12 17:42 GMT-08:00 Mark Hamstra :
>
> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
>
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao  wrote:
>
> I am not sure what the best practice is for this specific problem, but it's
> really worth thinking about for 2.0, as it is a painful issue for lots of
> users.
>
>
>
> By the way, is it also an opportunity to deprecate the RDD API (or make it
> internal-only)? Lots of its functionality overlaps with
> DataFrame or DataSet.
>
>
>
> Hao
>
>
>
> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
> *Sent:* Friday, November 13, 2015 5:27 AM
> *To:* Nicholas Chammas
> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
> Reynold Xin
>
>
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> I know we want to keep breaking changes to a minimum but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation with user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> setting it true by default, and not allow any spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a whitelist
> if we want, but it'd be good if we start requiring user apps to explicitly
> pull in all their dependencies. From what I can tell, Hadoop 3 is also
> moving in this direction.
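
For reference, a minimal sketch of opting into this behavior today, assuming
the existing spark.driver.userClassPathFirst / spark.executor.userClassPathFirst
settings in 1.x (the proposal above would effectively make this the default):

    // Sketch: user-classpath-first loading via existing Spark 1.x settings.
    import org.apache.spark.{SparkConf, SparkContext}

    object ClassPathIsolationSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("isolation-sketch")
          // Prefer classes from the user's jars over Spark's own jars and their
          // transitive dependencies (the driver setting applies in cluster mode).
          .set("spark.driver.userClassPathFirst", "true")
          .set("spark.executor.userClassPathFirst", "true")
        val sc = new SparkContext(conf)
        // ... application code ...
        sc.stop()
      }
    }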
>
>
>
> Kostas
>
>
>
> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> With regards to Machine learning, it 

let spark streaming sample come to stop

2015-11-13 Thread Renyi Xiong
Hi,

I tried to run the following 1.4.1 sample by putting a words.txt under
localdir:

bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir

Two questions:

1. It does not pick up words.txt - because it's 'old', I guess. Is there any
option to let it be picked up?
2. I managed to put in a 'new' file on the fly, which got picked up, but after
processing, the program doesn't stop (it keeps generating empty RDDs instead).
Is there any option to let it stop when no new files come in (otherwise it
blocks others when I want to run multiple samples)?

Thanks,
Renyi.
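
For reference, a rough sketch of what the HdfsWordCount example does, plus one
way to bound its runtime so it exits even when no new files arrive - assuming
the Spark 1.4 streaming API; the directory argument and the 60-second timeout
are illustrative only:

    // Rough sketch of the HdfsWordCount example with a bounded runtime (illustrative).
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HdfsWordCountSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("HdfsWordCountSketch"), Seconds(2))

        // textFileStream only picks up files that appear after the stream starts,
        // which is why a pre-existing words.txt is treated as "old" and ignored.
        val lines = ssc.textFileStream(args(0))
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        // Bound the run instead of awaitTermination(), so the job stops even if
        // no new files come in (60 seconds here is purely illustrative).
        ssc.awaitTerminationOrTimeout(60 * 1000)
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }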


Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
It only runs tests that are impacted by the change. E.g. if you only modify
SQL, it won't run the core or streaming tests.
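
For illustration only - not Spark's actual dev/run-tests logic - a toy sketch of
the idea: map changed file paths to test modules so that an SQL-only change
skips core and streaming:

    // Toy sketch of change-based test selection (not the real dev/run-tests code).
    object TestSelectionSketch {
      private val moduleRoots = Seq(
        "sql/" -> "sql",
        "core/" -> "core",
        "streaming/" -> "streaming")

      def modulesFor(changedFiles: Seq[String]): Set[String] =
        changedFiles.flatMap { path =>
          moduleRoots.collectFirst { case (prefix, module) if path.startsWith(prefix) => module }
        }.toSet

      def main(args: Array[String]): Unit = {
        // An SQL-only change triggers only the sql module's tests.
        println(modulesFor(Seq("sql/core/src/main/scala/Foo.scala")))  // Set(sql)
      }
    }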


On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu  wrote:

> Hi,
> I noticed that SparkPullRequestBuilder completes much faster than maven
> Jenkins build.
>
> From
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45871/consoleFull
> , I couldn't get exact time the builder started but looks like the duration
> was around 20 minutes.
>
> From
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/4099/console
> :
>
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 01:42 h
>
>
> Can someone enlighten me on the sets of tests executed by 
> SparkPullRequestBuilder ?
>
>
> BTW I noticed that recent Jenkins builds were not in good shape:
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/
>
>
> Comments are welcome.
>
>


Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
I actually tried to build a binary for 1.4.2 and wanted to start voting,
but there was an issue with the release script that failed the jenkins job.
Would be great to kick off a 1.4.2 release.


On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee  wrote:

> Hi All,
>
>
> I'm wondering if Spark 1.4.2 has been voted on by any chance, or if I have
> overlooked it and we are targeting 1.4.3?
>
>
> By looking at the JIRA
>
>
> https://issues.apache.org/jira/browse/SPARK/fixforversion/12332833/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>
>
> All issues were resolved and there are no blockers. Does anyone know what
> happened to this release?
>
>
>
> or was there any recommendation to skip that and ask users to use Spark
> 1.5.2 instead?
>


Incubator Proposal for Spark-Kernel

2015-11-13 Thread DavidFallside
Hello, I wanted to make known a new Apache Incubator proposal for
"Spark-Kernel", https://wiki.apache.org/incubator/SparkKernelProposal, which
provides applications with a mechanism to interactively and remotely access
Spark. The proposal is just starting to be discussed on the general
incubator list and the Spark community's input would be very valuable.
Thanks,
David







Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Ted Yu
Can I assume that for any particular test, if it passes reliably on
SparkPullRequestBuilder,
it should pass on maven Jenkins?

If so, should flaky test(s) be disabled, strengthened, and enabled again?

Cheers

On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin  wrote:

> It only runs tests that are impacted by the change. E.g. if you only
> modify SQL, it won't run the core or streaming tests.
>
>
> On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu  wrote:
>
>> Hi,
>> I noticed that SparkPullRequestBuilder completes much faster than maven
>> Jenkins build.
>>
>> From
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45871/consoleFull
>> , I couldn't get exact time the builder started but looks like the duration
>> was around 20 minutes.
>>
>> From
>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/4099/console
>> :
>>
>> [INFO] ------------------------------------------------------------------------
>> [INFO] BUILD FAILURE
>> [INFO] ------------------------------------------------------------------------
>> [INFO] Total time: 01:42 h
>>
>>
>> Can someone enlighten me on the sets of tests executed by 
>> SparkPullRequestBuilder ?
>>
>>
>> BTW I noticed that recent Jenkins builds were not in good shape:
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/
>>
>>
>> Comments are welcome.
>>
>>
>


Re: pyspark with pypy not work for spark-1.5.1

2015-11-13 Thread Davies Liu
We already test CPython 2.6, CPython 3.4 and PyPy 2.5; it took more
than 30 min to run (without parallelization),
so I think that should be enough.

PyPy 2.2 is too old, and we don't have enough resources to support it.

On Fri, Nov 6, 2015 at 2:27 AM, Chang Ya-Hsuan  wrote:
> Hi, I ran ./python/run-tests to test the following modules of spark-1.5.1:
>
> ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql',
> 'pyspark-streaming']
>
> against the following pypy versions:
>
> pypy-2.2.1  pypy-2.3  pypy-2.3.1  pypy-2.4.0  pypy-2.5.0  pypy-2.5.1
> pypy-2.6.0  pypy-2.6.1  pypy-4.0.0
>
> except for pypy-2.2.1, all of them pass the tests.
>
> the error message of pypy-2.2.1 is:
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/runpy.py",
> line 151, in _run_module_as_main
> mod_name, loader, code, fname = _get_module_details(mod_name)
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/runpy.py",
> line 101, in _get_module_details
> loader = get_loader(mod_name)
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/pkgutil.py",
> line 465, in get_loader
> return find_loader(fullname)
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/pkgutil.py",
> line 475, in find_loader
> for importer in iter_importers(fullname):
>   File "/home/yahsuan/.pyenv/versions/pypy-2.2.1/lib-python/2.7/pkgutil.py",
> line 431, in iter_importers
> __import__(pkg)
>   File "pyspark/__init__.py", line 41, in <module>
> from pyspark.context import SparkContext
>   File "pyspark/context.py", line 26, in <module>
> from pyspark import accumulators
>   File "pyspark/accumulators.py", line 98, in <module>
> from pyspark.serializers import read_int, PickleSerializer
>   File "pyspark/serializers.py", line 400, in <module>
> _hijack_namedtuple()
>   File "pyspark/serializers.py", line 378, in _hijack_namedtuple
> _old_namedtuple = _copy_func(collections.namedtuple)
>   File "pyspark/serializers.py", line 376, in _copy_func
> f.__defaults__, f.__closure__)
> AttributeError: 'function' object has no attribute '__closure__'
>
> p.s. Would you want to test different pypy versions on your Jenkins? Maybe I
> could help.
>
> On Fri, Nov 6, 2015 at 2:23 AM, Josh Rosen  wrote:
>>
>> You could try running PySpark's own unit tests. Try ./python/run-tests
>> --help for instructions.
>>
>> On Thu, Nov 5, 2015 at 12:31 AM Chang Ya-Hsuan  wrote:
>>>
>>> I've tested the following pypy versions against spark-1.5.1:
>>>
>>>   pypy-2.2.1
>>>   pypy-2.3
>>>   pypy-2.3.1
>>>   pypy-2.4.0
>>>   pypy-2.5.0
>>>   pypy-2.5.1
>>>   pypy-2.6.0
>>>   pypy-2.6.1
>>>
>>> I run
>>>
>>> $ PYSPARK_PYTHON=/path/to/pypy-xx.xx/bin/pypy
>>> /path/to/spark-1.5.1/bin/pyspark
>>>
>>> and only pypy-2.2.1 failed.
>>>
>>> Any suggestions for running more advanced tests?
>>>
>>> On Thu, Nov 5, 2015 at 4:14 PM, Chang Ya-Hsuan 
>>> wrote:

 Thanks for your quick reply.

 I will test several pypy versions and report the result later.

 On Thu, Nov 5, 2015 at 4:06 PM, Josh Rosen  wrote:
>
> I noticed that you're using PyPy 2.2.1, but it looks like Spark 1.5.1's
> docs say that we only support PyPy 2.3+. Could you try using a newer PyPy
> version to see if that works?
>
> I just checked and it looks like our Jenkins tests are running against
> PyPy 2.5.1, so that version is known to work. I'm not sure what the actual
> minimum supported PyPy version is. Would you be interested in helping to
> investigate so that we can update the documentation or produce a fix to
> restore compatibility with earlier PyPy builds?
>
> On Wed, Nov 4, 2015 at 11:56 PM, Chang Ya-Hsuan 
> wrote:
>>
>> Hi all,
>>
>> I am trying to run pyspark with pypy; it works when using
>> spark-1.3.1 but fails when using spark-1.4.1 and spark-1.5.1.
>>
>> my pypy version:
>>
>> $ /usr/bin/pypy --version
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4]
>>
>> works with spark-1.3.1
>>
>> $ PYSPARK_PYTHON=/usr/bin/pypy
>> ~/Tool/spark-1.3.1-bin-hadoop2.6/bin/pyspark
>> Python 2.7.3 (2.2.1+dfsg-1ubuntu0.3, Sep 30 2015, 15:18:40)
>> [PyPy 2.2.1 with GCC 4.8.4] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> 15/11/05 15:50:30 WARN Utils: Your hostname, xx resolves to a
>> loopback address: 127.0.1.1; using xxx.xxx.xxx.xxx instead (on interface
>> eth0)
>> 15/11/05 15:50:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind
>> to another address
>> 15/11/05 15:50:31 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> Welcome 

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
Yes. And those have been happening too.


On Fri, Nov 13, 2015 at 12:09 PM, Ted Yu  wrote:

> Can I assume that for any particular test, if it passes reliably on 
> SparkPullRequestBuilder,
> it should pass on maven Jenkins ?
>
> If so, should flaky test(s) be disabled, strengthened and enabled again ?
>
> Cheers
>
> On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin  wrote:
>
>> It only runs tests that are impacted by the change. E.g. if you only
>> modify SQL, it won't run the core or streaming tests.
>>
>>
>> On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu  wrote:
>>
>>> Hi,
>>> I noticed that SparkPullRequestBuilder completes much faster than maven
>>> Jenkins build.
>>>
>>> From
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45871/consoleFull
>>> , I couldn't get exact time the builder started but looks like the duration
>>> was around 20 minutes.
>>>
>>> From
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/4099/console
>>> :
>>>
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] BUILD FAILURE
>>> [INFO] ------------------------------------------------------------------------
>>> [INFO] Total time: 01:42 h
>>>
>>>
>>> Can someone enlighten me on the sets of tests executed by 
>>> SparkPullRequestBuilder ?
>>>
>>>
>>> BTW I noticed that recent Jenkins builds were not in good shape:
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/
>>>
>>>
>>> Comments are welcome.
>>>
>>>
>>
>


Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Andrew Lee
Hi All,


I'm wondering if Spark 1.4.2 has been voted on by any chance, or if I have
overlooked it and we are targeting 1.4.3?


By looking at the JIRA

https://issues.apache.org/jira/browse/SPARK/fixforversion/12332833/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel


All issues were resolved and there are no blockers. Does anyone know what
happened to this release?



or was there any recommendation to skip that and ask users to use Spark 1.5.2 
instead?


Problem with Broadcast variable not deserialized

2015-11-13 Thread Federico Bertola

Hi,
 I have a simple Spark job that tries to broadcast an *Externalizable*
object across workers. A simple System.out confirms that the object is
only serialized with writeExternal and then deserialized without
readExternal. Personally, I think this is a bug. I'm using the default
JavaSerializer with Spark 1.5.*


Kind regards,

Federico.
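
For reference, a minimal reproduction sketch of the reported behavior, assuming
the default JavaSerializer and a real cluster (in local mode the driver-side
instance may be reused without serialization); the class and values are
illustrative:

    // Minimal broadcast-of-Externalizable repro sketch (illustrative; submit to a cluster).
    import java.io.{Externalizable, ObjectInput, ObjectOutput}
    import org.apache.spark.{SparkConf, SparkContext}

    class Payload(var value: Int) extends Externalizable {
      def this() = this(0)  // no-arg constructor required for Externalizable
      override def writeExternal(out: ObjectOutput): Unit = {
        println("writeExternal called")
        out.writeInt(value)
      }
      override def readExternal(in: ObjectInput): Unit = {
        println("readExternal called")
        value = in.readInt()
      }
    }

    object BroadcastExternalizableRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-externalizable-repro"))
        val bc = sc.broadcast(new Payload(42))
        // If readExternal never runs on the executors, the value comes back as
        // the no-arg default (0) instead of 42.
        println(sc.parallelize(1 to 2).map(_ => bc.value.value).collect().toSeq)
        sc.stop()
      }
    }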




Re: Seems jenkins is down (or very slow)?

2015-11-13 Thread Yin Huai
It was generally slow. But, after 5 or 10 minutes, it's all good.

On Fri, Nov 13, 2015 at 9:16 AM, shane knapp  wrote:

> were you hitting any particular URL when you noticed this, or was it
> generally slow?
>
> On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai  wrote:
> > Hi Guys,
> >
> > Seems Jenkins is down or very slow? Does anyone else experience it or
> just
> > me?
> >
> > Thanks,
> >
> > Yin
>


Re: A proposal for Spark 2.0

2015-11-13 Thread Mark Hamstra
Why does stabilization of those two features require a 1.7 release instead
of 1.6.1?

On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis 
wrote:

> We have veered off the topic of Spark 2.0 a little bit here - yes, we can
> talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to
> propose we have one more 1.x release after Spark 1.6. This will allow us to
> stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager.
>
> I understand our goal for Spark 2.0 is to offer an easy transition but
> there will be users that won't be able to seamlessly upgrade given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This might
> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis
>
>
>
> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao  wrote:
>
>> Agreed, more features/APIs/optimizations need to be added in DF/DS.
>>
>>
>>
>> I mean, we need to think about what kind of RDD APIs we have to provide
>> to developers; maybe the fundamental API is enough, like the ShuffledRDD
>> etc.  But PairRDDFunctions is probably not in this category, as we can do the
>> same thing easily with DF/DS, with even better performance.
>>
>>
>>
>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
>> *Sent:* Friday, November 13, 2015 11:23 AM
>> *To:* Stephen Boesch
>>
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Hmmm... to me, that seems like precisely the kind of thing that argues
>> for retaining the RDD API but not as the first thing presented to new Spark
>> developers: "Here's how to use groupBy with DataFrames Until the
>> optimizer is more fully developed, that won't always get you the best
>> performance that can be obtained.  In these particular circumstances, ...,
>> you may want to use the low-level RDD API while setting
>> preservesPartitioning to true.  Like this"
>>
>>
>>
>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch 
>> wrote:
>>
>> My understanding is that the RDDs presently have more support for
>> complete control of partitioning, which is a key consideration at scale.
>> While partitioning control is still piecemeal in DF/DS, it would seem
>> premature to make RDDs a second-tier approach to Spark dev.
>>
>>
>>
>> An example is the use of groupBy when we know that the source relation
>> (/RDD) is already partitioned on the grouping expressions.  AFAIK Spark SQL
>> still does not allow that knowledge to be applied to the optimizer - so
>> a full shuffle will be performed. However, in the native RDD we can use
>> preservesPartitioning=true.
>>
>>
>>
>> 2015-11-12 17:42 GMT-08:00 Mark Hamstra :
>>
>> The place of the RDD API in 2.0 is also something I've been wondering
>> about.  I think it may be going too far to deprecate it, but changing
>> emphasis is something that we might consider.  The RDD API came well before
>> DataFrames and DataSets, so programming guides, introductory how-to
>> articles and the like have, to this point, also tended to emphasize RDDs --
>> or at least to deal with them early.  What I'm thinking is that with 2.0
>> maybe we should overhaul all the documentation to de-emphasize and
>> reposition RDDs.  In this scheme, DataFrames and DataSets would be
>> introduced and fully addressed before RDDs.  They would be presented as the
>> normal/default/standard way to do things in Spark.  RDDs, in contrast,
>> would be presented later as a kind of lower-level, closer-to-the-metal API
>> that can be used in atypical, more specialized contexts where DataFrames or
>> DataSets don't fully fit.
>>
>>
>>
>> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao  wrote:
>>
>> I am not sure what the best practice is for this specific problem, but it's
>> really worth thinking about for 2.0, as it is a painful issue for lots of
>> users.
>>
>>
>>
>> By the way, is it also an opportunity to deprecate the RDD API (or make it
>> internal-only)? Lots of its functionality overlaps with
>> DataFrame or DataSet.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
>> *Sent:* Friday, November 13, 2015 5:27 AM
>> *To:* Nicholas Chammas
>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
>> Reynold Xin
>>
>>
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> I know we want to keep breaking changes to a minimum but I'm hoping that
>> with Spark 2.0 we can also look at better classpath isolation with user
>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>> setting it true by default, and not allow any spark transitive dependencies
>> to leak into user code. For backwards compatibility we can have a whitelist
>> if we want but it'd be 

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
In the interim, you can just build it off branch-1.4 if you want.


On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin  wrote:

> I actually tried to build a binary for 1.4.2 and wanted to start voting,
> but there was an issue with the release script that failed the jenkins job.
> Would be great to kick off a 1.4.2 release.
>
>
> On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee  wrote:
>
>> Hi All,
>>
>>
>> I'm wondering if Spark 1.4.2 has been voted on by any chance, or if I have
>> overlooked it and we are targeting 1.4.3?
>>
>>
>> By looking at the JIRA
>>
>>
>> https://issues.apache.org/jira/browse/SPARK/fixforversion/12332833/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>>
>>
>> All issues were resolved and there are no blockers. Does anyone know what
>> happened to this release?
>>
>>
>>
>> or was there any recommendation to skip that and ask users to use Spark
>> 1.5.2 instead?
>>
>
>


Re: [build system] short jenkins downtime tomorrow morning, 11-13-2015 @ 7am PST

2015-11-13 Thread shane knapp
this is happening now.

On Thu, Nov 12, 2015 at 12:14 PM, shane knapp  wrote:
> i will admit that it does seem like a bad idea to poke jenkins on
> friday the 13th, but there's a release that fixes a lot of security
> issues:
>
> https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-11-11
>
> i'll set jenkins to stop kicking off any new builds around 5am PST,
> and will upgrade and restart jenkins around 7am PST.  barring anything
> horrible happening, we should be back up and building by 730am.
>
> ...and this time, i promise not to touch any of the plugins.  :)
>
> shane
