Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all,

I am writing this email to both user-group and dev-group since this is
applicable to both.

I am now working on the Spark XML datasource (
https://github.com/databricks/spark-xml).
This uses an InputFormat implementation which I downgraded to Hadoop 1.x for
version compatibility.

However, I found that the internal JSON datasource and others at Databricks
use the Hadoop 2.x API, dealing with TaskAttemptContextImpl by looking up the
method via reflection, because TaskAttemptContext is a class in Hadoop 1.x and an
interface in Hadoop 2.x.
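For reference, the reflection workaround described above usually boils down to something
like the following minimal Scala sketch; the helper name is made up here, and the actual
Spark/Databricks code may differ in detail.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Hypothetical compatibility helper. Because TaskAttemptContext is a class in
// Hadoop 1.x and an interface in Hadoop 2.x, a direct call compiled against one
// version fails to link against the other, so the method is invoked reflectively.
object HadoopCompat {
  def getConfiguration(context: TaskAttemptContext): Configuration = {
    val method = context.getClass.getMethod("getConfiguration")
    method.invoke(context).asInstanceOf[Configuration]
  }
}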

So, I looked through the code for advantages of the Hadoop 2.x API, but I
couldn't find any.
I wonder if there are some advantages to using the Hadoop 2.x API.

I understand that it is still preferable to use the Hadoop 2.x APIs, at least
for future compatibility, but I am not sure it is worth using Hadoop 2.x if
that means invoking a method via reflection.

I would appreciate it if you could leave a comment here
https://github.com/databricks/spark-xml/pull/14 as well as send back a
reply if there is a good explanation.

Thanks!


Re: A proposal for Spark 2.0

2015-12-09 Thread kostas papageorgopoylos
Hi Kostas

With regards to your *second* point: I believe that requiring user apps to
explicitly declare their dependencies is the clearest API approach when it
comes to classpath and classloading.

However, what about the following API: *SparkContext.addJar(String
pathToJar)*. *Is this going to change or be affected in some way?*
Currently I use Spark 1.5.2 in a Java application and I have built a
utility class that finds the correct path of a dependency
(myPathOfTheJarDependency = something like SparkUtils.getJarFullPathFromClass
(EsSparkSQL.class, "^elasticsearch-hadoop-2.2.0-beta1.*\\.jar$");), which
is not beautiful but something I can live with.

Then I use *javaSparkContext.addJar(myPathOfTheJarDependency)* after I
have initialized the javaSparkContext. That way I do not require my Spark
cluster to have any classpath configuration for my application, and I
explicitly define the dependencies at runtime each time I initialize a
SparkContext.
I would be happy, and I believe many other users would be too, if I could
continue having the same or a similar approach with regards to dependencies.
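A simplified Scala sketch of the pattern described above follows; the object and method
names are illustrative, and the regex matching on the jar name used by the real utility
is omitted.

import java.io.File
import org.apache.spark.api.java.JavaSparkContext

// Illustrative helper: resolve the jar a class was loaded from and register it
// with the running context, so the cluster needs no up-front classpath setup.
object SparkJarUtils {
  def getJarFullPathFromClass(clazz: Class[_]): String = {
    // getCodeSource can be null for JDK/bootstrap classes; classes loaded from a
    // dependency jar normally report that jar's location here.
    val location = clazz.getProtectionDomain.getCodeSource.getLocation
    new File(location.toURI).getAbsolutePath
  }

  def addDependency(jsc: JavaSparkContext, clazz: Class[_]): Unit =
    jsc.addJar(getJarFullPathFromClass(clazz)) // executors fetch the jar at runtime
}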


Regards

2015-12-08 23:40 GMT+02:00 Kostas Sakellis :

> I'd also like to make it a requirement that Spark 2.0 have a stable
> dataframe and dataset API - we should not leave these APIs experimental in
> the 2.0 release. We already know of at least one breaking change we need to
> make to dataframes, now's the time to make any other changes we need to
> stabilize these APIs. Anything we can do to make us feel more comfortable
> about the dataset and dataframe APIs before the 2.0 release?
>
> I've also been thinking that in Spark 2.0, we might want to consider
> strict classpath isolation for user applications. Hadoop 3 is moving in
> this direction. We could, for instance, run all user applications in their
> own classloader that only inherits very specific classes from Spark (i.e.
> public APIs). This will require user apps to explicitly declare their
> dependencies as there won't be any accidental class leaking anymore. We do
> something like this for the *userClasspathFirst* option but it is not as strict
> as what I described. This is a breaking change but I think it will help
> with eliminating weird classpath incompatibility issues between user
> applications and Spark system dependencies.
>
> Thoughts?
>
> Kostas
>
>
> On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen  wrote:
>
>> To be clear-er, I don't think it's clear yet whether a 1.7 release
>> should exist or not. I could see both making sense. It's also not
>> really necessary to decide now, well before a 1.6 is even out in the
>> field. Deleting the version lost information, and I would not have
>> done that given my reply. Reynold maybe I can take this up with you
>> offline.
>>
>> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra 
>> wrote:
>> > Reynold's post from Nov. 25:
>> >
>> >> I don't think we should drop support for Scala 2.10, or make it harder in
>> >> terms of operations for people to upgrade.
>> >>
>> >> If there are no further objections, I'm going to remove the 1.7 version
>> >> and retarget things to 2.0 on JIRA.
>> >
>> >
>> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen  wrote:
>> >>
>> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
>> >> think that's premature. If there's a 1.7.0 then we've lost info about
>> >> what it would contain. It's trivial at any later point to merge the
>> >> versions. And, since things change and there's not a pressing need to
>> >> decide one way or the other, it seems fine to at least collect this
>> >> info like we have things like "1.4.3" that may never be released. I'd
>> >> like to add it back?
>> >>
>> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen  wrote:
>> >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
>> >> > is over-stretched now. This means that after 1.6 it's just small
>> >> > maintenance releases in 1.x and no substantial features or evolution.
>> >> > This means that the "in progress" APIs in 1.x will stay that way
>> >> > unless one updates to 2.x. It's not unreasonable, but means the update
>> >> > to the 2.x line isn't going to be that optional for users.
>> >> >
>> >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
>> >> > it for a couple years, note. 2.10 is still used today, but that's the
>> >> > point of the current stable 1.x release in general: if you want to
>> >> > stick to current dependencies, stick to the current release. Although
>> >> > I think that's the right way to think about support across major
>> >> > versions in general, I can see that 2.x is more of a required update
>> >> > for those following the project's fixes and releases. Hence may indeed
>> >> > be important to just keep supporting 2.10.
>> >> >
>> >> > I can't see supporting 2.12 at the same time 

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is a performance difference between the 1.x API and the 2.x API.

But it’s not a big issue for your change; only
com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java
needs to change, right?

It’s not a big change to move to the 2.x API. If you agree, I can do it, but I cannot
promise to finish within one or two weeks because of my daily job.
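For reference, a port to the Hadoop 2.x (org.apache.hadoop.mapreduce) API would follow
roughly the shape below. This is only a sketch with a made-up class name, delegating to
LineRecordReader where the real XmlInputFormat would implement its XML record-boundary
logic.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

// Skeleton of a "new API" input format: FileInputFormat supplies getSplits, so
// only createRecordReader needs to be provided.
class NewApiXmlInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    new RecordReader[LongWritable, Text] {
      // Placeholder delegate; an XML-aware reader would go here instead.
      private val delegate = new LineRecordReader()
      override def initialize(s: InputSplit, c: TaskAttemptContext): Unit =
        delegate.initialize(s, c)
      override def nextKeyValue(): Boolean = delegate.nextKeyValue()
      override def getCurrentKey(): LongWritable = delegate.getCurrentKey
      override def getCurrentValue(): Text = delegate.getCurrentValue
      override def getProgress(): Float = delegate.getProgress
      override def close(): Unit = delegate.close()
    }
}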





> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon  wrote:
> 
> Hi all, 
> 
> I am writing this email to both user-group and dev-group since this is 
> applicable to both.
> 
> I am now working on the Spark XML datasource
> (https://github.com/databricks/spark-xml).
> This uses an InputFormat implementation which I downgraded to Hadoop 1.x for
> version compatibility.
> 
> However, I found all the internal JSON datasource and others in Databricks 
> use Hadoop 2.x API dealing with TaskAttemptContextImpl by reflecting the 
> method for this because TaskAttemptContext is a class in Hadoop 1.x and an 
> interface in Hadoop 2.x.
> 
> So, I looked through the codes for some advantages for Hadoop 2.x API but I 
> couldn't.
> I wonder if there are some advantages for using Hadoop 2.x API.
> 
> I understand that it is still preferable to use Hadoop 2.x APIs at least for 
> future differences but somehow I feel like it might not have to use Hadoop 
> 2.x by reflecting a method.
> 
> I would appreciate that if you leave a comment here
> https://github.com/databricks/spark-xml/pull/14 as well as sending back a
> reply if there is a good explanation
> 
> Thanks! 



Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Thank you for your reply!

I have already made the change locally, so changing it is not a problem.

I just wanted to be sure which way is correct.
On 9 Dec 2015 18:20, "Fengdong Yu"  wrote:

> I don’t think there is performance difference between 1.x API and 2.x API.
>
> but it’s not a big issue for your change; only
> com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java
> needs to change, right?
>
> It’s not a big change to 2.x API. if you agree, I can do, but I cannot
> promise the time within one or two weeks because of my daily job.
>
>
>
>
>
> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon  wrote:
>
> Hi all,
>
> I am writing this email to both user-group and dev-group since this is
> applicable to both.
>
> I am now working on Spark XML datasource (
> https://github.com/databricks/spark-xml).
> This uses an InputFormat implementation which I downgraded to Hadoop 1.x
> for version compatibility.
>
> However, I found all the internal JSON datasource and others in Databricks
> use Hadoop 2.x API dealing with TaskAttemptContextImpl by reflecting the
> method for this because TaskAttemptContext is a class in Hadoop 1.x and an
> interface in Hadoop 2.x.
>
> So, I looked through the codes for some advantages for Hadoop 2.x API but
> I couldn't.
> I wonder if there are some advantages for using Hadoop 2.x API.
>
> I understand that it is still preferable to use Hadoop 2.x APIs at least
> for future differences but somehow I feel like it might not have to use
> Hadoop 2.x by reflecting a method.
>
> I would appreciate that if you leave a comment here
> https://github.com/databricks/spark-xml/pull/14 as well as sending back a
> reply if there is a good explanation
>
> Thanks!
>
>
>


SQL language vs DataFrame API

2015-12-09 Thread Cristian O
Hi,

I was wondering what the "official" view is on feature parity between SQL
and the DataFrame API. Docs are pretty sparse on the SQL front, and it seems that
some features are only supported, at various times, in only one of the Spark SQL
dialect, the HiveQL dialect, and the DF API. DF.cube(), DISTRIBUTE BY, and CACHE LAZY
are some examples.

Is there an explicit goal of having consistent support for all features in
both the DF API and SQL?

Thanks,
Cristian
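To make the gap concrete, here is a minimal Scala sketch (Spark 1.5-era API, with made-up
table and column names, and assuming a HiveContext-backed SQLContext so the HiveQL forms
parse): cube() is reachable from both sides, while DISTRIBUTE BY and CACHE LAZY TABLE are
currently only reachable as SQL text.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

// Illustration only; assumes a registered "sales" table with region/product/amount columns.
def parityExamples(sqlContext: SQLContext): Unit = {
  val df = sqlContext.table("sales")

  // Available in both the DataFrame API and HiveQL:
  val cubedDf  = df.cube("region", "product").agg(sum("amount"))
  val cubedSql = sqlContext.sql(
    "SELECT region, product, sum(amount) FROM sales GROUP BY region, product WITH CUBE")

  // Currently expressible only as SQL text:
  sqlContext.sql("SELECT * FROM sales DISTRIBUTE BY region")
  sqlContext.sql("CACHE LAZY TABLE sales")
}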


Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-09 Thread shane knapp
reminder!  this is happening tomorrow morning.

On Wed, Dec 2, 2015 at 7:20 PM, shane knapp  wrote:
> there's Yet Another Jenkins Security Advisory[tm], and a big release
> to patch it all coming out next wednesday.
>
> to that end i will be performing a jenkins update, as well as
> performing the work to resolve the following jira issue:
> https://issues.apache.org/jira/browse/SPARK-11255
>
> i will put jenkins in to quiet mode around 6am, start work around 7am
> and expect everything to be back up and building before 9am.  i'll
> post updates as things progress.
>
> please let me know ASAP if there's any problem with this schedule.
>
> shane




Re: SQL language vs DataFrame API

2015-12-09 Thread Michael Armbrust
I don't plan to abandon HiveQL compatibility, but I'd like to see us move
towards something with more SQL compliance (perhaps just newer versions of
the HiveQL parser).  Exactly which parser will do that for us is under
investigation.

On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li  wrote:

> Hi, Michael,
>
> Does that mean SqlContext will be built on HiveQL in the near future?
>
> Thanks,
>
> Xiao Li
>
>
> 2015-12-09 10:36 GMT-08:00 Michael Armbrust :
>
>> I think that it is generally good to have parity when the functionality
>> is useful.  However, in some cases various features are there just to
>> maintain compatibility with other systems.  For example CACHE TABLE is eager
>> because Shark's cache table was.  df.cache() is lazy because Spark's cache
>> is.  Does that mean that we need to add some eager caching mechanism to
>> dataframes to have parity?  Probably not, users can just call .count() if
>> they want to force materialization.
>>
>> Regarding the differences between HiveQL and the SQLParser, I think we
>> should get rid of the SQL parser.  It's kind of a hack that I built just so
>> that there was some SQL story for people who didn't compile with Hive.
>> Moving forward, I'd like to see the distinction between the HiveContext and
>> SQLContext removed and we can standardize on a single parser.  For this
>> reason I'd be opposed to spending a lot of dev/reviewer time on adding
>> features there.
>>
>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
>> cristian.b.op...@googlemail.com> wrote:
>>
>>> Hi,
>>>
>>> I was wondering what the "official" view is on feature parity between
>>> SQL and DF apis. Docs are pretty sparse on the SQL front, and it seems that
>>> some features are only supported at various times in only one of Spark SQL
>>> dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY
>>> are some examples
>>>
>>> Is there an explicit goal of having consistent support for all features
>>> in both DF and SQL ?
>>>
>>> Thanks,
>>> Cristian
>>>
>>
>>
>
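For reference, the eager-versus-lazy caching distinction quoted above comes down to the
following sketch (illustrative table name; assumes an existing SQLContext).

import org.apache.spark.sql.SQLContext

def cachingExample(sqlContext: SQLContext): Unit = {
  // SQL: CACHE TABLE materializes the cache immediately (eager, for Shark compatibility).
  sqlContext.sql("CACHE TABLE events")

  // DataFrame API: cache() only marks the plan as cacheable; nothing runs yet.
  val df = sqlContext.table("events")
  df.cache()

  // Forcing eager materialization is just an action away.
  df.count()
}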


Re: Fastest way to build Spark from scratch

2015-12-09 Thread Josh Rosen
Yeah, this is the same idea behind having Travis cache the ivy2 folder to
speed up builds. In Amplab Jenkins each individual build workspace has its
own individual Ivy cache which is preserved across build runs but which is
only used by one active run at a time in order to avoid SBT ivy lock
contention (this shouldn't be an issue in most environments though).
On Tue, Dec 8, 2015 at 10:32 AM Nicholas Chammas 
wrote:

> Interesting. As long as Spark's dependencies don't change that often, the
> same caches could save "from scratch" build time over many months of Spark
> development. Is that right?
>
>
> On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen 
> wrote:
>
>> @Nick, on a fresh EC2 instance a significant chunk of the initial build
>> time might be due to artifact resolution + downloading. Putting
>> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
>> decent chunk of time off that first build.
>>
>> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for the tips, Jakob and Steve.
>>>
>>> It looks like my original approach is the best for me since I'm
>>> installing Spark on newly launched EC2 instances and can't take advantage
>>> of incremental compilation.
>>>
>>> Nick
>>>
>>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran 
>>> wrote:
>>>
 On 7 Dec 2015, at 19:07, Jakob Odersky  wrote:

 make-distribution and the second code snippet both create a
 distribution from a clean state. They therefore require that every source
 file be compiled and that takes time (you can maybe tweak some settings or
 use a newer compiler to gain some speed).

 I'm inferring from your question that for your use-case deployment
 speed is a critical issue, furthermore you'd like to build Spark for lots
 of (every?) commit in a systematic way. In that case I would suggest you
 try using the second code snippet without the `clean` task and only resort
 to it if the build fails.

 On my local machine, an assembly without a clean drops from 6 minutes
 to 2.

 regards,
 --Jakob


 1. you can use zinc -where possible- to speed up scala compilations
 2. you might also consider setting up a local jenkins VM, hooked to
 whatever git repo & branch you are working off, and have it do the builds
 and tests for you. Not so great for interactive dev,

 finally, on the mac, the "say" command is pretty handy at letting you
 know when some work in a terminal is ready, so you can do the
 first-thing-in-the morning build-of-the-SNAPSHOTS

 mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say
 moo

 After that you can work on the modules you care about (via the -pl
 option). That doesn't work if you are running on an EC2 instance though




 On 23 November 2015 at 20:18, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Say I want to build a complete Spark distribution against Hadoop 2.6+
> as fast as possible from scratch.
>
> This is what I’m doing at the moment:
>
> ./make-distribution.sh -T 1C -Phadoop-2.6
>
> -T 1C instructs Maven to spin up 1 thread per available core. This
> takes around 20 minutes on an m3.large instance.
>
> I see that spark-ec2, on the other hand, builds Spark as follows
> when you deploy Spark at a specific git commit:
>
> sbt/sbt clean assembly
> sbt/sbt publish-local
>
> This seems slower than using make-distribution.sh, actually.
>
> Is there a faster way to do this?
>
> Nick
>



>>


Re: SQL language vs DataFrame API

2015-12-09 Thread Xiao Li
Hi, Michael,

Does that mean SqlContext will be built on HiveQL in the near future?

Thanks,

Xiao Li


2015-12-09 10:36 GMT-08:00 Michael Armbrust :

> I think that it is generally good to have parity when the functionality is
> useful.  However, in some cases various features are there just to maintain
> compatibility with other systems.  For example CACHE TABLE is eager because
> Shark's cache table was.  df.cache() is lazy because Spark's cache is.
> Does that mean that we need to add some eager caching mechanism to
> dataframes to have parity?  Probably not, users can just call .count() if
> they want to force materialization.
>
> Regarding the differences between HiveQL and the SQLParser, I think we
> should get rid of the SQL parser.  It's kind of a hack that I built just so
> that there was some SQL story for people who didn't compile with Hive.
> Moving forward, I'd like to see the distinction between the HiveContext and
> SQLContext removed and we can standardize on a single parser.  For this
> reason I'd be opposed to spending a lot of dev/reviewer time on adding
> features there.
>
> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
> cristian.b.op...@googlemail.com> wrote:
>
>> Hi,
>>
>> I was wondering what the "official" view is on feature parity between SQL
>> and DF apis. Docs are pretty sparse on the SQL front, and it seems that
>> some features are only supported at various times in only one of Spark SQL
>> dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY
>> are some examples
>>
>> Is there an explicit goal of having consistent support for all features
>> in both DF and SQL ?
>>
>> Thanks,
>> Cristian
>>
>
>


DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
hi,

I met the following exception when the driver program tried to recover from a
checkpoint. It looks like the logic relies on zeroTime being set, which doesn't
seem to happen here. Am I missing anything, or is it a bug in 1.4.1?

org.apache.spark.SparkException:
org.apache.spark.streaming.api.csharp.CSharpTransformed2DStream@161f0d27
has not been initialized
at
org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
at
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
at
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:92)
at
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:73)
at
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:596)
at
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:594)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.api.csharp.CSharpBackendHandler.handleMethodCall(CSharpBackendHandler.scala:145)
at
org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:90)
at
org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:25)
at
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:724)


RE: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Sun, Rui
Hi,

Just use "objectFile" instead of "objectFile[PipelineModel]" for callJMethod.
You can take the objectFile() in context.R as an example.

Since the SparkContext created in SparkR is actually a JavaSparkContext, there 
is no need to pass the implicit ClassTag.
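For reference, a small Scala sketch of the difference (paths and types are placeholders):
the Scala objectFile carries a ClassTag argument in byte code, while the Java one is
type-erased and needs none.

import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext

// Illustration only; "path/to/model" is a placeholder.
def loadExamples(sc: SparkContext, jsc: JavaSparkContext): Unit = {
  // Scala API: the type parameter becomes an extra ClassTag argument, which is what a
  // reflective caller such as callJMethod would otherwise have to construct and pass.
  val fromScala = sc.objectFile[AnyRef]("path/to/model", 2)(ClassTag.AnyRef)

  // Java API: no ClassTag in the signature, so invoking "objectFile" by name with just
  // the path and minimum partition count is enough.
  val fromJava = jsc.objectFile[AnyRef]("path/to/model", 2)
}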

-Original Message-
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] 
Sent: Thursday, December 10, 2015 8:21 AM
To: Chris Freeman
Cc: dev@spark.apache.org
Subject: Re: Specifying Scala types when calling methods from SparkR

The SparkR callJMethod can only invoke methods as they show up in the Java byte 
code. So in this case you'll need to check the SparkContext byte code (with 
javap or something like that) to see how that method looks. My guess is the 
type is passed in as a class tag argument, so you'll need to do something like 
create a class tag for the LinearRegressionModel and pass that in as the first 
or last argument etc.

Thanks
Shivaram

On Wed, Dec 9, 2015 at 10:11 AM, Chris Freeman  wrote:
> Hey everyone,
>
> I’m currently looking at ways to save out SparkML model objects from 
> SparkR and I’ve had some luck putting the model into an RDD and then 
> saving the RDD as an Object File. Once it’s saved, I’m able to load it 
> back in with something like:
>
> sc.objectFile[LinearRegressionModel](“path/to/model”)
>
> I’d like to try and replicate this same process from SparkR using the 
> JVM backend APIs (e.g. “callJMethod”), but so far I haven’t been able 
> to replicate my success and I’m guessing that it’s (at least in part) 
> due to the necessity of specifying the type when calling the objectFile 
> method.
>
> Does anyone know if this is actually possible? For example, here’s 
> what I’ve come up with so far:
>
> loadModel <- function(sc, modelPath) {
>   modelRDD <- SparkR:::callJMethod(sc,
>                                    "objectFile[PipelineModel]",
>                                    modelPath,
>                                    SparkR:::getMinPartitions(sc, NULL))
>   return(modelRDD)
> }
>
> Any help is appreciated!
>
> --
> Chris Freeman
>




Re: SQL language vs DataFrame API

2015-12-09 Thread Stephen Boesch
Is this a candidate for the version 1.X/2.0 split?

2015-12-09 16:29 GMT-08:00 Michael Armbrust :

> Yeah, I would like to address any actual gaps in functionality that are
> present.
>
> On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris  > wrote:
>
>> The reason I'm asking is because it's important in larger projects to be
>> able to stick to a particular programming style. Some people are more
>> comfortable with SQL, others might find the DF api more suitable, but it's
>> important to have full expressivity in both to make it easier to adopt one
>> approach rather than have to mix and match to achieve full functionality.
>>
>> On 9 December 2015 at 19:41, Xiao Li  wrote:
>>
>>> That sounds great! When it is decided, please let us know and we can add
>>> more features and make it ANSI SQL compliant.
>>>
>>> Thank you!
>>>
>>> Xiao Li
>>>
>>>
>>> 2015-12-09 11:31 GMT-08:00 Michael Armbrust :
>>>
 I don't plan to abandon HiveQL compatibility, but I'd like to see us
 move towards something with more SQL compliance (perhaps just newer
 versions of the HiveQL parser).  Exactly which parser will do that for us
 is under investigation.

 On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li  wrote:

> Hi, Michael,
>
> Does that mean SqlContext will be built on HiveQL in the near future?
>
> Thanks,
>
> Xiao Li
>
>
> 2015-12-09 10:36 GMT-08:00 Michael Armbrust :
>
>> I think that it is generally good to have parity when the
>> functionality is useful.  However, in some cases various features are 
>> there
>> just to maintain compatibility with other systems.  For example CACHE
>> TABLE
>> is eager because Shark's cache table was.  df.cache() is lazy because
>> Spark's cache is.  Does that mean that we need to add some eager caching
>> mechanism to dataframes to have parity?  Probably not, users can just 
>> call
>> .count() if they want to force materialization.
>>
>> Regarding the differences between HiveQL and the SQLParser, I think
>> we should get rid of the SQL parser.  It's kind of a hack that I built
>> just
>> so that there was some SQL story for people who didn't compile with Hive.
>> Moving forward, I'd like to see the distinction between the HiveContext 
>> and
>> SQLContext removed and we can standardize on a single parser.  For this
>> reason I'd be opposed to spending a lot of dev/reviewer time on adding
>> features there.
>>
>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
>> cristian.b.op...@googlemail.com> wrote:
>>
>>> Hi,
>>>
>>> I was wondering what the "official" view is on feature parity
>>> between SQL and DF apis. Docs are pretty sparse on the SQL front, and it
>>> seems that some features are only supported at various times in only 
>>> one of
>>> Spark SQL dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY,
>>> CACHE LAZY are some examples
>>>
>>> Is there an explicit goal of having consistent support for all
>>> features in both DF and SQL ?
>>>
>>> Thanks,
>>> Cristian
>>>
>>
>>
>

>>>
>>
>


Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Never mind; one of my peers corrected the driver program for me: all DStream
operations need to be within the scope of the getOrCreate API.
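For reference, the corrected structure looks roughly like the sketch below: the whole
DStream graph (createDirectStream and foreachRDD) is built inside the factory passed to
getOrCreate, so that on recovery the restored graph matches the checkpointed one. The
broker list, topic, and checkpoint path are placeholders, as in the original program.

import kafka.serializer.DefaultDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ScalaSMLFixed {
  def main(args: Array[String]): Unit = {
    val checkpointPath = "hdfs://.../checkpoint/ScalaSML/HK2"

    def createContext(): StreamingContext = {
      val sparkConf = new SparkConf().setAppName("ScalaSML")
      val context = new StreamingContext(sparkConf, Seconds(60))
      context.checkpoint(checkpointPath)

      // All DStream operations are declared here, inside the creating function.
      val kafkaParams = Map("metadata.broker.list" -> "...",
        "auto.offset.reset" -> "largest")
      val topics = Set("topic")
      val ds = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
        DefaultDecoder, DefaultDecoder](context, kafkaParams, topics)
      ds.foreachRDD((rdd, time) => println("Time: " + time + " Count: " + rdd.count()))

      context
    }

    val ssc = StreamingContext.getOrCreate(checkpointPath, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}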

On Wed, Dec 9, 2015 at 3:32 PM, Renyi Xiong  wrote:

> following scala program throws same exception, I know people are running
> streaming jobs against kafka, I must be missing something. any idea why?
>
> package org.apache.spark.streaming.api.csharp
>
> import java.util.HashMap
>
> import kafka.serializer.{DefaultDecoder, Decoder, StringDecoder}
>
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka._
> import org.apache.spark.SparkConf
>
> object ScalaSML {
>   def main(args: Array[String]) {
>
>  val checkpointPath =
> "hdfs://SparkMasterVIP.AdsOISCP-Sandbox-Ch1d.CH1D.ap.gbl/checkpoint/ScalaSML/HK2"
> val sparkConf = new SparkConf().setAppName("ScalaSML")
> val ssc = StreamingContext.getOrCreate(checkpointPath, () => {
>   val context = new StreamingContext(sparkConf, Seconds(60))
>   context.checkpoint(checkpointPath)
>   context
>   })
>
>  val kafkaParams = Map("metadata.broker.list" ->  "...",
>   "auto.offset.reset" -> "largest")
>
>  val topics = Set("topic")
>
> val ds = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
> DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics)
>  ds.foreachRDD((rdd, time) => println("Time: " + time + " Count: " +
> rdd.count()))
>
> ssc.start()
> ssc.awaitTermination()
>   }
> }
>
> 15/12/09 15:22:43 ERROR StreamingContext: Error starting the context,
> marking it as stopped
> org.apache.spark.SparkException:
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@15ce2c0 has not
> been initialized
> at
> org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321)
> at
> org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:83)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
> at
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at
> scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:92)
> at
> org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:73)
> at
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:593)
> at
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:591)
> at
> org.apache.spark.streaming.api.csharp.ScalaSML$.main(ScalaSML.scala:48)
> at
> org.apache.spark.streaming.api.csharp.ScalaSML.main(ScalaSML.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> On 

Re: Specifying Scala types when calling methods from SparkR

2015-12-09 Thread Shivaram Venkataraman
The SparkR callJMethod can only invoke methods as they show up in the
Java byte code. So in this case you'll need to check the SparkContext
byte code (with javap or something like that) to see how that method
looks. My guess is the type is passed in as a class tag argument, so
you'll need to do something like create a class tag for the
LinearRegressionModel and pass that in as the first or last argument
etc.

Thanks
Shivaram

On Wed, Dec 9, 2015 at 10:11 AM, Chris Freeman  wrote:
> Hey everyone,
>
> I’m currently looking at ways to save out SparkML model objects from SparkR
> and I’ve had some luck putting the model into an RDD and then saving the RDD
> as an Object File. Once it’s saved, I’m able to load it back in with
> something like:
>
> sc.objectFile[LinearRegressionModel](“path/to/model”)
>
> I’d like to try and replicate this same process from SparkR using the JVM
> backend APIs (e.g. “callJMethod”), but so far I haven’t been able to
> replicate my success and I’m guessing that it’s (at least in part) due to
> the necessity of specifying the type when calling the objectFile method.
>
> Does anyone know if this is actually possible? For example, here’s what I’ve
> come up with so far:
>
> loadModel <- function(sc, modelPath) {
>   modelRDD <- SparkR:::callJMethod(sc,
>                                    "objectFile[PipelineModel]",
>                                    modelPath,
>                                    SparkR:::getMinPartitions(sc, NULL))
>   return(modelRDD)
> }
>
> Any help is appreciated!
>
> --
> Chris Freeman
>




Re: SQL language vs DataFrame API

2015-12-09 Thread Michael Armbrust
Yeah, I would like to address any actual gaps in functionality that are
present.

On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris 
wrote:

> The reason I'm asking is because it's important in larger projects to be
> able to stick to a particular programming style. Some people are more
> comfortable with SQL, others might find the DF api more suitable, but it's
> important to have full expressivity in both to make it easier to adopt one
> approach rather than have to mix and match to achieve full functionality.
>
> On 9 December 2015 at 19:41, Xiao Li  wrote:
>
>> That sounds great! When it is decided, please let us know and we can add
>> more features and make it ANSI SQL compliant.
>>
>> Thank you!
>>
>> Xiao Li
>>
>>
>> 2015-12-09 11:31 GMT-08:00 Michael Armbrust :
>>
>>> I don't plan to abandon HiveQL compatibility, but I'd like to see us
>>> move towards something with more SQL compliance (perhaps just newer
>>> versions of the HiveQL parser).  Exactly which parser will do that for us
>>> is under investigation.
>>>
>>> On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li  wrote:
>>>
 Hi, Michael,

 Does that mean SqlContext will be built on HiveQL in the near future?

 Thanks,

 Xiao Li


 2015-12-09 10:36 GMT-08:00 Michael Armbrust :

> I think that it is generally good to have parity when the
> functionality is useful.  However, in some cases various features are 
> there
> just to maintain compatibility with other systems.  For example CACHE TABLE
> is eager because Shark's cache table was.  df.cache() is lazy because
> Spark's cache is.  Does that mean that we need to add some eager caching
> mechanism to dataframes to have parity?  Probably not, users can just call
> .count() if they want to force materialization.
>
> Regarding the differences between HiveQL and the SQLParser, I think we
> should get rid of the SQL parser.  It's kind of a hack that I built just so
> that there was some SQL story for people who didn't compile with Hive.
> Moving forward, I'd like to see the distinction between the HiveContext 
> and
> SQLContext removed and we can standardize on a single parser.  For this
> reason I'd be opposed to spending a lot of dev/reviewer time on adding
> features there.
>
> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
> cristian.b.op...@googlemail.com> wrote:
>
>> Hi,
>>
>> I was wondering what the "official" view is on feature parity between
>> SQL and DF apis. Docs are pretty sparse on the SQL front, and it seems 
>> that
>> some features are only supported at various times in only one of Spark 
>> SQL
>> dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY
>> are some examples
>>
>> Is there an explicit goal of having consistent support for all
>> features in both DF and SQL ?
>>
>> Thanks,
>> Cristian
>>
>
>

>>>
>>
>


Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
The following Scala program throws the same exception. I know people are running
streaming jobs against Kafka, so I must be missing something. Any idea why?

package org.apache.spark.streaming.api.csharp

import java.util.HashMap

import kafka.serializer.{DefaultDecoder, Decoder, StringDecoder}

import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf

object ScalaSML {
  def main(args: Array[String]) {

    val checkpointPath =
      "hdfs://SparkMasterVIP.AdsOISCP-Sandbox-Ch1d.CH1D.ap.gbl/checkpoint/ScalaSML/HK2"
    val sparkConf = new SparkConf().setAppName("ScalaSML")
    val ssc = StreamingContext.getOrCreate(checkpointPath, () => {
      val context = new StreamingContext(sparkConf, Seconds(60))
      context.checkpoint(checkpointPath)
      context
    })

    val kafkaParams = Map("metadata.broker.list" -> "...",
      "auto.offset.reset" -> "largest")

    val topics = Set("topic")

    val ds = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
      DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics)
    ds.foreachRDD((rdd, time) => println("Time: " + time + " Count: " +
      rdd.count()))

    ssc.start()
    ssc.awaitTermination()
  }
}

15/12/09 15:22:43 ERROR StreamingContext: Error starting the context,
marking it as stopped
org.apache.spark.SparkException:
org.apache.spark.streaming.kafka.DirectKafkaInputDStream@15ce2c0 has not
been initialized
at
org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321)
at
org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:83)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
at
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:120)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at
scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:120)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:227)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:222)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:222)
at
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:92)
at
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:73)
at
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:593)
at
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:591)
at
org.apache.spark.streaming.api.csharp.ScalaSML$.main(ScalaSML.scala:48)
at
org.apache.spark.streaming.api.csharp.ScalaSML.main(ScalaSML.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

On Wed, Dec 9, 2015 at 12:45 PM, Renyi Xiong  wrote:

> hi,
>
> I met following exception when the driver program tried to recover from
> checkpoint, looks like the logic relies on zeroTime being set which doesn't
> seem to happen here. am I missing anything or is it a bug in 1.4.1?
>
> org.apache.spark.SparkException:
> org.apache.spark.streaming.api.csharp.CSharpTransformed2DStream@161f0d27
> has not 

Re: SQL language vs DataFrame API

2015-12-09 Thread Xiao Li
That sounds great! When it is decided, please let us know and we can add
more features and make it ANSI SQL compliant.

Thank you!

Xiao Li


2015-12-09 11:31 GMT-08:00 Michael Armbrust :

> I don't plan to abandon HiveQL compatibility, but I'd like to see us move
> towards something with more SQL compliance (perhaps just newer versions of
> the HiveQL parser).  Exactly which parser will do that for us is under
> investigation.
>
> On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li  wrote:
>
>> Hi, Michael,
>>
>> Does that mean SqlContext will be built on HiveQL in the near future?
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>> 2015-12-09 10:36 GMT-08:00 Michael Armbrust :
>>
>>> I think that it is generally good to have parity when the functionality
>>> is useful.  However, in some cases various features are there just to
>>> maintain compatibility with other systems.  For example CACHE TABLE is eager
>>> because Shark's cache table was.  df.cache() is lazy because Spark's cache
>>> is.  Does that mean that we need to add some eager caching mechanism to
>>> dataframes to have parity?  Probably not, users can just call .count() if
>>> they want to force materialization.
>>>
>>> Regarding the differences between HiveQL and the SQLParser, I think we
>>> should get rid of the SQL parser.  It's kind of a hack that I built just so
>>> that there was some SQL story for people who didn't compile with Hive.
>>> Moving forward, I'd like to see the distinction between the HiveContext and
>>> SQLContext removed and we can standardize on a single parser.  For this
>>> reason I'd be opposed to spending a lot of dev/reviewer time on adding
>>> features there.
>>>
>>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
>>> cristian.b.op...@googlemail.com> wrote:
>>>
 Hi,

 I was wondering what the "official" view is on feature parity between
 SQL and DF apis. Docs are pretty sparse on the SQL front, and it seems that
 some features are only supported at various times in only one of Spark SQL
 dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY, CACHE LAZY
 are some examples

 Is there an explicit goal of having consistent support for all features
 in both DF and SQL ?

 Thanks,
 Cristian

>>>
>>>
>>
>


Re: [build system] jenkins downtime, thursday 12/10/15 7am PDT

2015-12-09 Thread shane knapp
here's the security advisory for the update:
https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2015-12-09

On Wed, Dec 9, 2015 at 9:55 AM, shane knapp  wrote:
> reminder!  this is happening tomorrow morning.
>
> On Wed, Dec 2, 2015 at 7:20 PM, shane knapp  wrote:
>> there's Yet Another Jenkins Security Advisory[tm], and a big release
>> to patch it all coming out next wednesday.
>>
>> to that end i will be performing a jenkins update, as well as
>> performing the work to resolve the following jira issue:
>> https://issues.apache.org/jira/browse/SPARK-11255
>>
>> i will put jenkins in to quiet mode around 6am, start work around 7am
>> and expect everything to be back up and building before 9am.  i'll
>> post updates as things progress.
>>
>> please let me know ASAP if there's any problem with this schedule.
>>
>> shane
