Re: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Yi Zhang
@Hao, Because the query joins more than one table, if I register the data frame 
as a temp table, Spark can't distinguish which table is correct. I don't know how 
to set dbtable and register the temp table. 
Any suggestion? 
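
For reference, a rough sketch of the per-table registration I was attempting (assuming Spark 1.3's JDBC data source API and an existing sqlContext; table and column names come from the mock schema quoted below, and whether the correlated subquery then works in sqlContext.sql() is exactly my question):

// Load each MySQL table as its own DataFrame through the JDBC source.
val employees = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> "mock_employees"))
val locations = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> "mock_locations"))

// Register each one under its own name so Spark SQL can resolve both tables.
employees.registerTempTable("mock_employees")
locations.registerTempTable("mock_locations")

// The plain join resolves fine; the correlated subquery is the open question.
val joined = sqlContext.sql(
  """select t1._name as employeeName, t2._name as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id""".stripMargin)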


 On Friday, May 15, 2015 1:38 PM, Cheng, Hao hao.ch...@intel.com wrote:
   

You need to register the “dataFrame” as a table first and then do queries on it? 
Do you mean that also failed?

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 1:10 PM
To: Yi Zhang; Dev
Subject: Re: Does Spark SQL (JDBC) support nest select with current version

If I pass the whole statement as dbtable to sqlContext.load() method as below:

val query =
  """
    (select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory  t2._max_price) EMP
  """.stripMargin
val dataFrame = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> query
))

It works. However, I can't invoke sql() method to solve this problem. And why?


 On Friday, May 15, 2015 11:33 AM, Yi Zhang zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:

select t1._salory as salory,
t1._name as employeeName,
(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
from mock_employees t1
inner join mock_locations t2
on t1._location_id = t2._id
where t1._salory  t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in 
predicates - ASF JIRA is still in progress. And somebody commented that Spark 1.3 
would support it. So I don't know the current status of this feature. Thanks.

Regards,
Yi

Re: How to link code pull request with JIRA ID?

2015-05-14 Thread Patrick Wendell
Yeah I wrote the original script and I intentionally made it easy for
other projects to use (you'll just need to tweak some variables at the
top). You just need somewhere to run it... we were using a jenkins
cluster to run it every 5 minutes.

BTW - I looked and there is one instance where it hard codes the
string SPARK-, but that should be easy to change. I'm happy to
review a patch that makes that prefix a variable.

https://github.com/apache/spark/blob/master/dev/github_jira_sync.py#L71

- Patrick

On Thu, May 14, 2015 at 8:45 AM, Josh Rosen rosenvi...@gmail.com wrote:
 Spark PRs didn't always handle the JIRA linking.  We used to rely
 on a Jenkins job that ran
 https://github.com/apache/spark/blob/master/dev/github_jira_sync.py.  We
 switched this over to Spark PRs at a time when the Jenkins GitHub Pull
 Request Builder plugin was having flakiness issues, but as far as I know
 that old script should still work.

 On Wed, May 13, 2015 at 9:40 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 There's no magic to it. We're doing the same, except Josh automated it in
 the PR dashboard he created.

 https://spark-prs.appspot.com/

 Nick

 On Wed, May 13, 2015 at 6:20 PM Markus Weimer mar...@weimo.de wrote:

  Hi,
 
  how did you set this up? Over in the REEF incubation project, we
  painstakingly create the forwards- and backwards links despite having
  the IDs in the PR descriptions...
 
  Thanks!
 
  Markus
 
 
  On 2015-05-13 11:56, Ted Yu wrote:
   Subproject tag should follow SPARK JIRA number.
   e.g.
  
   [SPARK-5277][SQL] ...
  
   Cheers
  
   On Wed, May 13, 2015 at 11:50 AM, Stephen Boesch java...@gmail.com
  wrote:
  
   following up from Nicholas, it is
  
   [SPARK-12345] Your PR description
  
   where 12345 is the jira number.
  
  
   One thing I tend to forget is when/where to include the subproject tag
  e.g.
[MLLIB]
  
  
   2015-05-13 11:11 GMT-07:00 Nicholas Chammas 
 nicholas.cham...@gmail.com
  :
  
   That happens automatically when you open a PR with the JIRA key in
 the
  PR
   title.
  
   On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar 
   shekhar.kote...@gmail.com wrote:
  
   Hi,
  
   I am new to open source contribution and trying to understand the
   process
   starting from pulling code to uploading patch.
  
   I have managed to pull code from GitHub. In JIRA I saw that each
 JIRA
   issue
   is connected with pull request. I would like to know how do people
   attach
   pull request details to JIRA issue?
  
   Thanks,
   Chandrash3khar Kotekar
   Mobile - +91 8600011455
  
  
  
  
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to link code pull request with JIRA ID?

2015-05-14 Thread Josh Rosen
Spark PRs didn't always handle the JIRA linking.  We used to rely
on a Jenkins job that ran
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py.  We
switched this over to Spark PRs at a time when the Jenkins GitHub Pull
Request Builder plugin was having flakiness issues, but as far as I know
that old script should still work.

On Wed, May 13, 2015 at 9:40 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 There's no magic to it. We're doing the same, except Josh automated it in
 the PR dashboard he created.

 https://spark-prs.appspot.com/

 Nick

 On Wed, May 13, 2015 at 6:20 PM Markus Weimer mar...@weimo.de wrote:

  Hi,
 
  how did you set this up? Over in the REEF incubation project, we
  painstakingly create the forwards- and backwards links despite having
  the IDs in the PR descriptions...
 
  Thanks!
 
  Markus
 
 
  On 2015-05-13 11:56, Ted Yu wrote:
   Subproject tag should follow SPARK JIRA number.
   e.g.
  
   [SPARK-5277][SQL] ...
  
   Cheers
  
   On Wed, May 13, 2015 at 11:50 AM, Stephen Boesch java...@gmail.com
  wrote:
  
   following up from Nicholas, it is
  
   [SPARK-12345] Your PR description
  
   where 12345 is the jira number.
  
  
   One thing I tend to forget is when/where to include the subproject tag
  e.g.
[MLLIB]
  
  
   2015-05-13 11:11 GMT-07:00 Nicholas Chammas 
 nicholas.cham...@gmail.com
  :
  
   That happens automatically when you open a PR with the JIRA key in
 the
  PR
   title.
  
   On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar 
   shekhar.kote...@gmail.com wrote:
  
   Hi,
  
   I am new to open source contribution and trying to understand the
   process
   starting from pulling code to uploading patch.
  
   I have managed to pull code from GitHub. In JIRA I saw that each
 JIRA
   issue
   is connected with pull request. I would like to know how do people
   attach
   pull request details to JIRA issue?
  
   Thanks,
   Chandrash3khar Kotekar
   Mobile - +91 8600011455
  
  
  
  
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Error recovery strategies using the DirectKafkaInputDStream

2015-05-14 Thread badgerpants
We've been using the new DirectKafkaInputDStream to implement an exactly-once
processing solution that tracks the provided offset ranges within the same
transaction that persists our data results. When an exception is thrown
within the processing loop and the configured number of retries is
exhausted, the stream will skip to the end of the failed range of offsets and
continue on with the next RDD. 

Makes sense, but we're wondering how others would handle recovering from
failures. In our case the cause of the exception was a temporary outage of a
needed service. Since the transaction rolled back at the point of failure,
our offset tracking table retained the correct offsets, so we simply
needed to restart the Spark process, whereupon it happily picked up at the
correct point and continued. Short of a restart, do people have any good
ideas for how we might recover?

FWIW, we've looked at setting the spark.task.maxFailures param to a large value
and looked for a property that would increase the wait between attempts.
This might mitigate the issue when the availability problem is short-lived
but wouldn't completely eliminate the need to restart.
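
For what it's worth, the maxFailures part is just a config knob; a minimal sketch (assuming the standard SparkConf API, with an arbitrary retry count):

import org.apache.spark.SparkConf

// Raise the per-task retry limit so transient outages of the downstream
// service are retried more times before the stage fails and the offset
// range is skipped.
val conf = new SparkConf()
  .setAppName("direct-kafka-etl")
  .set("spark.task.maxFailures", "16") // default is 4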

Any thoughts, ideas welcome.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Error-recovery-strategies-using-the-DirectKafkaInputDStream-tp12258.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-05-14 Thread will-ob
Hey Cody (et. al.),

Few more questions related to this. It sounds like our missing data issues
appear fixed with this approach. Could you shed some light on a few
questions that came up?

-

Processing our data inside a single foreachPartition function appears to be
very different from the pattern seen in the programming guide. Does this
become problematic with additional, interleaved reduce/filter/map steps?

```
# typical?
rdd
  .map { ... }
  .reduce { ... }
  .filter { ... }
  .reduce { ... }
  .foreachRdd { writeToDb }

# with foreachPartition
rdd.foreachPartition { case (iter) =>
  iter
.map { ... }
.reduce { ... }
.filter { ... }
.reduce { ... }
}

```
-

Could the above be simplified by having

one kafka partition per DStream, rather than
one kafka partition per RDD partition

?

That way, we wouldn't need to do our processing inside each partition as
there would only be one set of kafka metadata to commit.

Presumably, one could `join` DStreams when topic-level aggregates were
needed.

It seems this was the approach of Michael Noll in his blog post.
(http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/)
Although, his primary motivation appears to be maintaining high-throughput /
parallelism rather than kafka metadata.

-

From the blog post:

... there is no long-running receiver task that occupies a core per stream
regardless of what the message volume is.

Is this because data is retrieved by polling rather than maintaining a
socket? Is it still the case that there is only one receiver process per
DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1
.. else discover the machine's NIC limit?

Can you think of a reason not to do this? Cluster utilization, or the like,
perhaps?



And seems a silly question, but does `foreachPartition` guarantee that a
single worker will process the passed function? Or might two workers split
the work?

Eg. foreachPartition(f)

Worker 1: f( Iterator[partition 1 records 1 - 50] )
Worker 2: f( Iterator[partition 1 records 51 - 100] )

It is unclear from the scaladocs
(https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD).
But you can imagine, if it is critical that this data be committed in a
single transaction, that two workers will have issues.



-- Will O




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/practical-usage-of-the-new-exactly-once-supporting-DirectKafkaInputDStream-tp11916p12257.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-05-14 Thread Cody Koeninger
Sorry, realized I probably didn't fully answer your question about my blog
post, as opposed to Michael Noll's.

The direct stream is really blunt: a given RDD partition is just a kafka
topic/partition and an upper/lower bound for the range of offsets.  When
an executor computes the partition, it connects to kafka and pulls only
those messages, then closes the connection.  There's no long-running
receiver at all, and no caching of connections (I found caching sockets didn't
matter much).
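
(For anyone following along, a minimal sketch of creating such a stream, assuming Spark 1.3's KafkaUtils, an existing StreamingContext `ssc`, and placeholder broker/topic names:)

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// One RDD partition per Kafka topic/partition; no long-running receiver.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)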

You get much better cluster utilization that way, because if a partition is
relatively small compared to the others in the RDD, the executor gets done
with it and gets scheduled another one to work on.  With long-running
receivers, Spark acts like the receiver takes up a core even if it isn't
doing much.  Look at the CPU graph on slide 13 of the link I posted.

On Thu, May 14, 2015 at 4:21 PM, Cody Koeninger c...@koeninger.org wrote:

 If the transformation you're trying to do really is per-partition, it
 shouldn't matter whether you're using scala methods or spark methods.  The
 parallel speedup you're getting is all from doing the work on multiple
 machines, and shuffle or caching or other benefits of spark aren't a factor.

 If using scala methods bothers you, do all of your transformation using
 spark methods, collect the results back to the driver, and save them with
 the offsets there:

 stream.foreachRDD { rdd =>
   val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   val results = rdd.some.chain.of.spark.calls.collect
   save(offsets, results)
 }

 My work-in-progress slides for my talk at the upcoming spark conference
 are here

 http://koeninger.github.io/kafka-exactly-once/

 if that clarifies that point a little bit (slides 20 vs 21)

 The direct stream doesn't use long-running receivers, so the concerns that
 blog post is trying to address don't really apply.

 Under normal operation a given partition of an rdd is only going to be
 handled by a single executor at a time (as long as you don't turn on
 speculative execution... or I suppose it might be possible in some kind of
 network partition situation).  Transactionality should save you even if
 something weird happens though.
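
 (A rough sketch of that per-partition transactional pattern, continuing from the direct-stream snippet above — `transform` and `saveResultsAndOffsets` are hypothetical application-side helpers, the latter writing the results and the partition's offset range in a single database transaction:)

 import org.apache.spark.TaskContext
 import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

 stream.foreachRDD { rdd =>
   // Capture the Kafka offset ranges before any shuffle changes partitioning.
   val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

   rdd.foreachPartition { iter =>
     // For the direct stream, RDD partition i corresponds to offsetRanges(i).
     val range = offsetRanges(TaskContext.get.partitionId)
     val results = iter.map(transform).toSeq
     // Hypothetical helper: commit results and (topic, partition, untilOffset)
     // atomically, so a replay after failure is idempotent.
     saveResultsAndOffsets(results, range)
   }
 }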

 On Thu, May 14, 2015 at 3:44 PM, will-ob will.obr...@tapjoy.com wrote:

 Hey Cody (et. al.),

 Few more questions related to this. It sounds like our missing data issues
 appear fixed with this approach. Could you shed some light on a few
 questions that came up?

 -

 Processing our data inside a single foreachPartition function appears to
 be
 very different from the pattern seen in the programming guide. Does this
 become problematic with additional, interleaved reduce/filter/map steps?

 ```
 # typical?
 rdd
   .map { ... }
   .reduce { ... }
   .filter { ... }
   .reduce { ... }
   .foreachRdd { writeToDb }

 # with foreachPartition
 rdd.foreachPartition { case (iter) =>
   iter
 .map { ... }
 .reduce { ... }
 .filter { ... }
 .reduce { ... }
 }

 ```
 -

 Could the above be simplified by having

 one kafka partition per DStream, rather than
 one kafka partition per RDD partition

 ?

 That way, we wouldn't need to do our processing inside each partition as
 there would only be one set of kafka metadata to commit.

 Presumably, one could `join` DStreams when topic-level aggregates were
 needed.

 It seems this was the approach of Michael Noll in his blog post.
 (
 http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
 )
 Although, his primary motivation appears to be maintaining
 high-throughput /
 parallelism rather than kafka metadata.

 -

 From the blog post:

 ... there is no long-running receiver task that occupies a core per
 stream
 regardless of what the message volume is.

 Is this because data is retrieved by polling rather than maintaining a
 socket? Is it still the case that there is only one receiver process per
 DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1
 .. else discover the machine's NIC limit?

 Can you think of a reason not to do this? Cluster utilization, or the
 like,
 perhaps?

 

 And seems a silly question, but does `foreachPartition` guarantee that a
 single worker will process the passed function? Or might two workers split
 the work?

 Eg. foreachPartition(f)

 Worker 1: f( Iterator[partition 1 records 1 - 50] )
 Worker 2: f( Iterator[partition 1 records 51 - 100] )

 It is unclear from the scaladocs
 (
 https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 ).
 But you can imagine, if it is critical that this data be committed in a
 single transaction, that two workers will have issues.



 -- Will O




 --
 View this message in context:
 

Re: s3 vfs on Mesos Slaves

2015-05-14 Thread Haoyuan Li
Another way is to configure S3 as Tachyon's under storage system, and then
run Spark on Tachyon.

More info: http://tachyon-project.org/Setup-UFS.html

Best,

Haoyuan

On Wed, May 13, 2015 at 10:52 AM, Stephen Carman scar...@coldlight.com
wrote:

 Thank you for the suggestions. The problem is that we need to
 initialize the vfs s3 driver, so what you suggested, Akhil, wouldn’t fix the
 problem.

 Basically a job is submitted to the cluster and it tries to pull down the
 data from s3, but fails because the s3 uri hasn’t been initialized in the
 vfs and it doesn’t know how to handle the URI.

 What I’m asking is, how do we, before the job is run, run some
 bootstrapping or setup code that will let us do this initialization or
 configuration step for the vfs, so that when it executes the job
 it has the information it needs to be able to handle the s3 URI.

 Thanks,
 Steve

 On May 13, 2015, at 12:35 PM, jay vyas jayunit100.apa...@gmail.com wrote:


 Might I ask why vfs?  I'm new to vfs and not sure whether or not it
 predates the hadoop file system interfaces (HCFS).

 After all spark natively supports any HCFS by leveraging the hadoop
 FileSystem api and class loaders and so on.

 So simply putting those resources on your classpath should be sufficient
 to directly connect to s3. By using the sc.hadoopFile (...) commands.
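
 (A rough sketch of that classpath-based approach — assuming an existing SparkContext `sc`, that the s3n filesystem jars are already on the executor classpath, and placeholder credentials/bucket names:)

 // Rely on Hadoop's built-in s3n filesystem instead of a separate VFS layer.
 sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
 sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")

 // Any HCFS-compatible URI works once its implementation is on the classpath.
 val lines = sc.textFile("s3n://my-bucket/path/to/data")
 println(lines.count())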

  On May 13, 2015 12:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 Did you happened to have a look at this https://github.com/abashev/vfs-s3

 Thanks
 Best Regards

  On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com wrote:

  We have a small mesos cluster and these slaves need to have a vfs setup
 on
  them so that the slaves can pull down the data they need from S3 when
 spark
  runs.
 
  There doesn’t seem to be any obvious way online on how to do this or how
  easily accomplish this. Does anyone have some best practices or some
 ideas
  about how to accomplish this?
 
  An example stack trace when a job is run on the mesos cluster…
 
  Any idea how to get this going? Like somehow bootstrapping spark on run
 or
  something?
 
  Thanks,
  Steve
 
 
  java.io.IOException: Unsupported scheme s3n for URI s3n://removed
  at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
  15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID
  1)
  java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n
  for URI s3n://removed
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
  at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
  Caused by: java.io.IOException: Unsupported scheme s3n for URI
  s3n://removed
  at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
  at
 
 com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
  ... 8 more
 

testing HTML email

2015-05-14 Thread Reynold Xin
Testing html emails ...

Hello

*This is bold*

This is a link http://databricks.com/


Spark Summit 2015 - June 15-17 - Dev list invite

2015-05-14 Thread Scott walent
Join the Apache Spark community at the fourth Spark Summit in San
Francisco on June 15, 2015. At Spark Summit 2015 you will hear keynotes
from NASA, the CIA, Toyota, Databricks, AWS, Intel, MapR, IBM, Cloudera,
Hortonworks, Timeful, O'Reilly, and Andreessen Horowitz. 260 talk proposals
were submitted by the community, and 55 were accepted. This year you’ll
hear about Spark in use at companies including Uber, Airbnb, Netflix,
Taobao, Red Hat, Edmunds, Oracle and more. See the full agenda at
http://spark-summit.org/2015.

If you are new to Spark or looking to improve your knowledge of the
technology, we have three levels of Spark training: Intro to Spark,
Advanced DevOps with Spark, and Data Science with Spark. Space is limited
and we will sell out, so register now. Use promo code DevList15 to save
15% when registering before June 1, 2015. Register at
http://spark-summit.org/2015/register.

I look forward to seeing you there.

Best,
Scott
The Spark Summit Organizers


RE: testing HTML email

2015-05-14 Thread Ulanov, Alexander
Testing too. Recently I got a few undelivered mails to the dev list.

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, May 14, 2015 3:39 PM
To: dev@spark.apache.org
Subject: testing HTML email

Testing html emails ...

Hello

This is bold

This is a link http://databricks.com/




Re: Change for submitting to yarn in 1.3.1

2015-05-14 Thread Marcelo Vanzin
Hi Chester,

Thanks for the feedback. A few of those are great candidates for
improvements to the launcher library.

On Wed, May 13, 2015 at 5:44 AM, Chester At Work ches...@alpinenow.com
wrote:

  1) client should not be private ( unless alternative is provided) so
 we can call it directly.


Patrick already touched on this subject, but I believe Client should be
kept private. If we want to expose functionality for code launching Spark
apps, Spark should provide an interface for that so that other cluster
managers can benefit. It also keeps the API more consistent (everybody uses
the same API regardless of what's the underlying cluster manager).


  2) we need a way to stop the running yarn app programmatically ( the
 PR is already submitted)


My first reaction to this was with the app id, you can talk to YARN
directly and do that. But given what I wrote above, I guess it would make
sense for something like this to be exposed through the library too.


  3) before we start the spark job, we should have a call back to the
 application, which will provide the yarn container capacity (number of
 cores and max memory ), so spark program will not set values beyond max
 values (PR submitted)


I'm not sure exactly what you mean here, but it feels like we're starting
to get into wrapping the YARN API territory. Someone who really cares
about that information can easily fetch it from YARN, the same way Spark
would.
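
(A rough sketch of doing that directly against the ResourceManager, assuming the stock YarnClient API and that YarnConfiguration picks up the local Hadoop config:)

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConf = new YarnConfiguration()
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(yarnConf)
yarnClient.start()

// Ask the RM for the per-container scheduling limits, so requested executor
// memory/cores can be capped before submitting. (This also allocates a new
// application id, which is fine for a sketch.)
val maxResource = yarnClient.createApplication()
  .getNewApplicationResponse.getMaximumResourceCapability
println(s"max container memory=${maxResource.getMemory} MB, " +
  s"vcores=${maxResource.getVirtualCores}")

yarnClient.stop()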


  4) call back could be in form of yarn app listeners, which call back
 based on yarn status changes ( start, in progress, failure, complete etc),
 application can react based on these events in PR)


Exposing some sort of status for the running application does sound useful.


   5) yarn client passing arguments to the spark program in the form of main
  program arguments: we have experienced problems when passing a very large
  argument due to the length limit. For example, we serialize the argument as
  json and encode it, then parse it as an argument. For wide-column datasets,
  we run into the limit. Therefore, an alternative way of passing additional,
  larger arguments is needed. We are experimenting with passing the args via an
  established akka messaging channel.


I believe you're talking about command line arguments to your application
here? I'm not sure what Spark can do to alleviate this. With YARN, if you
need to pass a lot of information to your application, I'd recommend
creating a file and adding it to --files when calling spark-submit. It's
not optimal, since the file will be distributed to all executors (not just
the AM), but it should help if you're really running into problems, and is
simpler than using akka.
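
(Sketch of the consuming side, assuming a file named args.json shipped via --files — YARN localizes it into the container's working directory for the AM/driver and executors:)

import scala.io.Source

// spark-submit --files args.json ... makes the file available in the
// container's current working directory.
val rawArgs = Source.fromFile("args.json").getLines().mkString("\n")
// Parse with whatever JSON library the application already uses.
println(s"received ${rawArgs.length} characters of configuration")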


 6) spark yarn client in yarn-cluster mode right now is essentially a
 batch job with no communication once it launched. Need to establish the
 communication channel so that logs, errors, status updates, progress bars,
 execution stages etc can be displayed on the application side. We added an
 akka communication channel for this (working on PR ).


This is another thing I'm a little unsure about. It seems that if you
really need to, you can implement your own SparkListener to do all this,
and then you can transfer all that data to wherever you need it to be.
Exposing something like this in Spark would just mean having to support
some new RPC mechanism as a public API, which is kinda burdensome. You
mention yourself you've done something already, which is an indication that
you can do it with existing Spark APIs, which makes it not a great
candidate for a core Spark feature. I guess you can't get logs with that
mechanism, but you can use your custom log4j configuration for that if you
really want to pipe logs to a different location.
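
(For illustration, a minimal SparkListener sketch — `report` stands in for whatever channel the application uses to ship status back:)

import org.apache.spark.scheduler._

// Forward coarse progress events to the application's own channel.
class ProgressForwarder(report: String => Unit) extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    report(s"job ${jobStart.jobId} started")

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    report(s"stage ${stageCompleted.stageInfo.stageId} finished")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    report(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
}

// Registered on the driver, e.g.:
// sc.addSparkListener(new ProgressForwarder(msg => println(msg)))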

But those are a good starting point - knowing what people need is the first
step in adding a new feature. :-)

-- 
Marcelo


Re: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Yi Zhang
If I pass the whole statement as dbtable to sqlContext.load() method as below:

val query =
  """
    (select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory  t2._max_price) EMP
  """.stripMargin
val dataFrame = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> query
))
It works. However, I can't invoke sql() method to solve this problem. And why?



 On Friday, May 15, 2015 11:33 AM, Yi Zhang zhangy...@yahoo.com.INVALID 
wrote:
   

 The sql statement is like this:

select t1._salory as salory,
t1._name as employeeName,
(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
from mock_employees t1
inner join mock_locations t2
on t1._location_id = t2._id
where t1._salory  t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in 
predicates - ASF JIRA is still in progress. And somebody commented that Spark 1.3 
would support it. So I don't know the current status of this feature. Thanks.

Regards,
Yi

RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Cheng, Hao
You need to register the “dataFrame” as a table first and then do queries on 
it? Do you mean that also failed?

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 1:10 PM
To: Yi Zhang; Dev
Subject: Re: Does Spark SQL (JDBC) support nest select with current version

If I pass the whole statement as dbtable to sqlContext.load() method as below:

val query =
  """
    (select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory  t2._max_price) EMP
  """.stripMargin
val dataFrame = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> query
))

It works. However, I can't invoke sql() method to solve this problem. And why?



On Friday, May 15, 2015 11:33 AM, Yi Zhang zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:
select t1._salory as salory,
t1._name as employeeName,
(select _name from mock_locations t3 where t3._id = t1._location_id ) as 
locationName
from mock_employees t1
inner join mock_locations t2
on t1._location_id = t2._id
where t1._salory  t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in 
predicates - ASF JIRA (https://issues.apache.org/jira/browse/SPARK-4226) is 
still in progress. And somebody commented that Spark 1.3 would support 
it. So I don't know the current status of this feature. Thanks.

Regards,
Yi


Build change PSA: Hadoop 2.2 default; -Phadoop-x.y profile recommended for builds

2015-05-14 Thread Sean Owen
This change will be merged shortly for Spark 1.4, and has a minor
implication for those creating their own Spark builds:

https://issues.apache.org/jira/browse/SPARK-7249
https://github.com/apache/spark/pull/5786

The default Hadoop dependency has actually been Hadoop 2.2 for some
time, but the defaults weren't fully consistent as a Hadoop 2.2 build.
That is what this resolves. The discussion highlights that it's
actually not great to rely on the Hadoop defaults, if you care at all
about the Hadoop binding, and that it's good practice to set some
-Phadoop-x.y profile in any build.


The net changes are:

If you don't care about Hadoop at all, you could ignore this. You will
get a consistent Hadoop 2.2 binding by default now. Still, you may
wish to set a Hadoop profile.

If you build for Hadoop 1, you need to set -Phadoop-1 now.

If you build for Hadoop 2.2, you should still set -Phadoop-2.2 even
though this is the default and is a no-op profile now.

You can continue to set other Hadoop profiles and override
hadoop.version; these are unaffected.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Simple SQL queries producing invalid results [SPARK-6743]

2015-05-14 Thread Santiago Mola
Hi,

Could someone give a look to this issue?

[SPARK-6743] Join with empty projection on one side produces invalid results
https://issues.apache.org/jira/browse/SPARK-6743

Thank you,
-- 

Santiago M. Mola


http://www.stratio.com/
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
https://twitter.com/StratioBD*


Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Yi Zhang
The sql statement is like this:

select t1._salory as salory,
t1._name as employeeName,
(select _name from mock_locations t3 where t3._id = t1._location_id ) as 
locationName
from mock_employees t1
inner join mock_locations t2
on t1._location_id = t2._id
where t1._salory  t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in 
predicates - ASF JIRA is still in progress. And somebody commented that Spark 1.3 
would support it. So I don't know the current status of this feature. Thanks.

Regards,
Yi