Re: Does Spark SQL (JDBC) support nested select with current version
@Hao, Because the query joins more than one table, if I register the data frame as a temp table, Spark can't distinguish which table is correct. I don't know how to set dbtable and register the temp table. Any suggestion?

On Friday, May 15, 2015 1:38 PM, Cheng, Hao hao.ch...@intel.com wrote: You need to register the "dataFrame" as a table first and then do queries on it? Do you mean that also failed?

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 1:10 PM To: Yi Zhang; Dev Subject: Re: Does Spark SQL (JDBC) support nested select with current version

If I pass the whole statement as dbtable to the sqlContext.load() method as below:

val query = """(select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory < t2._max_price) EMP""".stripMargin

val dataFrame = sqlContext.load("jdbc", Map(
  "url"     -> url,
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> query))

It works. However, I can't invoke the sql() method to solve this problem. And why?

On Friday, May 15, 2015 11:33 AM, Yi Zhang zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:

select t1._salory as salory,
       t1._name as employeeName,
       (select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
from mock_employees t1
inner join mock_locations t2 on t1._location_id = t2._id
where t1._salory < t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in predicates (https://issues.apache.org/jira/browse/SPARK-4226) is still in progress, and somebody commented that Spark 1.3 would support it, so I don't know the current status of this feature. Thanks.
Regards, Yi
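For reference, a minimal sketch of a workaround that stays within the thread's constraints, assuming Spark 1.3-era APIs, the table and column names quoted above, and that _id is the key of mock_locations (the `url` value comes from the original mail): load each base table separately through the jdbc source, register both as temp tables, and fold the correlated scalar subquery into the existing join, since sqlContext.sql() did not yet support subqueries.

```
// Hedged sketch only: register each MySQL table so that sqlContext.sql()
// sees two plain tables instead of a nested select pushed down as dbtable.
val employees = sqlContext.load("jdbc", Map(
  "url"     -> url,
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> "mock_employees"))
val locations = sqlContext.load("jdbc", Map(
  "url"     -> url,
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> "mock_locations"))

employees.registerTempTable("mock_employees")
locations.registerTempTable("mock_locations")

// The scalar subquery (select _name from mock_locations t3
// where t3._id = t1._location_id) uses the same key as the inner join,
// so it can be folded into that join.
val result = sqlContext.sql("""
  select t1._salory as salory,
         t1._name as employeeName,
         t2._name as locationName
  from mock_employees t1
  inner join mock_locations t2 on t1._location_id = t2._id
  where t1._salory < t2._max_price
""")
```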
Re: How to link code pull request with JIRA ID?
Yeah I wrote the original script and I intentionally made it easy for other projects to use (you'll just need to tweak some variables at the top). You just need somewhere to run it... we were using a Jenkins cluster to run it every 5 minutes. BTW - I looked and there is one instance where it hard-codes the string "SPARK-", but that should be easy to change. I'm happy to review a patch that makes that prefix a variable. https://github.com/apache/spark/blob/master/dev/github_jira_sync.py#L71 - Patrick

On Thu, May 14, 2015 at 8:45 AM, Josh Rosen rosenvi...@gmail.com wrote: Spark PRs didn't always handle the JIRA linking. We used to rely on a Jenkins job that ran https://github.com/apache/spark/blob/master/dev/github_jira_sync.py. We switched this over to Spark PRs at a time when the Jenkins GitHub Pull Request Builder plugin was having flakiness issues, but as far as I know that old script should still work.

On Wed, May 13, 2015 at 9:40 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There's no magic to it. We're doing the same, except Josh automated it in the PR dashboard he created. https://spark-prs.appspot.com/ Nick

On Wed, May 13, 2015 at 6:20 PM Markus Weimer mar...@weimo.de wrote: Hi, how did you set this up? Over in the REEF incubation project, we painstakingly create the forwards- and backwards links despite having the IDs in the PR descriptions... Thanks! Markus

On 2015-05-13 11:56, Ted Yu wrote: The subproject tag should follow the SPARK JIRA number, e.g. [SPARK-5277][SQL] ... Cheers

On Wed, May 13, 2015 at 11:50 AM, Stephen Boesch java...@gmail.com wrote: Following up from Nicholas, it is [SPARK-12345] Your PR description, where 12345 is the jira number. One thing I tend to forget is when/where to include the subproject tag e.g. [MLLIB]

2015-05-13 11:11 GMT-07:00 Nicholas Chammas nicholas.cham...@gmail.com: That happens automatically when you open a PR with the JIRA key in the PR title.

On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: Hi, I am new to open source contribution and am trying to understand the process, starting from pulling code to uploading a patch. I have managed to pull code from GitHub. In JIRA I saw that each JIRA issue is connected with a pull request. I would like to know how people attach pull request details to a JIRA issue? Thanks, Chandrash3khar Kotekar Mobile - +91 8600011455
Re: How to link code pull request with JIRA ID?
Spark PRs didn't always handle the JIRA linking. We used to rely on a Jenkins job that ran https://github.com/apache/spark/blob/master/dev/github_jira_sync.py. We switched this over to Spark PRs at a time when the Jenkins GitHub Pull Request Builder plugin was having flakiness issues, but as far as I know that old script should still work.

On Wed, May 13, 2015 at 9:40 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: There's no magic to it. We're doing the same, except Josh automated it in the PR dashboard he created. https://spark-prs.appspot.com/ Nick

On Wed, May 13, 2015 at 6:20 PM Markus Weimer mar...@weimo.de wrote: Hi, how did you set this up? Over in the REEF incubation project, we painstakingly create the forwards- and backwards links despite having the IDs in the PR descriptions... Thanks! Markus

On 2015-05-13 11:56, Ted Yu wrote: The subproject tag should follow the SPARK JIRA number, e.g. [SPARK-5277][SQL] ... Cheers

On Wed, May 13, 2015 at 11:50 AM, Stephen Boesch java...@gmail.com wrote: Following up from Nicholas, it is [SPARK-12345] Your PR description, where 12345 is the jira number. One thing I tend to forget is when/where to include the subproject tag e.g. [MLLIB]

2015-05-13 11:11 GMT-07:00 Nicholas Chammas nicholas.cham...@gmail.com: That happens automatically when you open a PR with the JIRA key in the PR title.

On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: Hi, I am new to open source contribution and am trying to understand the process, starting from pulling code to uploading a patch. I have managed to pull code from GitHub. In JIRA I saw that each JIRA issue is connected with a pull request. I would like to know how people attach pull request details to a JIRA issue? Thanks, Chandrash3khar Kotekar Mobile - +91 8600011455
Error recovery strategies using the DirectKafkaInputDStream
We've been using the new DirectKafkaInputDStream to implement an exactly-once processing solution that tracks the provided offset ranges within the same transaction that persists our data results. When an exception is thrown within the processing loop and the configured number of retries is exhausted, the stream will skip to the end of the failed range of offsets and continue on with the next RDD. Makes sense, but we're wondering how others would handle recovering from failures.

In our case the cause of the exception was a temporary outage of a needed service. Since the transaction rolled back at the point of failure, our offset tracking table retained the correct offsets, so we simply needed to restart the Spark process, whereupon it happily picked up at the correct point and continued. Short of the restart, do people have any good ideas for how we might recover?

FWIW, we've looked at setting the spark.task.maxFailures param to a large value and looked for a property that would increase the wait between attempts. This might mitigate the issue when the availability problem is short-lived but wouldn't completely eliminate the need to restart. Any thoughts, ideas welcome.
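For reference, a minimal sketch of the offset-tracking pattern described above, assuming hypothetical helpers (`db.withTransaction`, `saveResults`, `saveOffsets`, `parse`) rather than any real library API, and a `stream` created with KafkaUtils.createDirectStream: results and offset ranges are persisted in one transaction, so a failed batch rolls back both and a restarted job resumes from the last committed offsets.

```
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges are read on the driver, before any transformation
  // that changes the partitioning.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // `parse` stands in for whatever per-record processing the job does.
  val results = rdd.map(parse).collect()

  // Hypothetical transactional helpers: commit output and offsets together,
  // so a failure rolls both back and a restart resumes from the stored offsets.
  db.withTransaction { tx =>
    saveResults(tx, results)
    saveOffsets(tx, offsetRanges)
  }
}
```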
Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream
Hey Cody (et al.), A few more questions related to this. It sounds like our missing-data issues appear fixed with this approach. Could you shed some light on a few questions that came up?

- Processing our data inside a single foreachPartition function appears to be very different from the pattern seen in the programming guide. Does this become problematic with additional, interleaved reduce/filter/map steps?

```
# typical?
rdd
  .map { ... }
  .reduce { ... }
  .filter { ... }
  .reduce { ... }
  .foreachRdd { writeToDb }

# with foreachPartition
rdd.foreachPartition { case (iter) =>
  iter
    .map { ... }
    .reduce { ... }
    .filter { ... }
    .reduce { ... }
}
```

- Could the above be simplified by having one kafka partition per DStream, rather than one kafka partition per RDD partition? That way, we wouldn't need to do our processing inside each partition, as there would only be one set of kafka metadata to commit. Presumably, one could `join` DStreams when topic-level aggregates were needed. It seems this was the approach of Michael Noll in his blog post (http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/), although his primary motivation appears to be maintaining high throughput / parallelism rather than kafka metadata.

- From the blog post: "... there is no long-running receiver task that occupies a core per stream regardless of what the message volume is." Is this because data is retrieved by polling rather than maintaining a socket? Is it still the case that there is only one receiver process per DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1 .. else discover the machine's NIC limit? Can you think of a reason not to do this? Cluster utilization, or the like, perhaps?

And a seemingly silly question, but does `foreachPartition` guarantee that a single worker will process the passed function? Or might two workers split the work? E.g.

foreachPartition(f)
Worker 1: f( Iterator[partition 1 records 1 - 50] )
Worker 2: f( Iterator[partition 1 records 51 - 100] )

It is unclear from the scaladocs (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD). But you can imagine, if it is critical that this data be committed in a single transaction, that two workers will have issues.

-- Will O
Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream
Sorry, realized I probably didn't fully answer your question about my blog post, as opposed to Michael Noll's. The direct stream is really blunt: a given RDD partition is just a kafka topic/partition and an upper/lower bound for the range of offsets. When an executor computes the partition, it connects to kafka, pulls only those messages, then closes the connection. There's no long-running receiver at all, no caching of connections (I found caching sockets didn't matter much). You get much better cluster utilization that way, because if a partition is relatively small compared to the others in the RDD, the executor gets done with it and gets scheduled another one to work on. With long-running receivers, spark acts like the receiver takes up a core even if it isn't doing much. Look at the CPU graph on slide 13 of the link I posted.

On Thu, May 14, 2015 at 4:21 PM, Cody Koeninger c...@koeninger.org wrote: If the transformation you're trying to do really is per-partition, it shouldn't matter whether you're using scala methods or spark methods. The parallel speedup you're getting is all from doing the work on multiple machines, and shuffle or caching or other benefits of spark aren't a factor. If using scala methods bothers you, do all of your transformation using spark methods, collect the results back to the driver, and save them with the offsets there:

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.some.chain.of.spark.calls.collect
  save(offsets, results)
}

My work-in-progress slides for my talk at the upcoming spark conference are here: http://koeninger.github.io/kafka-exactly-once/ if that clarifies that point a little bit (slides 20 vs 21). The direct stream doesn't use long-running receivers, so the concerns that blog post is trying to address don't really apply. Under normal operation a given partition of an rdd is only going to be handled by a single executor at a time (as long as you don't turn on speculative execution... or I suppose it might be possible in some kind of network partition situation). Transactionality should save you even if something weird happens though.

On Thu, May 14, 2015 at 3:44 PM, will-ob will.obr...@tapjoy.com wrote: Hey Cody (et al.), A few more questions related to this. It sounds like our missing-data issues appear fixed with this approach. Could you shed some light on a few questions that came up?

- Processing our data inside a single foreachPartition function appears to be very different from the pattern seen in the programming guide. Does this become problematic with additional, interleaved reduce/filter/map steps?

```
# typical?
rdd
  .map { ... }
  .reduce { ... }
  .filter { ... }
  .reduce { ... }
  .foreachRdd { writeToDb }

# with foreachPartition
rdd.foreachPartition { case (iter) =>
  iter
    .map { ... }
    .reduce { ... }
    .filter { ... }
    .reduce { ... }
}
```

- Could the above be simplified by having one kafka partition per DStream, rather than one kafka partition per RDD partition? That way, we wouldn't need to do our processing inside each partition, as there would only be one set of kafka metadata to commit. Presumably, one could `join` DStreams when topic-level aggregates were needed. It seems this was the approach of Michael Noll in his blog post (http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/), although his primary motivation appears to be maintaining high throughput / parallelism rather than kafka metadata.

- From the blog post:
"... there is no long-running receiver task that occupies a core per stream regardless of what the message volume is." Is this because data is retrieved by polling rather than maintaining a socket? Is it still the case that there is only one receiver process per DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1 .. else discover the machine's NIC limit? Can you think of a reason not to do this? Cluster utilization, or the like, perhaps?

And a seemingly silly question, but does `foreachPartition` guarantee that a single worker will process the passed function? Or might two workers split the work? E.g.

foreachPartition(f)
Worker 1: f( Iterator[partition 1 records 1 - 50] )
Worker 2: f( Iterator[partition 1 records 51 - 100] )

It is unclear from the scaladocs (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD). But you can imagine, if it is critical that this data be committed in a single transaction, that two workers will have issues.

-- Will O
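On the single-transaction question above, a minimal per-partition sketch (again with hypothetical helpers `db.withTransaction`, `saveRecords`, `saveOffset`, `process`, and a `stream` from KafkaUtils.createDirectStream): with the direct stream each RDD partition maps to exactly one Kafka topic/partition offset range, so a partition's output and its offset range can be committed together on the executor.

```
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Captured on the driver; for the direct stream the array index
  // lines up with the RDD partition index.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val range   = offsetRanges(TaskContext.get.partitionId)
    val records = iter.map(process).toSeq

    // Hypothetical transactional helpers: one commit per partition,
    // pairing this partition's output with its Kafka offset range.
    db.withTransaction { tx =>
      saveRecords(tx, records)
      saveOffset(tx, range)
    }
  }
}
```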
Re: s3 vfs on Mesos Slaves
Another way is to configure S3 as Tachyon's under storage system, and then run Spark on Tachyon. More info: http://tachyon-project.org/Setup-UFS.html Best, Haoyuan

On Wed, May 13, 2015 at 10:52 AM, Stephen Carman scar...@coldlight.com wrote: Thank you for the suggestions. The problem is that we need to initialize the vfs s3 driver, so what you suggested, Akhil, wouldn't fix the problem. Basically a job is submitted to the cluster and it tries to pull down the data from s3, but fails because the s3 uri hasn't been initialized in the vfs and it doesn't know how to handle the URI. What I'm asking is: how do we, before the job is run, run some bootstrapping or setup code that will let us do this initialization or configuration step for the vfs, so that when it executes the job it has the information it needs to be able to handle the s3 URI. Thanks, Steve

On May 13, 2015, at 12:35 PM, jay vyas jayunit100.apa...@gmail.com wrote: Might I ask why vfs? I'm new to vfs and not sure whether or not it predates the hadoop file system interfaces (HCFS). After all, spark natively supports any HCFS by leveraging the hadoop FileSystem api and class loaders and so on. So simply putting those resources on your classpath should be sufficient to directly connect to s3, by using the sc.hadoopFile(...) commands.

On May 13, 2015 12:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen to have a look at this? https://github.com/abashev/vfs-s3 Thanks Best Regards

On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com wrote: We have a small mesos cluster and these slaves need to have a vfs setup on them so that the slaves can pull down the data they need from S3 when spark runs. There doesn't seem to be any obvious way online on how to do this or how to easily accomplish this. Does anyone have some best practices or some ideas about how to accomplish this? An example stack trace when a job is run on the mesos cluster… Any idea how to get this going? Like somehow bootstrapping spark on run or something?
Thanks, Steve

java.io.IOException: Unsupported scheme s3n for URI s3n://removed
    at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
    at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
    at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
    at com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
    at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/05/12 13:57:51 ERROR Executor: Exception in task 0.1 in stage 0.0 (TID 1)
java.lang.RuntimeException: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
    at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:307)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unsupported scheme s3n for URI s3n://removed
    at com.coldlight.ccc.vfs.NeuronPath.toPath(NeuronPath.java:43)
    at com.coldlight.neuron.data.ClquetPartitionedData.makeInputStream(ClquetPartitionedData.java:465)
    at com.coldlight.neuron.data.ClquetPartitionedData.access$200(ClquetPartitionedData.java:42)
    at com.coldlight.neuron.data.ClquetPartitionedData$Iter.<init>(ClquetPartitionedData.java:330)
    at com.coldlight.neuron.data.ClquetPartitionedData.compute(ClquetPartitionedData.java:304)
    ... 8 more
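For comparison, a minimal sketch of the plain Hadoop-filesystem route jay describes, assuming the s3n connector classes and the credentials are available on the classpath (bucket path and environment variable names are placeholders): set the keys on the SparkContext's Hadoop configuration and read the s3n:// URI directly, with no VFS driver to initialize.

```
// Sketch only: relies on the standard Hadoop s3n filesystem rather than
// commons-vfs; credentials and the bucket path below are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("s3n://some-bucket/some/prefix/")
println(s"line count: ${lines.count()}")
```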
testing HTML email
Testing html emails ... Hello *This is bold* This is a link http://databricks.com/
Spark Summit 2015 - June 15-17 - Dev list invite
Join the Apache Spark community at the fourth Spark Summit in San Francisco on June 15, 2015. At Spark Summit 2015 you will hear keynotes from NASA, the CIA, Toyota, Databricks, AWS, Intel, MapR, IBM, Cloudera, Hortonworks, Timeful, O'Reilly, and Andreessen Horowitz. 260 talk proposals were submitted by the community, and 55 were accepted. This year you'll hear about Spark in use at companies including Uber, Airbnb, Netflix, Taobao, Red Hat, Edmunds, Oracle and more. See the full agenda at http://spark-summit.org/2015.

If you are new to Spark or looking to improve your knowledge of the technology, we have three levels of Spark training: Intro to Spark, Advanced DevOps with Spark, and Data Science with Spark. Space is limited and we will sell out, so register now. Use promo code DevList15 to save 15% when registering before June 1, 2015. Register at http://spark-summit.org/2015/register.

I look forward to seeing you there. Best, Scott The Spark Summit Organizers
RE: testing HTML email
Testing too. Recently I got a few undelivered mails to the dev list. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, May 14, 2015 3:39 PM To: dev@spark.apache.org Subject: testing HTML email Testing html emails ... Hello This is bold This is a link: http://databricks.com/
Re: Change for submitting to yarn in 1.3.1
Hi Chester, Thanks for the feedback. A few of those are great candidates for improvements to the launcher library.

On Wed, May 13, 2015 at 5:44 AM, Chester At Work ches...@alpinenow.com wrote: 1) client should not be private (unless an alternative is provided) so we can call it directly.

Patrick already touched on this subject, but I believe Client should be kept private. If we want to expose functionality for code launching Spark apps, Spark should provide an interface for that so that other cluster managers can benefit. It also keeps the API more consistent (everybody uses the same API regardless of what's the underlying cluster manager).

2) we need a way to stop the running yarn app programmatically (the PR is already submitted)

My first reaction to this was: with the app id, you can talk to YARN directly and do that. But given what I wrote above, I guess it would make sense for something like this to be exposed through the library too.

3) before we start the spark job, we should have a callback to the application, which will provide the yarn container capacity (number of cores and max memory), so the spark program will not set values beyond the max values (PR submitted)

I'm not sure exactly what you mean here, but it feels like we're starting to get into wrapping-the-YARN-API territory. Someone who really cares about that information can easily fetch it from YARN, the same way Spark would.

4) the callback could be in the form of yarn app listeners, which call back based on yarn status changes (start, in progress, failure, complete etc); the application can react based on these events (in PR)

Exposing some sort of status for the running application does sound useful.

5) when the yarn client passes arguments to the spark program via the main program, we have experienced problems when we pass a very large argument due to the length limit. For example, we use json to serialize the argument and encode it, then parse it as an argument. For wide-column datasets, we will run into the limit. Therefore, an alternative way of passing additional, larger arguments is needed. We are experimenting with passing the args via an established akka messaging channel.

I believe you're talking about command line arguments to your application here? I'm not sure what Spark can do to alleviate this. With YARN, if you need to pass a lot of information to your application, I'd recommend creating a file and adding it to --files when calling spark-submit. It's not optimal, since the file will be distributed to all executors (not just the AM), but it should help if you're really running into problems, and is simpler than using akka.

6) the spark yarn client in yarn-cluster mode right now is essentially a batch job with no communication once it is launched. We need to establish a communication channel so that logs, errors, status updates, progress bars, execution stages etc. can be displayed on the application side. We added an akka communication channel for this (working on a PR).

This is another thing I'm a little unsure about. It seems that if you really need to, you can implement your own SparkListener to do all this, and then you can transfer all that data to wherever you need it to be. Exposing something like this in Spark would just mean having to support some new RPC mechanism as a public API, which is kinda burdensome. You mention yourself you've done something already, which is an indication that you can do it with existing Spark APIs, which makes it not a great candidate for a core Spark feature.
I guess you can't get logs with that mechanism, but you can use your custom log4j configuration for that if you really want to pipe logs to a different location. But those are a good starting point - knowing what people need is the first step in adding a new feature. :-) -- Marcelo
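On point 5, a minimal sketch of the --files route suggested above (the file name and the reading location are assumptions for illustration, not a prescribed API): ship the large argument as a file with spark-submit --files, then read it back inside the application instead of passing it on the command line.

```
// Sketch only: a file passed via
//   spark-submit --files job-args.json ...
// is localized into the working directory of the YARN containers in
// yarn-cluster mode, so the driver can read it back by its simple name.
import scala.io.Source

val argsJson: String = {
  val src = Source.fromFile("job-args.json")
  try src.mkString finally src.close()
}
// ... parse argsJson with whatever JSON library the application already uses
```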
Re: Does Spark SQL (JDBC) support nested select with current version
If I pass the whole statement as dbtable to the sqlContext.load() method as below:

val query = """(select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory < t2._max_price) EMP""".stripMargin

val dataFrame = sqlContext.load("jdbc", Map(
  "url"     -> url,
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> query))

It works. However, I can't invoke the sql() method to solve this problem. And why?

On Friday, May 15, 2015 11:33 AM, Yi Zhang zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:

select t1._salory as salory,
       t1._name as employeeName,
       (select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
from mock_employees t1
inner join mock_locations t2 on t1._location_id = t2._id
where t1._salory < t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in predicates (https://issues.apache.org/jira/browse/SPARK-4226) is still in progress, and somebody commented that Spark 1.3 would support it, so I don't know the current status of this feature. Thanks.

Regards, Yi
RE: Does Spark SQL (JDBC) support nested select with current version
You need to register the "dataFrame" as a table first and then do queries on it? Do you mean that also failed?

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 1:10 PM To: Yi Zhang; Dev Subject: Re: Does Spark SQL (JDBC) support nested select with current version

If I pass the whole statement as dbtable to the sqlContext.load() method as below:

val query = """(select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory < t2._max_price) EMP""".stripMargin

val dataFrame = sqlContext.load("jdbc", Map(
  "url"     -> url,
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> query))

It works. However, I can't invoke the sql() method to solve this problem. And why?

On Friday, May 15, 2015 11:33 AM, Yi Zhang zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:

select t1._salory as salory,
       t1._name as employeeName,
       (select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
from mock_employees t1
inner join mock_locations t2 on t1._location_id = t2._id
where t1._salory < t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in predicates (https://issues.apache.org/jira/browse/SPARK-4226) is still in progress, and somebody commented that Spark 1.3 would support it, so I don't know the current status of this feature. Thanks.

Regards, Yi
Build change PSA: Hadoop 2.2 default; -Phadoop-x.y profile recommended for builds
This change will be merged shortly for Spark 1.4, and has a minor implication for those creating their own Spark builds: https://issues.apache.org/jira/browse/SPARK-7249 https://github.com/apache/spark/pull/5786

The default Hadoop dependency has actually been Hadoop 2.2 for some time, but the defaults weren't fully consistent as a Hadoop 2.2 build. That is what this resolves. The discussion highlights that it's actually not great to rely on the Hadoop defaults, if you care at all about the Hadoop binding, and that it's good practice to set some -Phadoop-x.y profile in any build.

The net changes are:
- If you don't care about Hadoop at all, you could ignore this. You will get a consistent Hadoop 2.2 binding by default now. Still, you may wish to set a Hadoop profile.
- If you build for Hadoop 1, you need to set -Phadoop-1 now.
- If you build for Hadoop 2.2, you should still set -Phadoop-2.2 even though this is the default and is a no-op profile now.
- You can continue to set other Hadoop profiles and override hadoop.version; these are unaffected.
Simple SQL queries producing invalid results [SPARK-6743]
Hi, Could someone take a look at this issue? [SPARK-6743] Join with empty projection on one side produces invalid results https://issues.apache.org/jira/browse/SPARK-6743 Thank you, -- Santiago M. Mola http://www.stratio.com/ Vía de las dos Castillas, 33, Ática 4, 3ª Planta 28224 Pozuelo de Alarcón, Madrid Tel: +34 91 828 6473 // www.stratio.com // @stratiobd https://twitter.com/StratioBD
Does Spark SQL (JDBC) support nested select with current version
The sql statement is like this:

select t1._salory as salory,
       t1._name as employeeName,
       (select _name from mock_locations t3 where t3._id = t1._location_id) as locationName
from mock_employees t1
inner join mock_locations t2 on t1._location_id = t2._id
where t1._salory < t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in predicates (https://issues.apache.org/jira/browse/SPARK-4226) is still in progress, and somebody commented that Spark 1.3 would support it, so I don't know the current status of this feature. Thanks.

Regards, Yi