Need advice on hooking into Sql query plan
Hi folks, not sure if this belongs on the dev or user list... sending to dev as it seems a bit convoluted.

I have a UI in which we allow users to write ad-hoc queries against a (very large, partitioned) table. I would like to analyze the queries prior to execution for two purposes:

1. Reject under-constrained queries (i.e. there is a field predicate that I want to make sure is always present)
2. Augment the query with additional predicates (e.g. if the user asks for a student_id I also want to push a constraint on another field)

I could parse the SQL string before passing it to Spark, but obviously Spark already does this anyway. Can someone give me a general direction on how to do this (if possible)? Something like:

myDF = sql("user_sql_query")
myDF.queryExecution.logical // here examine the filters provided by the user, reject if under-constrained, push new filters as needed (via withNewChildren?)

At this point, with some luck, I'd have a new LogicalPlan -- what is the proper way to create an execution plan on top of this new plan? I'm looking at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329 but this method is restricted to the package. I'd really prefer to hook into this as early as possible and still let Spark run the plan optimizations as usual.

Any guidance or pointers much appreciated.
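To make requirement 1 concrete, here is a minimal sketch of the inspection step (assuming Spark 1.5-era Catalyst classes; checkConstrained, the column name, and the rejection policy are made up for illustration, and only the plan inspection is shown -- not the rewrite):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

def checkConstrained(df: DataFrame, requiredCol: String): Unit = {
  // Use the analyzed plan so column references are resolved, then collect
  // every Filter node and the columns its condition touches.
  val plan: LogicalPlan = df.queryExecution.analyzed
  val filteredCols = plan.collect { case f: Filter => f }
    .flatMap(_.condition.references.map(_.name))
  require(filteredCols.exists(_.equalsIgnoreCase(requiredCol)),
    s"Rejected: query must contain a predicate on '$requiredCol'")
}

val myDF = sqlContext.sql("user_sql_query") // user-supplied SQL
checkConstrained(myDF, "teacherName")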
Re: Need advice on hooking into Sql query plan
I don't think a view would help -- in the case of under-constraining, I want to make sure that the user is constraining a column (e.g. I want to restrict them to querying a single partition at a time, but I don't care which one)... a view per partition value is not practical due to the fairly high cardinality.

In the case of predicate augmentation, the additional predicate depends on the value the user is providing. E.g. my data is partitioned under teacherName but the end users don't have this information... So if they ask for student_id="1234" I'd like to add "teacherName='Smith'" based on a mapping that is not surfaced to the user (sorry for the contrived example). But I don't think I can do this with a view. A join would produce the right answer but is counter-productive, as my goal is to minimize the partitions being processed. A sketch of what I mean is below.

I can parse the query myself -- I was not fond of that solution as I'd go SQL string to parse tree back to augmented SQL string, only to have Spark repeat the first part of the exercise, but I will do it if need be. And yes, I'd have to be able to process sub-queries too...

On Thu, Nov 5, 2015 at 5:50 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Would it be possible to use views to address some of your requirements?
>
> Alternatively it might be better to parse it yourself. There are open
> source libraries for it, if you really need a complete SQL parser. Do you
> want to do it on sub-queries?
>
> On 05 Nov 2015, at 23:34, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
> [original question quoted in full above -- snipped]
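Roughly what I have in mind for the augmentation (a sketch only, using Spark 1.5-era Catalyst classes; the mapping, column names, and augment helper are the contrived example from above, not real code):

import org.apache.spark.sql.catalyst.expressions.{EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

// Hypothetical mapping that is never surfaced to the user.
val teacherForStudent: Map[String, String] = Map("1234" -> "Smith")

// Wrap the analyzed user plan in one extra partition predicate.
def augment(plan: LogicalPlan, teacher: String): LogicalPlan = {
  val teacherAttr = plan.output.find(_.name == "teacherName")
    .getOrElse(sys.error("plan does not expose teacherName"))
  Filter(EqualTo(teacherAttr, Literal(teacher)), plan)
}

val augmented = augment(myDF.queryExecution.analyzed, teacherForStudent("1234"))
// Turning `augmented` back into a DataFrame is the part that currently
// needs the package-private constructor mentioned in my first mail.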
Re: Apache gives exception when running groupby on df temp table
I think that might be a connector issue. You say you are using Spark 1.4 -- are you also using the 1.4 version of the spark-cassandra-connector? They do have some bugs around this, e.g. https://datastax-oss.atlassian.net/browse/SPARKC-195. Also, I see that you import org.apache.spark.sql.cassandra.CassandraSQLContext, and I've seen some odd things using that class. Things work out a lot better for me if I create a DataFrame like this:

val cassDF = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "some_table", "keyspace" -> "myks")).load()

On Fri, Jul 17, 2015 at 10:52 AM, nipun <ibnipu...@gmail.com> wrote:
> spark version 1.4
>
> import com.datastax.spark.connector._
> import org.apache.spark._
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> import org.apache.spark.SparkConf
> //import com.microsoft.sqlserver.jdbc.SQLServerDriver
> import java.sql.Connection
> import java.sql.DriverManager
> import java.io.IOException
> import org.apache.spark.sql.DataFrame
>
> def populateEvents() : Unit = {
>   var query = "SELECT brandname, appname, packname, eventname, client, timezone FROM sams.events WHERE eventtime > '" + _from + "' AND eventtime < '" + _to + "'"
>   // read data from cassandra table
>   val rdd = runCassandraQuery(query)
>   rdd.registerTempTable("newdf")
>   query = "Select brandname, appname, packname, eventname, client.OSName as platform, timezone from newdf"
>   val dfCol = runCassandraQuery(query)
>   val grprdd = dfCol.groupBy("brandname", "appname", "packname", "eventname", "platform", "timezone").count()
>
> Do let me know if you need any more information
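Putting the suggestion together with your quoted code, the shape I'd try is roughly this (a sketch assuming spark-cassandra-connector 1.4 and the sams.events table from your snippet; the time bounds stand in for your _from/_to values):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Placeholder bounds standing in for _from / _to above.
val fromTs = "2015-07-01"
val toTs   = "2015-07-17"

// Read the Cassandra table through the DataFrame source instead of CassandraSQLContext.
val events = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()
  .filter(s"eventtime > '$fromTs' AND eventtime < '$toTs'")

val grouped = events
  .groupBy("brandname", "appname", "packname", "eventname", "timezone")
  .count()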
Re: Problem with version compatibility
Jim, I do something similar to you. I mark all Spark dependencies as provided and then make sure to drop the same version of spark-assembly into my war as I have on the executors. I don't remember if dropping it into server/lib works -- I think I ran into an issue with that. Would love to know best practices when it comes to Tomcat and Spark.

On Thu, Jun 25, 2015 at 11:23 AM, Sean Owen <so...@cloudera.com> wrote:
> Try putting your same Mesos assembly on the classpath of your client then, to emulate what spark-submit does. I don't think you merely also want to put it on the classpath, but make sure nothing else from Spark is coming from your app.
>
> In 1.4 there is the 'launcher' API which makes programmatic access a lot more feasible, but it still kinda needs you to get Spark code to your driver program, and if it's not the same as on your cluster you'd still risk some incompatibilities.
>
> On Thu, Jun 25, 2015 at 6:05 PM, jimfcarroll <jimfcarr...@gmail.com> wrote:
>> Ah. I've avoided using spark-submit primarily because our use of Spark is as part of an analytics library that's meant to be embedded in other applications with their own lifecycle management. One of those applications is a REST app running in Tomcat, which will make the use of spark-submit difficult (if not impossible).
>>
>> Also, we're trying to avoid sending jars over the wire per-job, so we install our library (minus the Spark dependencies) on the Mesos workers and refer to it in the Spark configuration using spark.executor.extraClassPath. If I'm reading SparkSubmit.scala correctly, it looks like the user's assembly ends up sent to the cluster (at least in the case of YARN), though I could be wrong on this.
>>
>> Is there a standard way of running an app that's in control of its own runtime lifecycle without spark-submit?
>>
>> Thanks again.
>> Jim
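For what it's worth, the "provided" setup I mean looks roughly like this in build.sbt (the module list and version are just an example -- match whatever spark-assembly actually sits on your executors):

// build.sbt
libraryDependencies ++= Seq(
  // Compile against Spark but keep it out of the war; the matching
  // spark-assembly jar is dropped in alongside it at deploy time.
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided"
)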
[SparkSQL] HiveContext multithreading bug?
Hi folks, wanted to get a sanity check before opening a JIRA. I am trying to do the following: create a HiveContext, then from different threads:

1. Create a DataFrame
2. Name said DF via registerTempTable
3. Do a simple query via sql and then dropTempTable

My understanding is that since HiveContext is thread-safe, this should work fine. But I get into a state where step 3 cannot find the table: org.apache.spark.sql.AnalysisException: no such table. I put the full code and error observed here: https://gist.github.com/yanakad/7faea20ca980e8085234. I don't think the parquet file itself is relevant, but I can provide it if I'm wrong. I can reproduce this in about 1 in 5 runs...

Thanks!
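The shape of what I'm running is roughly this (a condensed sketch, not the exact gist; the parquet path, table names, and thread count are placeholders):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each task registers, queries, and drops its own uniquely named temp table
// against the single shared HiveContext.
val tasks = (1 to 8).map { i =>
  Future {
    val df = hiveContext.read.parquet("/tmp/some.parquet") // placeholder path
    val name = s"tmp_$i"
    df.registerTempTable(name)
    val rows = hiveContext.sql(s"SELECT count(*) FROM $name").collect()
    hiveContext.dropTempTable(name)
    rows
  }
}
Await.result(Future.sequence(tasks), 5.minutes)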
Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet
Since you mentioned this, I had a related quandary recently -- the forum also says that it archives u...@spark.incubator.apache.org and d...@spark.incubator.apache.org respectively, yet the Community page clearly says to email the @spark.apache.org lists (but the Nabble archive is linked right there too). IMO even putting a clear explanation at the top -- "Posting here requires that you create an account via the UI. Your message will be sent to both spark.incubator.apache.org and spark.apache.org" (if that is the case; I'm not sure which alias Nabble posts get sent to) -- would make things a lot more clear.

On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
> I've noticed that several users are attempting to post messages to Spark's user / dev mailing lists using the Nabble web UI (http://apache-spark-user-list.1001560.n3.nabble.com/). However, there are many posts in Nabble that are not posted to the Apache lists and are flagged with "This post has NOT been accepted by the mailing list yet." errors.
>
> I suspect that the issue is that users are not completing the sign-up confirmation process (http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1), which is preventing their emails from being accepted by the mailing list. I wanted to mention this issue to the Spark community to see whether there are any good solutions to address this. I have spoken to users who think that our mailing list is unresponsive / inactive because their un-posted messages haven't received any replies.
>
> - Josh
Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe
Thanks Michael, you are correct. I also opened https://issues.apache.org/jira/browse/SPARK-4702 -- if someone can comment on why this might be happening, that would be great. This would be a blocker to me using 1.2, and it used to work, so I'm a bit puzzled. I was hoping that it's again a result of the default profile switch, but that didn't seem to be the case.

(ps. please advise if this is more user-list appropriate. I'm posting to dev as it's an RC)

On Tue, Dec 2, 2014 at 8:37 PM, Michael Armbrust <mich...@databricks.com> wrote:
> In Hive 13 (which is the default for Spark 1.2), parquet is included and thus we no longer include the Hive parquet bundle. You can now use the included ParquetSerDe: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>
> If you want to compile Spark 1.2 with Hive 12 instead, you can pass -Phive-0.12.0 and parquet.hive.serde.ParquetHiveSerDe will be included as before.
>
> Michael
>
> On Tue, Dec 2, 2014 at 9:31 AM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>> Apologies if people get this more than once -- I sent mail to dev@spark last night and don't see it in the archives. Trying the incubator list now... wanted to make sure it doesn't get lost in case it's a bug...
>>
>> ---------- Forwarded message ----------
>> From: Yana Kadiyska <yana.kadiy...@gmail.com>
>> Date: Mon, Dec 1, 2014 at 8:10 PM
>> Subject: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe
>> To: dev@spark.apache.org
>>
>> Hi all, apologies if this is not a question for the dev list -- figured the user list might not be appropriate since I'm having trouble with the RC tag. I just tried deploying the RC and running ThriftServer. I see the following error:
>>
>> 14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException as:anonymous (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found)
>> 14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement:
>> org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found)
>>         at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
>>         at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>>         at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>>         at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>>         at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>
>> I looked at a working installation that I have (built from master a few weeks ago) and this class used to be included in spark-assembly:
>>
>> ls *.jar | xargs grep parquet.hive.serde.ParquetHiveSerDe
>> Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar matches
>>
>> but with the RC build it's not there?
>> I tried both the prebuilt CDH drop and later manually built the tag with the following command:
>>
>> ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver
>>
>> $JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar | grep parquet.hive.serde.ParquetHiveSerDe
>>
>> comes back empty...
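For anyone hitting the same ClassNotFoundException on existing tables: a sketch of repointing a table at the SerDe Michael says is bundled with the Hive 0.13 build (the table name is hypothetical, and this assumes a HiveContext on the 1.2 RC):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Repoint an existing Parquet table at the SerDe bundled with Hive 0.13.
hiveContext.sql(
  "ALTER TABLE my_parquet_table SET SERDE " +
  "'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'")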
Fwd: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe
Apologies if people get this more than once -- I sent mail to dev@spark last night and don't see it in the archives. Trying the incubator list now... wanted to make sure it doesn't get lost in case it's a bug...

---------- Forwarded message ----------
From: Yana Kadiyska <yana.kadiy...@gmail.com>
Date: Mon, Dec 1, 2014 at 8:10 PM
Subject: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe
To: dev@spark.apache.org

Hi all, apologies if this is not a question for the dev list -- figured the user list might not be appropriate since I'm having trouble with the RC tag. I just tried deploying the RC and running ThriftServer. I see the following error:

14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException as:anonymous (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found)
14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class parquet.hive.serde.ParquetHiveSerDe not found)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
        at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
        at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)

I looked at a working installation that I have (built from master a few weeks ago) and this class used to be included in spark-assembly:

ls *.jar | xargs grep parquet.hive.serde.ParquetHiveSerDe
Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar matches

but with the RC build it's not there? I tried both the prebuilt CDH drop and later manually built the tag with the following command:

./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver

$JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar | grep parquet.hive.serde.ParquetHiveSerDe

comes back empty...
Trouble running tests
Hi, apologies if I missed a FAQ somewhere. I am trying to submit a bug fix for the very first time. Reading the instructions, I forked the git repo (at c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests. I run this:

./dev/run-tests _SQL_TESTS_ONLY=true

and after a while get the following error:

[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error]                   ^

(I am running this without my changes.)

I have 2 questions:
1. How do I fix this?
2. Is there a best practice on what to fork so you start off with a good state? I'm wondering if I should sync the latest changes or go back to a label?

Thanks in advance