Need advice on hooking into Sql query plan

2015-11-05 Thread Yana Kadiyska
Hi folks, not sure if this belongs on the dev or user list... sending to dev as
it seems a bit convoluted.

I have a UI in which we allow users to write ad-hoc queries against a (very
large, partitioned) table. I would like to analyze the queries prior to
execution for two purposes:

1. Reject under-constrained queries (i.e. there is a field predicate that I
want to make sure is always present)
2. Augment the query with additional predicates (e.g. if the user asks for a
student_id I also want to push a constraint on another field)

I could parse the SQL string before passing it to Spark, but Spark obviously
already does this anyway. Can someone give me a general direction on how to
do this (if possible)?

Something like

val myDF = sql("user_sql_query")
myDF.queryExecution.logical  // examine the filters provided by the user here;
                             // reject if under-constrained, push new filters
                             // as needed (via withNewChildren?)

At this point, with some luck, I'd have a new LogicalPlan -- what is the
proper way to create an execution plan on top of this new plan? I'm looking
at this:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
but this method is restricted to the package. I'd really prefer to hook
into this as early as possible and still let Spark run the plan
optimizations as usual.
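
To make this concrete, the rough kind of inspection I have in mind (a sketch
only -- the helper is made up and the Catalyst classes/methods are my best
guess, untested):

import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

// Sketch: walk the logical plan and check whether any Filter constrains the
// given column with an equality predicate. Untested; names may need adjusting.
def constrainsColumn(plan: LogicalPlan, col: String): Boolean =
  plan.collect { case f: Filter => f.condition }.exists { cond =>
    cond.collect {
      case EqualTo(a: AttributeReference, _: Literal) if a.name == col => a
    }.nonEmpty
  }

val myDF = sqlContext.sql(userQuery)  // userQuery is the user's ad-hoc SQL string
if (!constrainsColumn(myDF.queryExecution.logical, "teacherName"))
  sys.error("rejecting under-constrained query: teacherName must be constrained")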

Any guidance or pointers much appreciated.


Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Yana Kadiyska
I don't think a view would help -- in the case of under-constraining, I
want to make sure that the user is constraining a column (e.g. I want to
restrict them to querying a single partition at a time but I don't care
which one)...a view per partition value is not practical due to the fairly
high cardinality...

In the case of predicate augmentation, the additional predicate depends on
the value the user is providing e.g. my data is partitioned under
teacherName but the end users don't have this information...So if they ask
for student_id="1234" I'd like to add "teacherName='Smith'" based on a
mapping that is not surfaced to the user (sorry for the contrived
example)...But I don't think I can do this with a view. A join will produce
the right answer but is counter-productive as my goal is to minimize the
partitions being processed.
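
Roughly what I mean by augmentation, as a sketch (the studentToTeacher lookup
and the loadMapping helper are made up for illustration; untested):

import org.apache.spark.sql.DataFrame

// Hypothetical mapping from the user-visible key to the partition column value.
val studentToTeacher: Map[String, String] = loadMapping()

def augment(userDF: DataFrame, studentId: String): DataFrame =
  studentToTeacher.get(studentId) match {
    // appending the filter adds a Filter node to the logical plan, so the
    // optimizer can still push it down and prune partitions
    case Some(teacher) => userDF.filter(s"teacherName = '$teacher'")
    case None => sys.error(s"no partition mapping for student $studentId")
  }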

I can parse the query myself -- I was not fond of this solution as I'd go
from SQL string to parse tree and back to an augmented SQL string, only to
have Spark repeat the first part of the exercise... but I will do it if need
be. And yes, I'd have to be able to process sub-queries too...

On Thu, Nov 5, 2015 at 5:50 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Would it be possible to use views to address some of your requirements?
>
> Alternatively it might be better to parse it yourself. There are open
> source libraries for it, if you really need a complete SQL parser. Do you
> want to do it on sub-queries?
>
> On 05 Nov 2015, at 23:34, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>
> Hi folks, not sure if this belongs to dev or user list..sending to dev as
> it seems a bit convoluted.
>
> I have a UI in which we allow users to write ad-hoc queries against a
> (very large, partitioned) table. I would like to analyze the queries prior
> to execution for two purposes:
>
> 1. Reject under-constrained queries (i.e. there is a field predicate that
> I want to make sure is always present)
> 2. Augment the query with additional predicates (e.g if the user asks for
> a student_id I also want to push a constraint on another field)
>
> I could parse the sql string before passing to spark but obviously spark
> already does this anyway. Can someone give me general direction on how to
> do this (if possible).
>
> Something like
>
> myDF = sql("user_sql_query")
> myDF.queryExecution.logical  //here examine the filters provided by user,
> reject if underconstrained, push new filters as needed (via
> withNewChildren?)
>
> at this point with some luck I'd have a new LogicalPlan -- what is the
> proper way to create an execution plan on top of this new Plan? Im looking
> at this
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
> but this method is restricted to the package. I'd really prefer to hook
> into this as early as possible and still let spark run the plan
> optimizations as usual.
>
> Any guidance or pointers much appreciated.
>
>


Re: Apache gives exception when running groupby on df temp table

2015-07-17 Thread Yana Kadiyska
I think that might be a connector issue. You say you are using Spark 1.4;
are you also using the 1.4 version of the Spark-Cassandra connector? They do
have some bugs around this, e.g.
https://datastax-oss.atlassian.net/browse/SPARKC-195. Also, I see that you
import org.apache.spark.sql.cassandra.CassandraSQLContext, and I've seen
some odd things using that class. Things work out a lot better for me if I
create a DataFrame like this:


val cassDF = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "some_table", "keyspace" -> "myks"))
  .load()
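
For what it's worth, the groupBy from your example then works fine for me
against a DataFrame obtained that way (column names below are just lifted
from your snippet, so adjust to your schema):

val grouped = cassDF.groupBy("brandname", "appname", "packname",
  "eventname", "timezone").count()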


On Fri, Jul 17, 2015 at 10:52 AM, nipun ibnipu...@gmail.com wrote:

 spark version 1.4

 import com.datastax.spark.connector._
 import  org.apache.spark._
 import org.apache.spark.sql.cassandra.CassandraSQLContext
 import org.apache.spark.SparkConf
 //import com.microsoft.sqlserver.jdbc.SQLServerDriver
 import java.sql.Connection
 import java.sql.DriverManager
 import java.io.IOException
 import org.apache.spark.sql.DataFrame

  def populateEvents() : Unit = {

 var query = "SELECT brandname, appname, packname, eventname, " +
   "client, timezone FROM sams.events WHERE eventtime > '" + _from + "' AND " +
   "eventtime < '" + _to + "'"
 // read data from cassandra table
 val rdd = runCassandraQuery(query)

 rdd.registerTempTable("newdf")

 query = "SELECT brandname, appname, packname, eventname, " +
   "client.OSName AS platform, timezone FROM newdf"
 val dfCol = runCassandraQuery(query)

 val grprdd = dfCol.groupBy("brandname", "appname",
   "packname", "eventname", "platform", "timezone").count()

 Do let me know if you need any more information



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-gives-exception-when-running-groupby-on-df-temp-table-tp13275p13285.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Problem with version compatibility

2015-06-25 Thread Yana Kadiyska
Jim, I do something similar to you. I mark all the Spark dependencies as
provided and then make sure to drop the same version of the spark-assembly
jar into my war as I have on the executors. I don't remember whether dropping
it into server/lib works; I think I ran into an issue with that. Would love
to know the best practices when it comes to Tomcat and Spark.
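
For reference, the relevant bit of my build looks roughly like this (a sketch
only; the version numbers are illustrative):

// build.sbt sketch: keep Spark out of the packaged war and rely on the
// spark-assembly jar dropped into the war / present on the executors
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided"
)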

On Thu, Jun 25, 2015 at 11:23 AM, Sean Owen so...@cloudera.com wrote:

 Try putting your same Mesos assembly on the classpath of your client
 then, to emulate what spark-submit does. I don't think you merely also
 want to put it on the classpath but make sure nothing else from Spark
 is coming from your app.

 In 1.4 there is the 'launcher' API which makes programmatic access a
 lot more feasible but still kinda needs you to get Spark code to your
 driver program, and if it's not the same as on your cluster you'd
 still risk some incompatibilities.
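
 A minimal sketch of that launcher API, for reference (the class is
 org.apache.spark.launcher.SparkLauncher; the paths, main class and master
 URL below are placeholders):

 import org.apache.spark.launcher.SparkLauncher

 // sketch only: launches a driver as a child process from your own app
 val driver = new SparkLauncher()
   .setSparkHome("/opt/spark")                         // placeholder
   .setAppResource("/path/to/analytics-assembly.jar")  // placeholder
   .setMainClass("com.example.AnalyticsJob")           // placeholder
   .setMaster("mesos://host:5050")                     // placeholder
   .launch()
 driver.waitFor()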

 On Thu, Jun 25, 2015 at 6:05 PM, jimfcarroll jimfcarr...@gmail.com
 wrote:
  Ah. I've avoided using spark-submit primarily because our use of Spark
 is as
  part of an analytics library that's meant to be embedded in other
  applications with their own lifecycle management.
 
  One of those applications is a REST app running in Tomcat, which will make
 the use of spark-submit difficult (if not impossible).
 
  Also, we're trying to avoid sending jars over the wire per-job and so we
  install our library (minus the spark dependencies) on the mesos workers
 and
  refer to it in the spark configuration using
 spark.executor.extraClassPath
  and if I'm reading SparkSubmit.scala correctly, it looks like the user's
  assembly ends up sent to the cluster (at least in the case of yarn)
 though I
  could be wrong on this.
 
  Is there a standard way of running an app that's in control of its own
  runtime lifecycle without spark-submit?
 
  Thanks again.
  Jim
 
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Problem-with-version-compatibility-tp12861p12894.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[SparkSQL] HiveContext multithreading bug?

2015-05-18 Thread Yana Kadiyska
Hi folks,

wanted to get a sanity check before opening a JIRA. I am trying to do the
following:

create a HiveContext, then from different threads:

1. Create a DataFrame
2. Name said df via registerTempTable
3. do a simple query via sql and dropTempTable

My understanding is that since HiveContext is thread-safe, this should work
fine. But I get into a state where step 3 cannot find the table:
org.apache.spark.sql.AnalysisException: no such table
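
Condensed, each thread does roughly the following (the parquet path and table
names are placeholders; the full code is in the gist below):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// condensed repro sketch; the real code is in the gist linked below
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

val runs = (1 to 4).map { i =>
  Future {
    val df = hiveContext.parquetFile("/tmp/some.parquet")     // 1. create a DataFrame
    val name = s"tmp_table_$i"
    df.registerTempTable(name)                                // 2. name it
    hiveContext.sql(s"SELECT count(*) FROM $name").collect()  // 3. query it...
    hiveContext.dropTempTable(name)                           // ...and drop it
  }
}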

I put the full code and error observed here:
https://gist.github.com/yanakad/7faea20ca980e8085234

I don't think the parquet file itself is relevant but I can provide it if
I'm wrong. I can reproduce this in about 1 in 5 runs...

Thanks!


Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-13 Thread Yana Kadiyska
Since you mentioned this, I had a related quandary recently -- the forum also
says that it archives u...@spark.incubator.apache.org and
d...@spark.incubator.apache.org respectively, yet the Community page clearly
says to email the @spark.apache.org lists (but the nabble archive is linked
right there too). IMO even putting a clear explanation at the top, something
like "Posting here requires that you create an account via the UI. Your
message will be sent to both spark.incubator.apache.org and spark.apache.org"
(if that is the case; I'm not sure which alias nabble posts actually get sent
to), would make things a lot clearer.

On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote:

 I've noticed that several users are attempting to post messages to Spark's
 user / dev mailing lists using the Nabble web UI (
 http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
 are many posts in Nabble that are not posted to the Apache lists and are
 flagged with "This post has NOT been accepted by the mailing list yet"
 errors.

 I suspect that the issue is that users are not completing the sign-up
 confirmation process (
 http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
 which is preventing their emails from being accepted by the mailing list.

 I wanted to mention this issue to the Spark community to see whether there
 are any good solutions to address this.  I have spoken to users who think
 that our mailing list is unresponsive / inactive because their un-posted
 messages haven't received any replies.

 - Josh



Re: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe

2014-12-03 Thread Yana Kadiyska
Thanks Michael, you are correct.

I also opened https://issues.apache.org/jira/browse/SPARK-4702 -- if
someone can comment on why this might be happening that would be great.
This would be a blocker for me using 1.2, and it used to work, so I'm a bit
puzzled. I was hoping that it's again a result of the default profile
switch, but that didn't seem to be the case.

(ps. please advise if this is more user-list appropriate. I'm posting to
dev as it's an RC)

On Tue, Dec 2, 2014 at 8:37 PM, Michael Armbrust mich...@databricks.com
wrote:

 In Hive 0.13 (which is the default for Spark 1.2), parquet is included and
 thus we no longer include the Hive parquet bundle. You can now use the
 included
 ParquetSerDe: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

 If you want to compile Spark 1.2 with Hive 0.12 instead, you can pass
 -Phive-0.12.0, and parquet.hive.serde.ParquetHiveSerDe will be included as
 before.
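
 For example (the table name is a placeholder), an existing table can be
 pointed at the bundled SerDe with something like:

 hiveContext.sql(
   "ALTER TABLE my_parquet_table SET SERDE " +
   "'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'")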

 Michael

 On Tue, Dec 2, 2014 at 9:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Apologies if people get this more than once -- I sent mail to dev@spark
 last night and don't see it in the archives. Trying the incubator list
 now...wanted to make sure it doesn't get lost in case it's a bug...

 -- Forwarded message --
 From: Yana Kadiyska yana.kadiy...@gmail.com
 Date: Mon, Dec 1, 2014 at 8:10 PM
 Subject: [Thrift,1.2 RC] what happened to
 parquet.hive.serde.ParquetHiveSerDe
 To: dev@spark.apache.org


 Hi all, apologies if this is not a question for the dev list -- figured
 User list might not be appropriate since I'm having trouble with the RC
 tag.

 I just tried deploying the RC and running ThriftServer. I see the
 following
 error:

 14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException
 as:anonymous (auth:SIMPLE)
 cause:org.apache.hive.service.cli.HiveSQLException:
 java.lang.RuntimeException:
 MetaException(message:java.lang.ClassNotFoundException Class
 parquet.hive.serde.ParquetHiveSerDe not found)
 14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement:
 org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException:
 MetaException(message:java.lang.ClassNotFoundException Class
 parquet.hive.serde.ParquetHiveSerDe not found)
 at

 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
 at

 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
 at

 org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
 at

 org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)


 I looked at a working installation that I have(build master a few weeks
 ago) and this class used to be included in spark-assembly:

 ls *.jar|xargs grep parquet.hive.serde.ParquetHiveSerDe
 Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar
 matches

 but with the RC build it's not there?

 I tried both the prebuilt CDH drop and later manually built the tag with
 the following command:

  ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0
 -Phive-thriftserver
 $JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar
 |grep parquet.hive.serde.ParquetHiveSerDe

 comes back empty...





Fwd: [Thrift,1.2 RC] what happened to parquet.hive.serde.ParquetHiveSerDe

2014-12-02 Thread Yana Kadiyska
Apologies if people get this more than once -- I sent mail to dev@spark
last night and don't see it in the archives. Trying the incubator list
now...wanted to make sure it doesn't get lost in case it's a bug...

-- Forwarded message --
From: Yana Kadiyska yana.kadiy...@gmail.com
Date: Mon, Dec 1, 2014 at 8:10 PM
Subject: [Thrift,1.2 RC] what happened to
parquet.hive.serde.ParquetHiveSerDe
To: dev@spark.apache.org


Hi all, apologies if this is not a question for the dev list -- figured
User list might not be appropriate since I'm having trouble with the RC tag.

I just tried deploying the RC and running ThriftServer. I see the following
error:

14/12/01 21:31:42 ERROR UserGroupInformation: PriviledgedActionException
as:anonymous (auth:SIMPLE)
cause:org.apache.hive.service.cli.HiveSQLException:
java.lang.RuntimeException:
MetaException(message:java.lang.ClassNotFoundException Class
parquet.hive.serde.ParquetHiveSerDe not found)
14/12/01 21:31:42 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException:
MetaException(message:java.lang.ClassNotFoundException Class
parquet.hive.serde.ParquetHiveSerDe not found)
at
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:212)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)


I looked at a working installation that I have (built from master a few weeks
ago) and this class used to be included in spark-assembly:

ls *.jar|xargs grep parquet.hive.serde.ParquetHiveSerDe
Binary file spark-assembly-1.2.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.0.jar
matches

but with the RC build it's not there?

I tried both the prebuilt CDH drop and later manually built the tag with
the following command:

 ./make-distribution.sh --tgz -Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0
-Phive-thriftserver
$JAVA_HOME/bin/jar -tvf spark-assembly-1.2.0-hadoop2.0.0-mr1-cdh4.2.0.jar
|grep parquet.hive.serde.ParquetHiveSerDe

comes back empty...


Trouble running tests

2014-10-09 Thread Yana
Hi, apologies if I missed a FAQ somewhere.

I am trying to submit a bug fix for the very first time. Reading
instructions, I forked the git repo (at
c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.

I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true

and after a while get the following error: 

[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error]  ^


(I am running this without my changes)

I have 2 questions:
1. How do I fix this?
2. Is there a best practice on what to fork so you start off in a good
state? I'm wondering if I should sync the latest changes or go back to a
tag?

thanks in advance




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org