Multiple Thrift servers on one Spark cluster
Hi, is there a way to instantiate multiple Thrift servers on one Spark cluster? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Thrift-servers-on-one-Spark-cluster-tp24148.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
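[One approach worth trying, assuming the cluster has resources for two applications: launch two instances of start-thriftserver.sh, each bound to a different Thrift port. The master URL and port numbers below are placeholders.]

```shell
# Start the first Thrift server on the default port (10000).
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master spark://master:7077 \
  --hiveconf hive.server2.thrift.port=10000

# Start a second instance of the same script on a different port
# (10001 here is arbitrary); each instance runs as its own Spark app.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master spark://master:7077 \
  --hiveconf hive.server2.thrift.port=10001
```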
Re: Add row IDs column to data frame
Hi, I just checked, and there is a method called withColumn:

def withColumn(colName: String, col: Column): DataFrame

(See http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html.) It returns a new DataFrame by adding a column. I can't test it now, but I think it should work. As I see it, the whole idea of data frames is to make them like data frames in R, and in R you can do that easily.

It was late last night and I was tired, but my idea was this: iterate over the first set and add an index to every row using accumulators, then iterate over the other set and add an index from another accumulator, then create tuples keyed by the indexes and join. It is ugly and inefficient, and you should avoid it. :]

Best Bojan

On Thu, Apr 9, 2015 at 1:35 AM, barmaley [via Apache Spark User List] wrote:

Hi Bojan, could you please expand on your idea of how to append to an RDD? I can see how to append a constant value to each row of an RDD:

// oldRDD: RDD[Array[String]]
val c = const
val newRDD = oldRDD.map(r => c +: r)

But how do I append a custom column to an RDD? Something like:

val colToAppend = sc.makeRDD(1 to oldRDD.count().toInt)
// or sc.parallelize(1 to oldRDD.count().toInt)
// or (1 to oldRDD.count().toInt).toArray
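[For the record, a common way to get a row-ID column without accumulators or a count()-based join is RDD.zipWithIndex. A sketch, assuming a hypothetical oldRDD of Array[String]:]

```scala
// Sketch: append a row index to each row using zipWithIndex.
// zipWithIndex assigns contiguous Long indexes starting at 0,
// computed distributedly (it only needs per-partition counts).
val withIds: org.apache.spark.rdd.RDD[(Array[String], Long)] =
  oldRDD.zipWithIndex()

// Fold the index into the row itself, as a trailing String column.
val newRDD = withIds.map { case (row, id) => row :+ id.toString }
```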
Re: SQL can't not create Hive database
I think it uses a local dir; an HDFS dir path starts with hdfs://. Check the permissions on the folders, and also check the logs; there should be more info about the exception there. Best Bojan
Re: Caching and Actions
You can use toDebugString to see all the steps in a job. Best Bojan
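[As an illustration (the input path and pipeline here are made up), toDebugString prints the lineage of an RDD, with indentation marking shuffle boundaries:]

```scala
// A small word-count lineage over a hypothetical input file.
val counts = sc.textFile("/tmp/input.txt")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// Prints the dependency chain, e.g. ShuffledRDD <- MapPartitionsRDD <- ...
println(counts.toDebugString)
```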
Re: Add row IDs column to data frame
You could convert the DF to an RDD, add the new column in a map phase or in a join, and then convert back to a DF. I know this is not an elegant solution, and maybe it is not a solution at all. :) But it is the first thing that popped into my mind; I am also new to the DF API. Best Bojan On Apr 9, 2015 00:37, olegshirokikh [via Apache Spark User List] wrote: A more generic version of the question below: is it possible to append a column to an existing DataFrame at all? I understand that this is not an easy task in the Spark environment, but is there any workaround?
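[A minimal sketch of that round trip, assuming Spark 1.3's API, a SQLContext in scope, and a hypothetical input DataFrame df:]

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Drop to the RDD level, add the new value in a map, then rebuild the DF.
val rddWithId = df.rdd.zipWithIndex().map {
  case (row, id) => Row.fromSeq(row.toSeq :+ id)
}

// Extend the old schema with the new column and re-apply it.
val newSchema =
  StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
val dfWithId = sqlContext.createDataFrame(rddWithId, newSchema)
```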
Re: Spark 1.3 build with hive support fails
Try building with Scala 2.10. Best Bojan On Mar 31, 2015 01:51, nightwolf [via Apache Spark User List] wrote: I am having the same problems. Did you find a fix?
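[For reference, a build command along the lines of the Spark 1.3 build docs; the Hadoop profile and version below are examples and should match your cluster:]

```shell
# Spark 1.3 builds against Scala 2.10 by default; the -Phive and
# -Phive-thriftserver profiles enable Hive support and the JDBC/Thrift server.
mvn -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0 \
    -DskipTests clean package
```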
Re: Nested Case Classes (Found and Required Same)
Did you find any other way around this issue? I just found out that I have a 22-column data set, and now I am searching for the best solution. Has anyone else run into this problem? Best Bojan
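[One workaround for Scala 2.10's 22-field case class limit is to skip case classes entirely and specify the schema programmatically with StructType, as the Spark SQL programming guide describes. A sketch, assuming Spark 1.3's createDataFrame (applySchema in earlier releases); the column names and file path are made up:]

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a schema for an arbitrary number of columns without a case class.
val columnNames = (1 to 25).map(i => s"c$i") // 25 columns, beyond the 22 limit
val schema = StructType(
  columnNames.map(name => StructField(name, StringType, nullable = true)))

// Parse each line into a Row with the same arity as the schema.
val rowRDD = sc.textFile("/data/wide.tsv")
  .map(_.split("\t"))
  .map(fields => Row.fromSeq(fields))

val df = sqlContext.createDataFrame(rowRDD, schema)
```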
Re: Spark + Tableau
I finally solved the issue with the Spark-Tableau connection. Thanks to Denny Lee for the blog post: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql The solution was to use the authentication type Username, and then use the username for the metastore. Best regards Bojan
Re: SQL COUNT DISTINCT
Here is the link to the JIRA issue: https://issues.apache.org/jira/browse/SPARK-4243
Re: SQL COUNT DISTINCT
Hi Michael, Thanks for the response. I tested with the query you sent me, and it really does run faster. Old query stats by phase: 3.2 min, 17 s. Your query stats by phase: 0.3 s, 16 s, 20 s. But will this improvement also apply when you want to count distinct over 2 or more fields:

SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile

Should I still create a JIRA issue/improvement for this? @Nick That also makes sense, but should I just fetch the count of my data to the driver node? I just started to learn about Spark (and it is great), so sorry if I ask stupid questions. Best regards Bojan
SQL COUNT DISTINCT
While testing Spark SQL I noticed that COUNT DISTINCT runs really slowly. The map-partitions phase finishes fast, but the collect phase is slow and only runs on a single executor. Should it run this way? Here is the simple code I use for testing:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)

I guess this is because the distinct processing must happen on a single node, but I wonder whether I can add some parallelism to the collect process.
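[For what it's worth, a common workaround at the time was to rewrite the single-reducer COUNT(DISTINCT ...) as a distinct-then-count, so the de-duplication runs as a distributed shuffle. A sketch against the same temp table, assuming the setup above:]

```scala
// De-duplicate f2 across all executors first, then count the survivors;
// only the final small count is aggregated on one node.
val fast = sqlContext.sql(
  "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) t")
fast.collect().foreach(println)
```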
Spark + Tableau
I'm testing the beta driver from Databricks for Tableau, and unfortunately I've encountered some issues. While a beeline connection works without problems, Tableau can't connect to the Spark Thrift server.

Error from the driver (Tableau):

Unable to connect to the ODBC Data Source. Check that the necessary drivers are installed and that the connection properties are valid. [Simba][SparkODBC] (34) Error from Spark: ETIMEDOUT. Unable to connect to the server test.server.com. Check that the server is running and that you have access privileges to the requested database.

Exception on the Thrift server:

java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
    at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
    at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
    ... 4 more

Is there anyone else testing this driver, or has anyone seen this message?

Best regards Bojan Kostić
Re: Spark + Tableau
I use the beta Spark SQL ODBC driver from Databricks.
Re: Spark + Tableau
I'm connecting to it remotely with Tableau/beeline. On Thu Oct 30 16:51:13 2014 GMT+0100, Denny Lee [via Apache Spark User List] wrote: When you start the Thrift server service, are you connecting to it locally, or is it on a remote server when you use beeline and/or Tableau? On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic blood9ra...@gmail.com wrote: I use the beta Spark SQL ODBC driver from Databricks. -- Sent from my Jolla