RE: Equivalent of Redshift ListAgg function in Spark (PySpark)

2017-10-09 Thread Mahesh Sawaiker
After doing the group, you can use mkString on the data frame. Following is an example where all columns are concatenated with a space as the separator. scala> call_cdf.map(row => row.mkString(" ")).show(false)
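For a LISTAGG-style result (one concatenated string per group), a minimal sketch along the same lines, assuming a hypothetical DataFrame orders with columns customer_id and product:

  import org.apache.spark.sql.functions.{collect_list, concat_ws}

  // one row per customer_id, with the product values joined into one string
  val agg = orders
    .groupBy("customer_id")
    .agg(concat_ws(",", collect_list("product")).as("products"))
  agg.show(false)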

RE: SPARK Issue in Standalone cluster

2017-07-31 Thread Mahesh Sawaiker
Gourav, Riccardo’s answer is spot on. What is happening is that one Spark node is writing to its own local directory and telling a slave to read the data from there; when the slave goes to read it, the part is not found. Check the folder
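A minimal sketch of the fix, writing to storage that every node can reach rather than a worker-local directory (the HDFS path is hypothetical):

  // any shared filesystem (HDFS, NFS, S3) works; a file:// path on a
  // single worker does not, since the other nodes cannot see the parts
  df.write.parquet("hdfs://namenode:8020/user/spark/output")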

UI for spark machine learning.

2017-07-09 Thread Mahesh Sawaiker
Hi, 1) Is anyone aware of any workbench-like tool to run ML jobs in Spark? Specifically, the tool could be something like a web application that is configured to connect to a Spark cluster. The user is able to select input training sets, probably from HDFS, train, and then run predictions,

RE: PySpark working with Generators

2017-06-29 Thread Mahesh Sawaiker
Wouldn’t this work if you load the files into HDFS and set the number of partitions equal to the amount of parallelism you want?
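A minimal sketch of that idea (the path and partition count are hypothetical):

  // ask for at least 64 input partitions, i.e. 64 units of parallelism
  val lines = spark.sparkContext.textFile("hdfs:///data/input/*.txt", 64)
  println(lines.getNumPartitions)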

RE: using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed

2017-06-29 Thread Mahesh Sawaiker
You could copy the Spark folder to the home directory of each user and set a different SPARK_HOME for each one. Not sure what Derby is used for here, but you could try using MySQL instead (if it's for the Hive metastore).
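If it is the Hive metastore, a minimal hive-site.xml sketch pointing it at MySQL instead of the embedded Derby (host, database name, and credentials are hypothetical):

  <!-- goes in Spark's conf directory, replacing the Derby defaults -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>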

RE: HDP 2.5 - Python - Spark-On-Hbase

2017-06-25 Thread Mahesh Sawaiker
Ayan, The Logging class was moved between Spark 1.6 and Spark 2.0. It looks like you are trying to run 1.6 code on 2.0. I have ported some code like this before; if you have access to the source, you can recompile it by changing the reference to the Logging class and directly using the slf4j
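A minimal sketch of that port, replacing the removed org.apache.spark.Logging trait with a direct slf4j logger (the class names here are made up):

  import org.slf4j.{Logger, LoggerFactory}

  // drop-in replacement for the old org.apache.spark.Logging trait
  trait Logging {
    @transient lazy val log: Logger = LoggerFactory.getLogger(getClass)
  }

  class MyJob extends Logging {
    def run(): Unit = log.info("starting")
  }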

RE: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Mahesh Sawaiker
This has to do with how you are creating the timestamp object from the ResultSet (I guess). If you can provide more code it will help, but you could surround the parsing code with a try/catch and then just ignore the exception.
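A minimal sketch of the try/catch idea around the ResultSet read (the column name is hypothetical):

  import java.sql.{ResultSet, Timestamp}

  // returns None instead of failing when a value cannot be parsed
  def safeTimestamp(rs: ResultSet, column: String): Option[Timestamp] =
    try Option(rs.getTimestamp(column))
    catch { case _: Exception => None }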

RE: Using Spark as a simulator

2017-06-21 Thread Mahesh Sawaiker
Spark can help you create one large file if needed, but HDFS itself provides an abstraction over such things, so it's a trivial problem if anything. If you have Spark installed, then you can use spark-shell to try a few samples and build from there. If you can collect all the files in a
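A minimal sketch of collecting many small files into one (the paths are hypothetical):

  // read everything, then force a single output file
  val all = spark.read.text("hdfs:///sim/parts/*")
  all.coalesce(1).write.text("hdfs:///sim/merged")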

RE: Using Spark as a simulator

2017-06-20 Thread Mahesh Sawaiker
I have already seen an example where data is generated using Spark; no reason to think it's a bad idea as far as I know. You can check the code here; I'm not very sure, but I think there is something there which generates data for the TPC-DS benchmark, and you can specify how much data you want in
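Not the TPC-DS generator itself, but a minimal sketch of generating synthetic data with Spark (the schema and path are made up):

  import org.apache.spark.sql.functions.{col, rand}

  // one row per id; scale the row count to get the volume you want
  val df = spark.range(0, 1000000L)
    .withColumn("metric", rand(42))
    .withColumn("bucket", col("id") % 10)
  df.write.parquet("hdfs:///bench/generated")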

RE: What is the charting library used by Databricks UI?

2017-06-16 Thread Mahesh Sawaiker
Is there a live URL on the internet where I can see the UI? I could help by checking the JS code in Firebug.

RE: The following Error seems to happen once in every ten minutes (Spark Structured Streaming)?

2017-05-31 Thread Mahesh Sawaiker
Your data node(s) are going down for some reason; check the datanode logs and fix the underlying issue causing them to go down. There should be no need to delete any data; just restarting the data nodes should do the trick for you.

RE: Spark sql with Zeppelin, Task not serializable error when I try to cache the spark sql table

2017-05-31 Thread Mahesh Sawaiker
It’s because the class in which you have defined the UDF is not serializable. Declare the UDF in a class and make that class serializable.
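A minimal sketch of that arrangement (the object, UDF, and column names are made up):

  import org.apache.spark.sql.functions.udf

  // an object (or a class extending Serializable) that owns the UDF,
  // so the closure can be shipped to the executors
  object Udfs extends Serializable {
    val upper = udf((s: String) => if (s == null) null else s.toUpperCase)
  }

  // usage: df.withColumn("name_uc", Udfs.upper(col("name")))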