Re: Spark SQL doesn't support column names that contain '-','$'...

2015-12-05 Thread manasdebashiskar
Try `X-P-S-T` (with backticks). ..Manas
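A minimal sketch of the suggestion (the table name and surrounding code are assumptions; Spark 1.5-era API), showing how backticks let Spark SQL reference a hyphenated column:

    // Assumes a DataFrame df that has a column literally named "X-P-S-T".
    df.registerTempTable("readings")
    sqlContext.sql("SELECT `X-P-S-T` FROM readings WHERE `X-P-S-T` IS NOT NULL").show()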

Re: how to make a Spark Streaming application start working on the next batch before completing the previous batch

2015-12-05 Thread manasdebashiskar
I don't think it is possible that way. Spark Streaming is a mini-batch processing system. If processing the contents of two batches together is your objective, what you can do is: 1) keep a cache (or two) representing the previous batch(es); 2) have every new batch replace the old cache, shifting it back by one time slot; 3) you
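A hedged sketch of that caching idea (the variable names, key/value types, and comparison logic are assumptions, not the poster's code):

    // Keep the previous micro-batch cached so each new batch can be processed against it.
    var previousBatch: Option[org.apache.spark.rdd.RDD[(String, Int)]] = None
    stream.foreachRDD { rdd =>
      val current = rdd.cache()
      previousBatch.foreach { prev =>
        current.join(prev).count()   // placeholder for whatever cross-batch logic is needed
      }
      previousBatch.foreach(_.unpersist())
      previousBatch = Some(current)  // slide the "cache" forward by one time slot
    }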

Re: Effective ways to monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-05 Thread manasdebashiskar
Spark has the capability to report to Ganglia, Graphite, or JMX. If none of those works for you, you can register your own extra SparkListener that does your bidding. ..Manas
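A hedged sketch of such a custom listener (the class name and the alerting logic are assumptions; wiring it up through sc.addSparkListener or the spark.extraListeners setting is the usual route):

    import org.apache.spark.scheduler.{JobFailed, SparkListener, SparkListenerJobEnd}

    // Reacts to failed jobs; replace println with whatever alerting you use.
    class FailureAlertListener extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobEnd.jobResult match {
        case JobFailed(e) => println(s"Job ${jobEnd.jobId} failed: ${e.getMessage}")
        case _            => // successful jobs need no action here
      }
    }

    sc.addSparkListener(new FailureAlertListener)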

Re: Spark 1.5.2 getting stuck when reading from HDFS in YARN client mode

2015-12-05 Thread manasdebashiskar
You can check the executor thread dump to see what your "stuck" executors are doing. If you are running some monitoring tool, you can check whether there is heavy IO (or network usage) going on during that time. A little more info on what you are trying to do would help. ..manas

Re: Fwd: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-05 Thread manasdebashiskar
sortByKey is available on paired RDDs. So if your RDD can be implicitly converted to a paired RDD, you can call sortByKey on it. This magic is implicitly available if you import org.apache.spark.SparkContext._
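A minimal sketch of that implicit conversion in action (the sample data is made up):

    import org.apache.spark.SparkContext._   // brings in the implicit conversion to pair-RDD functions

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    val sorted = pairs.sortByKey()           // available because pairs is an RDD[(K, V)]
    sorted.collect().foreach(println)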

Re: maven built the spark-1.5.2 source documents, but error happened

2015-12-05 Thread manasdebashiskar
What information do you get when you run with the -X flag? A few notes before you start building: 1) use the latest Maven; 2) set JAVA_HOME explicitly, just in case. An example would be JAVA_HOME=/usr/lib/jvm/java-8-oracle ./make-distribution.sh --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0

spark streaming in python: questions about countByValue and countByValueAndWindow

2015-12-05 Thread krist333
I'm new to PySpark and spark streaming. According to the documentation about countByValue and countByValueAndWindow

Re: how to judge whether a DStream is empty or not after a filter operation, so that it returns a boolean result

2015-12-05 Thread manasdebashiskar
Usually, while processing a DStream, one uses foreachRDD. foreachRDD gives you an RDD to deal with, which has an isEmpty method that you can use. ..Manas
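A minimal sketch of that pattern (the stream name and the per-batch processing are assumptions):

    filteredStream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("this batch is empty after the filter")
      } else {
        rdd.take(10).foreach(println)   // placeholder for the real per-batch processing
      }
    }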

Parquet runs out of memory when reading in a huge matrix

2015-12-05 Thread AlexG
I am trying to multiply against a large matrix that is stored in Parquet format, so I am being careful not to store the RDD in memory, but I am getting an OOM error from the Parquet reader: 15/12/06 05:23:36 WARN TaskSetManager: Lost task 950.0 in stage 4.0 (TID 28398, 172.31.34.233):

Re: Obtaining Job Id for query submitted via Spark Thrift Server

2015-12-05 Thread manasdebashiskar
The Spark UI has a great REST API set: http://spark.apache.org/docs/latest/monitoring.html If you know your application id, the rest should be easy. ..Manas

Re: partition RDD of images

2015-12-05 Thread manasdebashiskar
You can use a custom partitioner if your need is specific in any way. If you care about ordering, you can zipWithIndex your RDD and decide based on the sequence of the message. The following partitioner should work for you (completed in the sketch below): class ExactPartitioner[V]( partitions: Int, elements: Int)
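A hedged completion of that truncated partitioner, assuming keys are integer indices running from 0 to elements-1 (the getPartition logic is an assumption, not the poster's original code):

    import org.apache.spark.Partitioner

    class ExactPartitioner[V](partitions: Int, elements: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Int]      // assumes keys are 0 .. elements-1
        k * partitions / elements          // spread them evenly and in order
      }
    }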

Re: Experiences about NoSQL databases with Spark

2015-12-05 Thread manasdebashiskar
Depends on your need. Have you looked at Elasticsearch, Accumulo, or Cassandra? If post-processing of your data is not your motive and you just want to retrieve the data later, Greenplum (based on PostgreSQL) can be an alternative. In short, there are many NoSQL stores out there, each having

Re: [streaming] KafkaUtils.createDirectStream - how to start streming from checkpoints?

2015-12-05 Thread manasdebashiskar
When you enable checkpointing, your offsets get written to ZooKeeper. If your program dies or shuts down and is later restarted, the Kafka direct-stream API knows where to start by looking at those offsets in ZooKeeper. This is as easy as it gets. However, if you are planning to re-use the same checkpoint
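A hedged sketch of the usual restart-from-checkpoint pattern with the direct stream (the paths, broker, topic, and batch interval are assumptions, not from this thread):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoints"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("direct-stream-example")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))
      stream.map(_._2).count().print()
      ssc
    }

    // On restart, getOrCreate rebuilds the context (and stream state) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()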

Re: Obtaining Job Id for query submitted via Spark Thrift Server

2015-12-05 Thread Jagrut Sharma
Thanks. I'm using the REST API to get the status of all jobs run under an application ID: /applications/[app-id]/jobs. The other way to use the REST API is /applications/[app-id]/jobs/[job-id]. This should be more efficient since it would only return the status of a specific job. However, the job id
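A minimal sketch of hitting those endpoints from Scala (host, port, application id, and job id are assumptions; the REST API is served under /api/v1 on the UI or history-server port):

    import scala.io.Source

    val appId = "app-20151205120000-0001"   // hypothetical application id
    val allJobs = Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/jobs").mkString
    val oneJob  = Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/jobs/3").mkString
    println(allJobs)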

Re: tmp directory

2015-12-05 Thread manasdebashiskar
If you look at your Spark UI -> Environment, you can see which paths point to /tmp. Typically the Java temp folder is also mapped to /tmp, which can be overridden by a Java option. Spark logs go in the /var/run/spark/work/... folder, but I think you already know that. ..Manas
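A hedged sketch of overriding those locations (the directories are assumptions; spark.local.dir and the extraJavaOptions settings are the usual knobs):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.local.dir", "/data/spark-scratch")                               // hypothetical path
      .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/data/jvm-tmp")    // hypothetical path
      .set("spark.driver.extraJavaOptions", "-Djava.io.tmpdir=/data/jvm-tmp")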

Re: Scala 2.11 and Akka 2.4.0

2015-12-05 Thread manasdebashiskar
There are steps to build Spark with Scala 2.11 in the Spark docs. The first step is ./dev/change-scala-version.sh 2.11, which changes the Scala version to 2.11. I have not tried compiling Spark with Akka 2.4.0. ..Manas

Re: Experiences about NoSQL databases with Spark

2015-12-05 Thread Nick Pentreath
I've had great success using Elasticsearch with Spark - the integration works well (both ways, reading and indexing), and ES + Kibana makes a powerful event / time-series storage, aggregation, and data visualization stack.
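A hedged sketch of the write-side integration (assumes the elasticsearch-spark connector is on the classpath; the index/type and fields are made up):

    import org.elasticsearch.spark._

    val events = sc.parallelize(Seq(
      Map("user" -> "alice", "ts" -> 1449302400L),
      Map("user" -> "bob",   "ts" -> 1449302460L)))
    events.saveToEs("events/log")   // writes the RDD into the "events" index, "log" type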

Re: Spark checkpointing - restrict checkpointing to local file system?

2015-12-05 Thread manasdebashiskar
Try file:///path. An example would be file://localhost/tmp/text.txt or file://192.168.1.25/tmp/text.txt ...manas
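A minimal sketch of pointing a streaming checkpoint directory at the local file system with such a URI (the path is an assumption):

    ssc.checkpoint("file:///tmp/spark-checkpoints")   // ssc is the active StreamingContext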

Re: Improve saveAsTextFile performance

2015-12-05 Thread Ram VISWANADHA
>If you are doing a join/groupBy kind of operations then you need to make sure >the keys are evenly distributed throughout the partitions. Yes, I am doing join/groupBy operations. Can you point me to docs on how to do this? Spark 1.5.2. First attempt: Aggregated Metrics by Executor - Executor ID

Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
Benjamin, thanks a lot! -- Be well! Jean Morozov On Sat, Dec 5, 2015 at 3:46 PM, Benjamin Fradet wrote: > Hi, > > To get back the original labels after indexing them with StringIndexer, I > usually use IndexToString >

Re: Spark ML Random Forest output.

2015-12-05 Thread Benjamin Fradet
Hi, To get back the original labels after indexing them with StringIndexer, I usually use IndexToString to retrieve my original labels like so: val labelIndexer = new StringIndexer()
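A hedged completion of that pattern (column names are assumptions), pairing StringIndexer with IndexToString so predictions map back to the original labels:

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)                        // data is the training DataFrame

    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)   // reuse the fitted index -> label mapping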

Re: MLlib training time question

2015-12-05 Thread Yanbo Liang
Hi Haoyue, Could you find the time spent on each stage of the LinearRegression model training in the Spark UI? It can tell us which stage is the most time-consuming and help us analyze the cause. Yanbo 2015-12-05 15:14 GMT+08:00 Haoyue Wang : > Hi all, > I'm doing some

Possible bug in Spark 1.5.0 onwards while loading Postgres JDBC driver

2015-12-05 Thread manasdebashiskar
Hi All, Has anyone tried using a user-defined database API for Postgres on Spark 1.5.0 onwards? I have a build that uses Spark = 1.5.1, ScalikeJDBC = 2.3+, Postgres driver = postgresql-9.3-1102-jdbc41.jar. The Spark SQL API to write a DataFrame to Postgres works. But writing a Spark RDD to Postgres using
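For reference, a minimal sketch of the DataFrame JDBC write that the poster says works (connection details and names are assumptions):

    import java.util.Properties

    // df is the DataFrame being written; URL, table, and credentials are hypothetical.
    val props = new Properties()
    props.setProperty("user", "postgres")
    props.setProperty("password", "secret")
    props.setProperty("driver", "org.postgresql.Driver")

    df.write.jdbc("jdbc:postgresql://localhost:5432/mydb", "my_table", props)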

Re: Improve saveAsTextFile performance

2015-12-05 Thread Ram VISWANADHA
I tried partitionBy with a HashPartitioner; still the same issue. groupBy operation: https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L51 Join operation: https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L80 Best Regards,

Re: sparkavro for PySpark 1.3

2015-12-05 Thread YaoPau
Here's what I'm currently trying: -- I'm including --packages com.databricks:spark-avro_2.10:1.0.0 in my pyspark call. This seems to work: Ivy Default Cache set to: /home/jrgregg/.ivy2/cache The jars for the packages stored in: /home/jrgregg/.ivy2/jars ::

Dataset and lambas

2015-12-05 Thread Koert Kuipers
Hello all, DataFrame internally uses a different encoding for values than what the user sees. I assume the same is true for Dataset? If so, does this mean that a function like Dataset.map needs to convert all the values twice (once to the user format and then back to the internal format)? Or is it

Re: Testing with spark testing base

2015-12-05 Thread Silvio Fiorito
Yes, with IntelliJ you can set up a ScalaTest run configuration. You can also run directly from the sbt CLI by running "sbt test".

Testing with spark testing base

2015-12-05 Thread Masf
Hi. I'm testing "spark-testing-base". For example: class MyFirstTest extends FunSuite with SharedSparkContext { def tokenize(f: RDD[String]) = { f.map(_.split(" ").toList) } test("really simple transformation") { val input = List("hi", "hi miguel", "bye") val expected =
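A hedged completion of that truncated test, assuming the expected value follows the usual spark-testing-base example (whitespace tokenization):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.apache.spark.rdd.RDD
    import org.scalatest.FunSuite

    class MyFirstTest extends FunSuite with SharedSparkContext {
      def tokenize(f: RDD[String]) = f.map(_.split(" ").toList)

      test("really simple transformation") {
        val input = List("hi", "hi miguel", "bye")
        val expected = List(List("hi"), List("hi", "miguel"), List("bye"))
        assert(tokenize(sc.parallelize(input)).collect().toList === expected)
      }
    }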

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-05 Thread Michael Armbrust
> Follow-up question in this case: what is the cost of registering a temp table? Is there a limit to the number of temp tables that can be registered by a Spark context?
It is pretty cheap. Just an entry in an in-memory hashtable mapping to a query plan (similar to a view).
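A minimal sketch of what is being registered (the input path and names are assumptions); registerTempTable just maps a name to the DataFrame's logical plan:

    val df = sqlContext.read.json("people.json")    // hypothetical input
    df.registerTempTable("people")                  // cheap: a name -> query-plan entry
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()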

Re: Exception in thread "main" java.lang.IncompatibleClassChangeError:

2015-12-05 Thread Michael Armbrust
It seems likely you have conflicting versions of Hadoop on your classpath. On Fri, Dec 4, 2015 at 2:52 PM, Prem Sure wrote: > Getting the below exception while executing the below program in Eclipse. > Any clue on what's wrong here would be helpful. > public class WordCount {

Re: spark.authenticate=true YARN mode doesn't work

2015-12-05 Thread Marcelo Vanzin
Hi Prasad, please reply to the list so that others can benefit / help. On Sat, Dec 5, 2015 at 4:06 PM, Prasad Reddy wrote: > Have you had a chance to try this authentication for any of your projects earlier? Yes, we run with authenticate=true by default. It works fine.

ClassCastException in Kryo Serialization

2015-12-05 Thread SRK
Hi, I seem to be getting a ClassCastException in Kryo serialization. Following is the error. The Child1 class is a map in the parent class. Child1 has a HashSet testObjects of type Object1. I get an error when it tries to deserialize Object1. Any idea as to why this is happening?

Re: Dataset and lambas

2015-12-05 Thread Deenar Toraskar
Hi Michael, On a similar note, what is involved in getting native support for some user-defined functions, so that they are as efficient as native Spark SQL expressions? I had one particular one in mind - an arraySum (element-wise sum) that is heavily used in a lot of risk analytics. Deenar On 5

How to identify total schedule delay in a Streaming app using Ganglia?

2015-12-05 Thread SRK
Hi, How can I identify the total scheduling delay in a Streaming app using Ganglia? Any sample code to integrate Ganglia with a Streaming app to report the time delay and the number of batches failed in the last 5 minutes would be of great help. Thanks, Swetha

Re: Spark ML Random Forest output.

2015-12-05 Thread Eugene Morozov
Vishnu, thanks for the response. The problem is that I actually do not have the indexed labels; they are hidden in the DataFrame as metadata, and anyone who'd like to use them has to apply an ugly hack. The issue might be even worse in case I serialize my model into a file for delayed use. When

Re: the way to compare any two adjacent elements in one rdd

2015-12-05 Thread Zhiliang Zhu
For this, mapPartitionsWithIndex would also work properly for filtering. Here is code copied from Stack Overflow, which is used to remove the first line of a CSV file: JavaRDD rawInputRdd = sparkContext.textFile(dataFile); Function2 removeHeader = new Function2
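The Java snippet above is truncated; a hedged Scala sketch of the same idea (dropping the header line in the first partition):

    val rawInputRdd = sparkContext.textFile(dataFile)
    val noHeader = rawInputRdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter   // only the first partition holds the header
    }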

8080 not working

2015-12-05 Thread Chintan Bhatt
Hi, I want to load my dataset in the form of a CSV file. http://127.0.0.1:8080/ isn't working in Hortonworks. Can anyone tell me the solution for that? -- CHINTAN BHATT Assistant Professor, U & P U Patel Department of Computer Engineering, Chandubhai

Re: 8080 not working

2015-12-05 Thread Akhil Das
Can you provide more information? Do you have a Spark standalone cluster running? If not, you won't be able to access 8080. If you want to load file data, you can open up the spark-shell and type val rawData = sc.textFile("/path/to/your/file") You can also use the

Re: Improve saveAsTextFile performance

2015-12-05 Thread Akhil Das
Which version of Spark are you using? Can you look at the event timeline and the DAG of the job and see where it's spending more time? .save simply triggers your entire pipeline. If you are doing join/groupBy kinds of operations, then you need to make sure the keys are evenly distributed throughout
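One common way to even out skewed keys before a heavy aggregation is to "salt" them; a hedged sketch (the salt factor and key/value types are assumptions, and this is a general technique rather than anything prescribed in this thread):

    import scala.util.Random

    // Split each hot key across 10 salted sub-keys, aggregate partially, then combine.
    val salted  = pairs.map { case (k, v) => ((k, Random.nextInt(10)), v) }
    val partial = salted.reduceByKey(_ + _)
    val result  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)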

Re: Broadcasting a parquet file using spark and python

2015-12-05 Thread Michael Armbrust
I believe we started supporting broadcast outer joins in Spark 1.5. Which version are you using? On Fri, Dec 4, 2015 at 2:49 PM, Shuai Zheng wrote: > Hi all, > Sorry to re-open this thread. > I have a similar issue: one big Parquet file left outer join quite
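A hedged sketch of hinting a broadcast left outer join in Spark 1.5+ (the DataFrame names and join key are assumptions):

    import org.apache.spark.sql.functions.broadcast

    // Marks the small side for broadcast so the outer join avoids shuffling the big side.
    val joined = big.join(broadcast(small), big("id") === small("id"), "left_outer")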

Re: spark.authenticate=true YARN mode doesn't work

2015-12-05 Thread Marcelo Vanzin
On Fri, Dec 4, 2015 at 5:47 PM, prasadreddy wrote: > I am running Spark on YARN and trying to enable authentication by setting > spark.authenticate=true. After enabling authentication I am not able to run > Spark word count or any other programs. Define "I am not able to run".

Re: Dataset and lambas

2015-12-05 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote: > hello all, > DataFrame internally uses a different encoding for values than what the > user sees. I assume the same is true for Dataset? This is true. We encode objects in the Tungsten binary format using code
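A hedged sketch of the Dataset API being discussed (Spark 1.6-style; the case class and data are made up), where an implicit Encoder handles the conversion between user objects and the internal Tungsten format:

    import sqlContext.implicits._   // brings in encoders for case classes and primitives

    case class Person(name: String, age: Int)
    val ds = Seq(Person("alice", 29), Person("bob", 31)).toDS()
    val names = ds.filter(_.age >= 30).map(_.name)   // each lambda sees Person objects, not internal rows
    names.collect().foreach(println)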

Re: sparkavro for PySpark 1.3

2015-12-05 Thread ayan guha
Hi, You can use newAPIHadoopFile with AvroInputFormat. On Sun, Dec 6, 2015 at 4:59 AM, YaoPau wrote: > Here's what I'm currently trying: > > -- > I'm including --packages com.databricks:spark-avro_2.10:1.0.0 in my pyspark > call. This

How to debug Spark source using IntelliJ/ Eclipse

2015-12-05 Thread jatinganhotra
Hi, I am trying to understand Spark's internal code and wanted to debug the Spark source in order to add a new feature. I have tried the steps outlined on the Spark wiki page "IDE setup", but they