Spark SQL - JDBC connectivity

2016-08-09 Thread Soni spark
Hi, I would like to know the steps to connect to Spark SQL from the Spring framework (Web-UI), and also how to run and deploy the web application?

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi Siva, Does the topic have partitions? Which version of Spark are you using? On Wed, Aug 10, 2016 at 2:38 AM, Sivakumaran S wrote: > Hi, > > Here is a working example I did. > > HTH > > Regards, > > Sivakumaran S > > val topics = "test" > val brokers = "localhost:9092" > val

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
It stops working at sqlContext.read.json(rdd.map(_._2)). Topics without partitions work fine. Do I need to set any other configs? val kafkaParams = Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092", " group.id" -> "xyz","auto.offset.reset"->"smallest") Spark version is

Re: Get distinct column data from grouped data

2016-08-09 Thread Selvam Raman
My friend suggested this way: val fil = sc.textFile("hdfs:///user/vijayc/data/test-spk.tx") val res = fil.map(l => l.split(",")).map(l => (l(0),l(1))).groupByKey.map(rd => (rd._1, rd._2.toList.distinct)) Another useful function is *collect_set* in DataFrame. Thanks, selvam R On Tue, Aug 9, 2016
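
As a rough illustration of the collect_set approach mentioned above - a minimal sketch using the Spark 2.0 DataFrame API, with the sample pairs borrowed from the original question (nothing here is the poster's actual code):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_set

    val spark = SparkSession.builder.appName("collect-set-example").getOrCreate()
    import spark.implicits._

    // Sample pairs mirroring the example in the original question.
    val df = Seq(("sel1", "test"), ("sel1", "test"), ("sel1", "ok"),
                 ("sel2", "ok"), ("sel2", "test")).toDF("key", "value")

    // collect_set gathers the distinct values per key into an array column.
    df.groupBy("key")
      .agg(collect_set("value").alias("distinct_values"))
      .show()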

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
No, you don't need a conditional. read.json on an empty rdd will return an empty dataframe. Foreach on an empty dataframe or an empty rdd won't do anything (a task will still get run, but it won't do anything). Leave the conditional out. Add one thing at a time to the working rdd.foreach
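
A minimal sketch of what is being suggested - drop the emptiness check and let empty batches flow through; it assumes kafkaStream, the DStream[(String, String)] built in the code quoted later in this thread:

    // Assumes kafkaStream is the DStream[(String, String)] set up elsewhere in this thread.
    kafkaStream.foreachRDD { rdd =>
      val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
      // read.json on an empty RDD simply returns an empty DataFrame, and
      // foreach on an empty DataFrame is a no-op, so no isEmpty check is needed.
      val df = sqlContext.read.json(rdd.map(_._2))
      df.foreach(row => println(row))
    }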

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi Cody, Without the conditional it is working fine. But any processing inside the conditional gets stuck waiting (or something). I am facing this issue with partitioned topics. I would need the conditional to skip processing when the batch is empty. kafkaStream.foreachRDD( rdd => { val dataFrame =

UNSUBSCRIBE

2016-08-09 Thread James Ding

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Cool, learn something new every day. Thanks again. On Tue, Aug 9, 2016 at 4:08 PM ayan guha wrote: > Thanks for reporting back. Glad it worked for you. Actually sum with > partitioning behaviour is same in oracle too. > On 10 Aug 2016 03:01, "Jon Barksdale"

Re: spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Hi, The problem has been solved simply by updating the Scala SDK version from the incompatible 2.10.x to the correct version 2.11.x. From: Michael Jay Sent: Tuesday, August 9, 2016 10:11:12 PM To: user@spark.apache.org Subject: spark 2.0 in

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread ayan guha
Thanks for reporting back. Glad it worked for you. Actually, sum with partitioning behaviour is the same in Oracle too. On 10 Aug 2016 03:01, "Jon Barksdale" wrote: > Hi Santoshakhilesh, > > I'd seen that already, but I was trying to avoid using rdds to perform > this

Re: DataFrame equivalent to RDD.partitionByKey

2016-08-09 Thread Davies Liu
I think you are looking for `def repartition(numPartitions: Int, partitionExprs: Column*)`. On Tue, Aug 9, 2016 at 9:36 AM, Stephen Fletcher wrote: > Is there a DataFrameReader equivalent to the RDD's partitionByKey for RDD? > I'm reading data from a file data source
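
A small sketch of the repartition-by-column variant Davies points to; the column name "id", the partition count and the input path are placeholders, not taken from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("repartition-example").getOrCreate()

    // Hash-partition the rows by the "id" column into 200 partitions, so rows
    // sharing an id land in the same partition (the DataFrame analogue of
    // keying an RDD before a downstream join or aggregation).
    val df = spark.read.parquet("/path/to/input")
    val partitioned = df.repartition(200, df("id"))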

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Mich Talebzadeh
Hi, Is this table created as an external table in Hive? Do you see the data through spark-sql or the Hive Thrift Server? There is an issue with Zeppelin seeing data when connecting to the Spark Thrift Server: rows display null values. HTH Dr Mich Talebzadeh LinkedIn *

Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Davies Liu
Can you get all the fields back using Scala or SQL (bin/spark-sql)? On Tue, Aug 9, 2016 at 2:32 PM, cdecleene wrote: > Some details of an example table hive table that spark 2.0 could not read... > > SerDe Library: > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

Re: Spark on mesos in docker not getting parameters

2016-08-09 Thread Michael Gummelt
> However, they are missing in subsequent child processes and the final java process started doesn't contain them either. I don't see any evidence of this in your process list. `launcher.Main` is not the final java process. `launcher.Main` prints a java command, which `spark-class` then runs.

Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread cdecleene
Some details of an example table hive table that spark 2.0 could not read... SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat OutputFormat:

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Can you give some outline as to what you mean? Should I broadcast a dataframe, and register the broadcasted df as a temp table? And then use a lookup UDF in a SELECT query? I've managed to get it working by loading the 1.5GB dataset into an embedded redis instance on the driver, and used a

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
Thanks, that makes sense. So it must be that this queue - which is kept because of the UDF - is the one running out of memory, because without the UDF field there is no out-of-memory error, and the UDF field is pretty small; it is unlikely that it would take us above the memory limit. In either case,

UNSUBSCRIBE

2016-08-09 Thread abhishek singh

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Sivakumaran S
Hi, Here is a working example I did. HTH Regards, Sivakumaran S val topics = "test" val brokers = "localhost:9092" val topicsSet = topics.split(",").toSet val sparkConf = new SparkConf().setAppName("KafkaWeatherCalc").setMaster("local") //spark://localhost:7077 val sc = new
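
The quoted example is truncated above. A minimal direct-stream skeleton along the same lines (Spark 1.6 / Kafka 0.8 API; the broker, topic and batch interval are illustrative, not the original poster's values) could look like this:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val sparkConf = new SparkConf().setAppName("KafkaWeatherCalc").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    val topicsSet = "test".split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")

    // Direct stream over all partitions of the topic; each record is (key, value).
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    kafkaStream.foreachRDD(rdd => rdd.foreach(println))

    ssc.start()
    ssc.awaitTermination()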

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Davies Liu
When you have a Python UDF, only the inputs to the UDF are passed into the Python process, but all other fields that are used together with the result of the UDF are kept in a queue and then joined with the result from Python. The length of this queue depends on the number of rows being processed by Python (or

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
Take out the conditional and the sqlContext and just do rdd => { rdd.foreach(println) as a baseline to see if you're reading the data you expect On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi wrote: > Hi, > > I am reading json messages from kafka . Topics

Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi, I am reading json messages from kafka. The topic has 2 partitions. When running the streaming job using spark-submit, I could see that * val dataFrame = sqlContext.read.json(rdd.map(_._2)) *executes indefinitely. Am I doing something wrong here? Below is the code. This environment is the Cloudera sandbox

Re: Writing all values for same key to one file

2016-08-09 Thread neil90
Why not just create partitions for the key you want to group by and save it in there? Appending to a file already written to HDFS isn't the best idea IMO.
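
A sketch of the partitioned-write approach described here; the DataFrame, column name and output path are assumptions for illustration:

    // Assumes df holds the (key, value) records to be saved.
    // partitionBy writes one directory per distinct key (key=<value>/part-...),
    // so all values for a key end up together without appending to existing files.
    df.write
      .partitionBy("key")
      .mode("overwrite")
      .parquet("hdfs:///path/to/output")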

Re: Spark join and large temp files

2016-08-09 Thread Gourav Sengupta
In case of skewed data the joins will mess things up. Try to write a UDF with the lookup on broadcast variable and then let me know the results. It should not take more than 40 mins in a 32 GB RAM system with 6 core processors. Gourav On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab
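
A rough sketch of the broadcast-lookup UDF idea (the column names, the shape of the lookup table and the null handling are assumptions; bigDF and smallDF stand for the two tables in the thread):

    import org.apache.spark.sql.functions.udf

    // Collect the small lookup table to the driver and broadcast it as a Map.
    val lookup: Map[String, String] = smallDF.rdd
      .map(r => (r.getString(0), r.getString(1)))
      .collectAsMap()
      .toMap
    val bcLookup = spark.sparkContext.broadcast(lookup)

    // The UDF resolves the value from the broadcast map, replacing the shuffle join.
    val lookupUdf = udf((id: String) => bcLookup.value.get(id).orNull)

    val joined = bigDF.withColumn("smallValue", lookupUdf(bigDF("id")))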

spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Dear all, I am a newbie to Spark. Currently I am trying to import the source code of Spark 2.0 as a module into an existing client project. I have imported Spark-core, Spark-sql and Spark-catalyst as Maven dependencies in this client project. During compilation, errors such as missing

Unsubscribe.

2016-08-09 Thread Martin Somers
Unsubscribe. Thanks M

Unsubscribe

2016-08-09 Thread Hogancamp, Aaron
Unsubscribe. Thanks, Aaron Hogancamp Data Scientist

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
alrighty then! bcc'ing user list. cc'ing dev list. @user list people: do not read any further or you will be in violation of ASF policies! On Tue, Aug 9, 2016 at 11:50 AM, Mark Hamstra wrote: > That's not going to happen on the user list, since that is against ASF >

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Sean Owen
Nightlies are built and made available in the ASF snapshot repo, from master. This is noted at the bottom of the downloads page, and at https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-NightlyBuilds . This hasn't changed in as long as I can recall.

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
That's not going to happen on the user list, since that is against ASF policy (http://www.apache.org/dev/release.html): During the process of developing software and preparing a release, various > packages are made available to the developer community for testing > purposes. Do not include any

Re: Spark Job Doesn't End on Mesos

2016-08-09 Thread Michael Gummelt
Is this a new issue? What version of Spark? What version of Mesos/libmesos? Can you run the job with debug logging turned on and attach the output? Do you see the corresponding message in the mesos master that indicates it received the teardown? On Tue, Aug 9, 2016 at 1:28 AM, Todd Leo

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
this is a valid question. there are many people building products and tooling on top of spark and would like access to the latest snapshots and such. today's ink is yesterday's news to these people - including myself. what is the best way to get snapshot releases including nightly and

Unsubscribe

2016-08-09 Thread Aakash Basu

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
> Does this mean you only have 1.6G memory for executor (others left for Python) ? > The cached table could take 1.5G, it means almost nothing left for other things. True. I have also tried with memoryOverhead being set to 800 (10% of the 8Gb memory), but no difference. The "GC overhead limit

Sparkling Water (Spark 1.6.0 + H2O 3.8.2.6) on CDH 5.7.1

2016-08-09 Thread RK Aduri
All, I ran into one strange issue. If I initialize an H2O context and start it (NOT using it anywhere), a count action on a Spark data frame results in an error. The same count action on the Spark data frame works fine when the H2O context is not initialized. hc =

Spark on mesos in docker not getting parameters

2016-08-09 Thread Jim Carroll
I'm running spark 2.0.0 on Mesos using spark.mesos.executor.docker.image to point to a docker container that I built with the Spark installation. Everything is working except the Spark client process that's started inside the container doesn't get any of my parameters I set in the spark config in

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Hi Mich, Hardware: AWS EMR cluster with 15 nodes with Rx3.2xlarge (CPU, RAM fine, disk a couple of hundred gig). When we do: onPointFiveTB.join(onePointFiveGig.cache(), "id") we're seeing that the temp directory is filling up fast, until a node gets killed. And then everything dies. -Ashic.

Re: Cumulative Sum function using Dataset API

2016-08-09 Thread Jon Barksdale
Hi Santoshakhilesh, I'd seen that already, but I was trying to avoid using rdds to perform this calculation. @Ayan, it seems I was mistaken, and doing a sum(b) over(order by b) totally works. I guess I expected the windowing with sum to work more like oracle. Thanks for the suggestion :)
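
For reference, a small sketch of the windowed cumulative sum discussed in this thread (the column names "a" and "b" and the DataFrame df are illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // With an ORDER BY clause the window frame defaults to "unbounded preceding
    // to current row", so sum(...) becomes a running total; ties on b share the
    // same value, which matches Oracle's behaviour as noted above.
    val w = Window.partitionBy("a").orderBy("b")
    val withCumSum = df.withColumn("cum_sum", sum("b").over(w))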

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mich Talebzadeh
LOL Ink has not dried on Spark 2 yet so to speak :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Mark Hamstra
What are you expecting to find? There currently are no releases beyond Spark 2.0.0. On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote: > If we want to use versions of Spark beyond the official 2.0.0 release, > specifically on Maven + Java, what steps should we take to

Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Jestin Ma
If we want to use versions of Spark beyond the official 2.0.0 release, specifically on Maven + Java, what steps should we take to upgrade? I can't find the newer versions on Maven central. Thank you! Jestin

DataFrame equivalent to RDD.partitionByKey

2016-08-09 Thread Stephen Fletcher
Is there a DataFrameReader equivalent to the RDD's partitionByKey? I'm reading data from a file data source and I want the data I'm reading in to be partitioned the same way as the data I'm processing through a Spark Streaming RDD in the same process.

Re: Spark join and large temp files

2016-08-09 Thread Mich Talebzadeh
Hi Sam, What is your Spark hardware spec: number of nodes, RAM per node and disks, please? I don't understand; this should not really be an issue. Under the bonnet it is a hash join. The small table, I gather, can be cached and the big table will do multiple passes using the temp space. HTH Dr

Spark 1.6.1 and regexp_replace

2016-08-09 Thread Andrés Ivaldi
I'm seeing strange behaviour with regular expression replace. I'm trying to remove the spaces with trim and also collapse runs of more than one space down to a single space. Given a string like " A B ", with trim only I get "A B", which is perfect; if I add regexp_replace I get " A B". Text1
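
A small sketch of combining regexp_replace with trim to collapse inner whitespace; the column name "Text1" is taken from the snippet above, the rest is illustrative:

    import org.apache.spark.sql.functions.{col, regexp_replace, trim}

    // Collapse any run of whitespace to a single space, then strip the ends,
    // so a value like "  A   B  " becomes "A B".
    val cleaned = df.withColumn(
      "Text1_clean",
      trim(regexp_replace(col("Text1"), "\\s+", " ")))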

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Hi Sam, Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's no progress. The Spark UI doesn't even show up. -Ashic. From: samkiller@gmail.com Date: Tue, 9 Aug 2016 16:21:27 +0200 Subject: Re: Spark join and large temp files To: as...@live.com CC: deepakmc...@gmail.com;

Re: Spark join and large temp files

2016-08-09 Thread Sam Bessalah
Have you tried to broadcast your small table in order to perform your join? joined = bigDF.join(broadcast(smallDF), ...) On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab wrote: > Hi Deepak, > No...not really. Upping the disk size is a solution, but more expensive as > you
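
A slightly fuller version of the snippet quoted above (the join key "id" comes from elsewhere in the thread; bigDF and smallDF are placeholders):

    import org.apache.spark.sql.functions.broadcast

    // The broadcast hint ships the small DataFrame to every executor so Spark
    // performs a map-side (broadcast hash) join, avoiding the shuffle and the
    // large temp files it produces.
    val joined = bigDF.join(broadcast(smallDF), "id")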

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
Hi Deepak, No... not really. Upping the disk size is a solution, but more expensive as you can't easily attach EBS volumes to EMR clusters configured with data pipelines (which is what we're doing). I've tried collecting the 1.5G dataset in a hashmap and broadcasting. Timeouts seem to prevent

Re: update specifc rows to DB using sqlContext

2016-08-09 Thread Mich Talebzadeh
Hi, 1. What is the underlying DB, say Hive etc.? 2. Is the table transactional, or are you going to insert/overwrite into the same table? 3. Can you do all this in the database itself, assuming it is an RDBMS? 4. Can you provide the SQL or pseudo code for such an update? HTH Dr Mich

update specifc rows to DB using sqlContext

2016-08-09 Thread sujeet jog
Hi, Is it possible to update certain column records in a DB from Spark? For example, I have 10 rows with 3 columns which are read through Spark SQL; I want to update specific column entries and write back to the DB, but since RDDs are immutable I believe this would be difficult. Is there a
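
Spark's JDBC writer only appends to or overwrites whole tables, so one common workaround - not taken from this thread - is to push row-level UPDATEs yourself from each partition. A hedged sketch; the JDBC URL, credentials, table, column names and the example transformation are all illustrative:

    import java.sql.DriverManager
    import org.apache.spark.sql.functions.{col, upper}

    // Read the rows to change through the JDBC data source and apply the change.
    val updates = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "my_table")
      .option("user", "user").option("password", "secret")
      .load()
      .withColumn("col_to_fix", upper(col("col_to_fix")))   // example transformation
      .select("id", "col_to_fix")

    // Issue the updates in batches, one JDBC connection per partition.
    updates.foreachPartition { rows =>
      val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "secret")
      val stmt = conn.prepareStatement("UPDATE my_table SET col_to_fix = ? WHERE id = ?")
      rows.foreach { r =>
        stmt.setString(1, r.getString(1))
        stmt.setLong(2, r.getLong(0))
        stmt.addBatch()
      }
      stmt.executeBatch()
      stmt.close()
      conn.close()
    }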

OrientDB through JDBC: query field names wrapped by double quote

2016-08-09 Thread Roberto Franchini
Hi to all, I'm the maintainer of the OrientDB JDBC driver. We are trying to fetch data into Spark from an OrientDB using the JDBC driver. I'm facing some issues: to gather metadata, Spark performs a "test query" of this form: select * from TABLE_NAME where 1=0 For this case, I wrote a workaround

[Spark 1.6]-increment value column based on condition + Dataframe

2016-08-09 Thread Divya Gehlot
Hi, I have a column with values like: 30 12 56 23 12 16 12 89 12 5 6 4 8. I need to create another column based on a condition on col("value") (whether it is above or below 30): the new column should restart from 0 at each row that meets the condition and increment otherwise, giving: 0 1 0 1 2 3 4 0 1 2 3 4 5. How can I create such an increment column? The grouping is happening based on some other cols
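
One way to build such a resetting counter with window functions - a sketch only; it assumes an explicit ordering column ("seq", invented here, since DataFrames have no inherent row order) and takes the reset condition as value >= 30 so the output matches the sample:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{row_number, sum, when}

    val spark = SparkSession.builder.appName("reset-counter").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 30), (2, 12), (3, 56), (4, 23), (5, 12), (6, 16), (7, 12),
                 (8, 89), (9, 12), (10, 5), (11, 6), (12, 4), (13, 8))
      .toDF("seq", "value")

    // A row that meets the condition starts a new group...
    val flagged = df.withColumn("reset", when($"value" >= 30, 1).otherwise(0))
    // ...the running sum of that flag is the group id...
    val grouped = flagged.withColumn("grp", sum("reset").over(Window.orderBy("seq")))
    // ...and the position within the group (minus 1) is the desired counter.
    val result = grouped.withColumn(
      "newColValue",
      row_number().over(Window.partitionBy("grp").orderBy("seq")) - 1)

    result.orderBy("seq").show()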

Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Aasish Kumar
Hi Sandeep, I have not enabled checkpointing. I will try enabling checkpointing and observe the memory pattern, but what do you really want to correlate with checkpointing? I don't know much about checkpointing. Thanks and rgds Aashish Kumar Software Engineer Avekshaa Technologies (P) Ltd.

Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Sandeep Nemuri
Hi Aashish, Do you have checkpointing enabled? If not, can you try enabling checkpointing and observe the memory pattern? Thanks, Sandeep On Tue, Aug 9, 2016 at 4:25 PM, Mich Talebzadeh wrote: > Hi Aashish, > > You are running in standalone mode with one node > >

Re: Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread Mich Talebzadeh
Hi Aashish, You are running in standalone mode with one node. As I read it, you start the master and 5 workers pop up from SPARK_WORKER_INSTANCES=5. I gather you use start-slaves.sh? Now that is the number of workers, and with low memory on them, port 8080 should show practically no memory used (idle). Also

Get distinct column data from grouped data

2016-08-09 Thread Selvam Raman
Example: sel1 test, sel1 test, sel1 ok, sel2 ok, sel2 test. Expected result: sel1, [test, ok]; sel2, [test, ok]. How can I achieve the above result using a Spark DataFrame? Please suggest. -- Selvam Raman "Shun bribery; hold your head high"

Spark Streaming Job Keeps growing memory over time

2016-08-09 Thread aasish.kumar
Hi, I am running Spark v1.6.1 on a single machine in standalone mode, having 64GB RAM and 16 cores. I have created five worker instances to get five executors, as in standalone mode there cannot be more than one executor per worker node. *Configuration*: SPARK_WORKER_INSTANCES 5

RE: Cumulative Sum function using Dataset API

2016-08-09 Thread Santoshakhilesh
You could check following link. http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark From: Jon Barksdale [mailto:jon.barksd...@gmail.com] Sent: 09 August 2016 08:21 To: ayan guha Cc: user Subject: Re: Cumulative Sum function using Dataset API I don't think that

Re: coalesce serialising earlier work

2016-08-09 Thread ayan guha
How about running a count step to force Spark to materialise the data frame and then repartitioning to 1? On 9 Aug 2016 17:11, "Adrian Bridgett" wrote: > In short: df.coalesce(1).write seems to make all the earlier calculations > about the dataframe go through a single task
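
A sketch of that suggestion - materialise the DataFrame first so the expensive work runs across many tasks, then shrink to a single output file; df and the paths are placeholders:

    // Cache and force evaluation while the full set of partitions is still in play.
    df.cache()
    df.count()

    // Only the already-computed rows are then funnelled through the single write task.
    // repartition(1) inserts a shuffle, so the upstream stages keep their parallelism,
    // unlike coalesce(1), which collapses them into one task.
    df.repartition(1).write.parquet("/path/to/single-file-output")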

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Another follow-up: I have narrowed it down to the first 32 partitions, but from that point it gets strange. Here's the error: In [68]: spark.read.parquet(*subdirs[:32]) ... AnalysisException: u'Unable to infer schema for ParquetFormat at

Re: saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Maybe this problem is not so easy to understand, so I attached my full code. Hope this could help in solving the problem. Thanks. Best regards! San.Luo - Original Message - From: To: "user" Subject: saving DF to HDFS in

Re: SparkR error when repartition is called

2016-08-09 Thread Felix Cheung
I think it's saying a string isn't being sent properly from the JVM side. Does it work for you if you change the dapply UDF to something simpler? Do you have any log from YARN? _ From: Shane Lee >

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Some follow-up information: - dataset size is ~150G - the data is partitioned by one of the columns, _locality_code: $ ls -1 _locality_code=AD _locality_code=AE _locality_code=AF _locality_code=AG _locality_code=AI _locality_code=AL _locality_code=AM _locality_code=AN _locality_code=YE

Spark Job Doesn't End on Mesos

2016-08-09 Thread Todd Leo
Hi, I’m running Spark jobs on Mesos. When the job finishes, *SparkContext* is manually closed by sc.stop(). Then Mesos log shows: I0809 15:48:34.132014 11020 sched.cpp:1589] Asked to stop the driver I0809 15:48:34.132181 11277 sched.cpp:831] Stopping framework

RE: bisecting kmeans model tree

2016-08-09 Thread Huang, Qian
There seems to be an existing JIRA for this. https://issues.apache.org/jira/browse/SPARK-11664 From: Yanbo Liang [mailto:yblia...@gmail.com] Sent: Saturday, July 16, 2016 6:18 PM To: roni Cc: user@spark.apache.org Subject: Re: bisecting kmeans model tree Currently we do

Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Hi everyone I tried upgrading Spark-1.6.2 to Spark-2.0.0 but run into an issue reading the existing data. Here's how the traceback looks in spark-shell: scala> spark.read.parquet("/path/to/data") org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data.

saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Hi there: I have a problem with saving a DF to HDFS in parquet format being very slow. I attached a pic which shows that a lot of time is spent in getting the result. The code is: streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData") I don't quite understand why my app is so slow in

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-09 Thread Sean Owen
Fewer features doesn't necessarily mean better predictions, because indeed you are subtracting data. It might, because when done well you subtract more noise than signal. It is usually done to make data sets smaller or more tractable or to improve explainability. But you have an unsupervised

Re: SparkR error when repartition is called

2016-08-09 Thread Shane Lee
Sun, I am using spark in yarn client mode in a 2-node cluster with hadoop-2.7.2. My R version is 3.3.1. I have the following in my spark-defaults.conf: spark.executor.extraJavaOptions = -XX:+PrintGCDetails

coalesce serialising earlier work

2016-08-09 Thread Adrian Bridgett
In short: df.coalesce(1).write seems to make all the earlier calculations about the dataframe go through a single task (rather than running on multiple tasks and then sending the final dataframe through a single worker). Any idea how we can force the job to run in parallel? In more detail: We

Re: SparkR error when repartition is called

2016-08-09 Thread Sun Rui
I can’t reproduce your issue with len=1 in local mode. Could you give more environment information? > On Aug 9, 2016, at 11:35, Shane Lee wrote: > > Hi All, > > I am trying out SparkR 2.0 and have run into an issue with repartition. > > Here is the R code

How to get the parameters of bestmodel while using paramgrid and crossvalidator?

2016-08-09 Thread colin
I'm using CrossValidator and a param grid to find the best parameters for my model. After cross-validation I got a CrossValidatorModel, but when I want to get the parameters of my pipeline, there's nothing in the parameter list of bestModel. Here's the code running in a Jupyter notebook: sq=SQLContext(sc)
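
The thread's code is Python, but as an illustration of where the tuned parameters live, here is a hedged Scala sketch (it assumes the CrossValidator was fitted over a Pipeline estimator):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.tuning.CrossValidatorModel

    // Given a fitted CrossValidatorModel, print the parameters of the winning model
    // and the average metric for each point of the parameter grid.
    def printBestParams(cvModel: CrossValidatorModel): Unit = {
      val best = cvModel.bestModel.asInstanceOf[PipelineModel]
      best.stages.foreach(stage => println(stage.extractParamMap()))
      cvModel.avgMetrics.zipWithIndex.foreach { case (metric, i) =>
        println(s"grid point $i -> $metric")
      }
    }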