Re: Saving to JDBC

2015-12-18 Thread Akhil Das
You will have to order the columns properly before writing, or change the column order in the actual table to match your job's output. Thanks Best Regards On Tue, Dec 15, 2015 at 1:47 AM, Bob Corsaro wrote: > Is there any way to map pyspark.sql.Row columns to JDBC table
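For illustration, a minimal sketch of the reordering in Scala, assuming a hypothetical target table with columns (id, name, score) in that order; the JDBC URL, table name, and credentials are placeholders:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "dbuser")       // placeholder credentials
    props.setProperty("password", "dbpass")

    // Select the columns in the same order as the table definition before writing.
    val ordered = df.select("id", "name", "score")
    ordered.write.mode(SaveMode.Append)
      .jdbc("jdbc:postgresql://dbhost:5432/mydb", "mytable", props)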

Re: UNSUBSCRIBE

2015-12-18 Thread Akhil Das
Send the mail to user-unsubscr...@spark.apache.org; read more over here: http://spark.apache.org/community.html Thanks Best Regards On Tue, Dec 15, 2015 at 3:39 AM, Mithila Joshi wrote: > unsubscribe > > On Mon, Dec 14, 2015 at 4:49 PM, Tim Barthram

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
hi, I think that people have reported the same issue elsewhere, and this should be registered as a bug in SPARK https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html Regards, Gourav On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta wrote: > Hi

Re: How to do map join in Spark SQL

2015-12-18 Thread Akhil Das
You can broadcast your JSON data and then do a map-side join. This article is a good start: http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/ Thanks Best Regards On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov wrote: > I have a big folder of ORC files. Files
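A rough sketch of that broadcast pattern in Scala, assuming the JSON side is small enough to collect to the driver; the paths and the two-column key/value layout are hypothetical:

    // Collect the small JSON dataset into an in-memory map and broadcast it.
    val smallLookup: Map[String, String] = sqlContext.read.json("/path/small.json")
      .map(r => (r.getString(0), r.getString(1)))
      .collect()
      .toMap
    val bcast = sc.broadcast(smallLookup)

    // Map-side join: each record of the big (ORC-derived) pair RDD probes the
    // broadcast map locally, so no shuffle is needed.
    val joined = bigRdd.flatMap { case (key, value) =>
      bcast.value.get(key).map(other => (key, (value, other)))
    }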

Re: Using Spark to process JSON with gzip filed

2015-12-18 Thread Akhil Das
Something like this? This one uses ZLIB compression; you can replace the decompression logic with the GZip one in your case.

    compressedStream.map(x => {
      val inflater = new Inflater()
      inflater.setInput(x.getPayload)
      val decompressedData = new Array[Byte](x.getPayload.size * 2)
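For the GZIP case specifically, a minimal sketch using java.util.zip; it keeps the hypothetical getPayload accessor from the snippet above and assumes commons-io is on the classpath for IOUtils:

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream
    import org.apache.commons.io.IOUtils

    val decompressed = compressedStream.map { x =>
      // Wrap the compressed bytes in a GZIP stream and read it fully.
      val in = new GZIPInputStream(new ByteArrayInputStream(x.getPayload))
      try new String(IOUtils.toByteArray(in), "UTF-8") finally in.close()
    }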

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, I have a table loaded directly from an S3 location, and even a self-join on that cached table is causing the data to be read from S3 again. The query plan is mentioned below:

    == Parsed Logical Plan ==
    Aggregate [count(1) AS count#1804L]
    Project [user#0,programme_key#515]
    Join Inner,

Re: pyspark + kafka + streaming = NoSuchMethodError

2015-12-18 Thread Christos Mantas
Thank you, Luciano, Shixiong. I thought the "_2.11" part referred to the Kafka version - an unfortunate coincidence. Indeed

    spark-submit --jars spark-streaming-kafka-assembly_2.10-1.5.2.jar my_kafka_streaming_wordcount.py

OR

    spark-submit --packages

Difference between DataFrame.cache() and hiveContext.cacheTable()?

2015-12-18 Thread Sahil Sareen
Is there any difference between the following snippets:

    val df = hiveContext.createDataFrame(rows, schema)
    df.registerTempTable("myTable")
    df.cache()

and

    val df = hiveContext.createDataFrame(rows, schema)
    df.registerTempTable("myTable")
    hiveContext.cacheTable("myTable")

-Sahil

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Deepak Sharma
I have never tried this, but there are YARN client APIs that you can use in your Spark program to get the application ID. Here is the link to the YarnClient javadoc: http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html getApplications() is the method for your
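A sketch of that API in Scala; untested, and it assumes the Hadoop configuration on the classpath points at your ResourceManager:

    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import scala.collection.JavaConverters._

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()
    // List every application known to the ResourceManager with its ID and state.
    yarnClient.getApplications.asScala.foreach { report =>
      println(s"${report.getApplicationId} ${report.getName} ${report.getYarnApplicationState}")
    }
    yarnClient.stop()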

Re: Unable to get json for application jobs in spark 1.5.0

2015-12-18 Thread Akhil Das
Which version of Spark are you using? You can test this by opening up a spark-shell, firing a simple job (sc.parallelize(1 to 100).collect()) and then accessing http://sigmoid-driver:4040/api/v1/applications/Spark%20shell/jobs Thanks Best Regards On Tue, Dec 15, 2015

Re: about spark on hbase

2015-12-18 Thread Akhil Das
First you create the HBase configuration:

    val hbaseTableName = "paid_daylevel"
    val hbaseColumnName = "paid_impression"
    val hconf = HBaseConfiguration.create()
    hconf.set("hbase.zookeeper.quorum", "sigmoid-dev-master")
    hconf.set("hbase.zookeeper.property.clientPort",
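From there, a minimal sketch of reading the table into an RDD via the Hadoop input format; it assumes the hbase-server artifact (which provides TableInputFormat) is on the classpath, and the column family name "cf" is a placeholder:

    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.util.Bytes

    hconf.set(TableInputFormat.INPUT_TABLE, hbaseTableName)
    val hbaseRdd = sc.newAPIHadoopRDD(hconf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Pull one column value out of each row.
    val values = hbaseRdd.map { case (_, result) =>
      Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes(hbaseColumnName)))
    }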

Re: security testing on spark ?

2015-12-18 Thread Akhil Das
If port 7077 is open to the public on your cluster, that's all you need to take over the cluster. You can read a bit about it here: https://www.sigmoid.com/securing-apache-spark-cluster/ You can also look at this small exploit I wrote: https://www.exploit-db.com/exploits/36562/ Thanks Best

Error on using updateStateByKey

2015-12-18 Thread Abhishek Anand
I am trying to use updateStateByKey but am receiving the following error (Spark version 1.4.0). Can someone please point out what might be the possible reason for this error?

    The method updateStateByKey(Function2) in the type JavaPairDStream is
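For reference, the Java API expects the update function to be a Function2<List<V>, Optional<S>, Optional<S>>, and a mismatch in those type parameters typically produces this kind of compile error. In Scala the equivalent shape is (Seq[V], Option[S]) => Option[S]; a minimal running-count sketch with hypothetical names:

    // pairDStream: DStream[(String, Int)]
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    val runningCounts = pairDStream.updateStateByKey[Int](updateFunc)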

Re: Large number of conf broadcasts

2015-12-18 Thread Anders Arpteg
Awesome, thanks for the PR Koert! /Anders On Thu, Dec 17, 2015 at 10:22 PM Prasad Ravilla wrote: > Thanks, Koert. > > Regards, > Prasad. > > From: Koert Kuipers > Date: Thursday, December 17, 2015 at 1:06 PM > To: Prasad Ravilla > Cc: Anders Arpteg, user > > Subject: Re:

Re: spark master process shutdown for timeout

2015-12-18 Thread Akhil Das
Did you happen to have a look at this? https://issues.apache.org/jira/browse/SPARK-9629 Thanks Best Regards On Thu, Dec 17, 2015 at 12:02 PM, yaoxiaohua wrote: > Hi guys, > > I have two nodes used as spark master, spark1, spark2 > > Spark 1.4.0 > > Jdk

how to turn off spark streaming gracefully ?

2015-12-18 Thread kali.tumm...@gmail.com
Hi All, Imagine I have production Spark Streaming Kafka (direct connection) subscriber and publisher jobs running, which publish and subscribe (receive) data from a Kafka topic, and I save one day's worth of data using dstream.slice to a daily Cassandra table (so I create the daily table before

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Kyle Lin
Hello there I have the same requirement. I submit a streaming job in yarn-cluster mode. If I want to shut down this endless YARN application, I have to find out the application ID by myself and use "yarn application -kill" to kill the application. Therefore, if I can get the returned application

Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
Spark 1.5.2

    dfOld.registerTempTable("oldTableName")
    sqlContext.cacheTable("oldTableName")
    //
    // do something
    //
    dfNew.registerTempTable("oldTableName")
    sqlContext.cacheTable("oldTableName")

Now when I use the "oldTableName" table I do get the latest contents from dfNew but do the

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Ted Yu
CacheManager#cacheQuery() is called where:

    * Caches the data produced by the logical representation of the given [[Queryable]].
    ...
    val planToCache = query.queryExecution.analyzed
    if (lookupCachedData(planToCache).nonEmpty) {

Is the schema for dfNew different from that of dfOld?

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Gourav Sengupta
Hi, the attached DAG shows that for the same table (self-join) SPARK is unnecessarily getting data from S3 for one side of the join, whereas it is able to use the cache for the other side. Regards, Gourav On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta wrote: > Hi, >

Re: HiveContext Self join not reading from cache

2015-12-18 Thread Ted Yu
The picture is a bit hard to read. I did a brief search but haven't found a JIRA for this issue. Consider logging a SPARK JIRA. Cheers On Fri, Dec 18, 2015 at 4:37 AM, Gourav Sengupta wrote: > Hi, > > the attached DAG shows that for the same table (self join) SPARK

Re: how to turn off spark streaming gracefully ?

2015-12-18 Thread Cody Koeninger
If you're really doing a daily batch job, have you considered just using KafkaUtils.createRDD rather than a streaming job? On Fri, Dec 18, 2015 at 5:04 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Hi All, > > Imagine I have a Production spark streaming kafka (direct connection)

Re: how to turn off spark streaming gracefully ?

2015-12-18 Thread Cody Koeninger
You'll need to keep track of the offsets. On Fri, Dec 18, 2015 at 9:51 AM, sri hari kali charan Tummala < kali.tumm...@gmail.com> wrote: > Hi Cody, > > KafkaUtils.createRDD totally make sense now I can run my spark job once in > 15 minutes extract data out of kafka and stop ..., I rely on kafka
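A sketch of the batch approach with explicit offsets (Spark 1.x spark-streaming-kafka for Kafka 0.8); the topic name, broker, and offset values are placeholders that you would load from wherever you persist them between runs:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // One range per topic-partition: [fromOffset, untilOffset)
    val offsetRanges = Array(OffsetRange("mytopic", 0, 0L, 1000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // After processing, persist each untilOffset to use as the next run's
    // fromOffset, so records are neither re-read nor skipped.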

Limit of application submission to cluster

2015-12-18 Thread Saif.A.Ellafi
Hello everyone, I am testing some parallel program submission to a standalone cluster. Everything works alright; the problem is, for some reason, I can't submit more than 3 programs to the cluster. The fourth one, whether legacy or REST, simply hangs until one of the first three completes. I

Re: Spark with log4j

2015-12-18 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtEor1vYWbsW which mentioned: SPARK-11105 Disitribute the log4j.properties files from the client to the executors FYI On Fri, Dec 18, 2015 at 7:23 AM, Kalpesh Jadhav < kalpesh.jad...@citiustech.com> wrote: > Hi all, > > > > I am new to spark, I am
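One commonly suggested workaround is to ship the properties file with --files and point both JVMs at it; a sketch, where the paths and class name are placeholders (in client mode the driver-side setting may need an absolute file: URL instead):

    spark-submit \
      --files /local/path/log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --class com.example.MyApp myapp.jar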

Re: how to turn off spark streaming gracefully ?

2015-12-18 Thread sri hari kali charan Tummala
Hi Cody, KafkaUtils.createRDD totally makes sense now: I can run my Spark job once every 15 minutes, extract data out of Kafka, and stop ... I rely on Kafka offsets for incremental data, am I right? So no duplicate data will be returned. Thanks Sri On Fri, Dec 18, 2015 at 2:41 PM, Cody Koeninger

Spark with log4j

2015-12-18 Thread Kalpesh Jadhav
Hi all, I am new to Spark. I am trying to use log4j for logging in my application, but the logs are not getting written to the specified file. I created the application using Maven, and kept the log.properties file in the resources folder. The application is written in Scala. If there is any

Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Prasad Ravilla
Hi, I am running into a performance issue when joining data frames created from Avro files using the spark-avro library. The data frames are created from 120K Avro files and the total size is around 1.5 TB. The two data frames are huge, with billions of records. The join of these two

Configuring log4j

2015-12-18 Thread Afshartous, Nick
Hi, Am trying to configure log4j on an AWS EMR 4.2 Spark cluster for a streaming job set in client mode. I changed /etc/spark/conf/log4j.properties to use a FileAppender. However the INFO logging still goes to the console. Thanks for any suggestions, -- Nick From the console:

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Ted Yu
This method in CacheManager:

    private[sql] def lookupCachedData(plan: LogicalPlan): Option[CachedData] = readLock {
      cachedData.find(cd => plan.sameResult(cd.plan))

led me to the following in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:

    def

Re: [Spark-1.5.2][Hadoop-2.6][Spark SQL] Cannot run queries in SQLContext, getting java.lang.NoSuchMethodError

2015-12-18 Thread Matheus Ramos
That was exactly the problem, Michael; more details in this post: http://stackoverflow.com/questions/34184079/cannot-run-queries-in-sqlcontext-from-apache-spark-sql-1-5-2-getting-java-lang Matheus On Wed, Dec 9, 2015 at 4:43 PM, Michael Armbrust wrote: >

Spark batch getting hung up

2015-12-18 Thread SRK
Hi, My Spark batch job sometimes seems to hang for a long time before it starts the next stage or exits. Basically it happens when a stage has mapPartitions/foreachPartition. Any idea as to why this is happening? Thanks, Swetha

Question about Spark Streaming checkpoint interval

2015-12-18 Thread Lan Jiang
Need some clarification about the documentation. According to Spark doc "the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
Thanks Ted! Yes, the schema might be different or the same. What would be the answer for each situation? On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu wrote: > CacheManager#cacheQuery() is called where: > * Caches the data produced by the logical representation of the given >

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
So I looked at the function; my only worry is that the cache should be cleared when I'm overwriting the cache with the same table name. I did this experiment and the UI shows the table as not cached, but I want to confirm this. In addition to not using the old table values, is it actually

Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Ted Yu
Can you try the latest 1.6.0 RC, which includes SPARK-1 ? Cheers On Fri, Dec 18, 2015 at 7:38 AM, Prasad Ravilla wrote: > Hi, > > I am running into a performance issue when joining data frames created from > avro files using the spark-avro library. > > The data frames are

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Ted Yu
When a second attempt is made to cache df3, which has the same schema as the first DataFrame, you would see the warning below:

    scala> sqlContext.cacheTable("t1")

    scala> sqlContext.isCached("t1")
    res5: Boolean = true

    scala> sqlContext.sql("select * from t1").show
    +---+---+
    |  a|  b|
    +---+---+
    |  1|  1|

hive on spark

2015-12-18 Thread Ophir Etzion
During spark-submit when running Hive on Spark I get:

    Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
        Provider org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated
    Caused by: java.lang.IllegalAccessError: tried to access method

Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
From the UI I see two rows for this on a streaming application:

    RDD Name                      | Storage Level                     | Cached Partitions | Fraction Cached | Size in Memory | Size in ExternalBlockStore | Size on Disk
    In-memory table myColorsTable | Memory Deserialized 1x Replicated | 2                 | 100%            | 728.2 KB       | 0.0 B                      | 0.0 B
    In-memory table myColorsTable | Memory

Re: seriazable error in apache spark job

2015-12-18 Thread Shixiong Zhu
Looks like you have a reference to some Akka class. Could you post your code? Best Regards, Shixiong Zhu 2015-12-17 23:43 GMT-08:00 Pankaj Narang : > I am encountering the below error. Can somebody guide ? > > Something similar is on this link >

ALS predictAll does not generate all the user/item ratings

2015-12-18 Thread Roberto Pagliari
I created the following data, data.file:

    1 1
    1 2
    1 3
    2 4
    3 5
    4 6
    5 7
    6 1
    7 2
    8 8

The following code:

    def parse_line(line):
        tokens = line.split(' ')
        return (int(tokens[0]), int(tokens[1])), 1.0

    lines = sc.textFile('./data.file')
    linesTest = sc.textFile('./data.file')

Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Prasad Ravilla
Changing the equality check from "<=>" to "===" solved the problem. Performance skyrocketed. I am wondering why "<=>" causes a performance degradation?

    val dates = new RetailDates()
    val dataStructures = new DataStructures()

    // Reading CSV Format input files -- retailDates
    // This DF has 75 records
    val

Re: "Ambiguous references" to a field set in a partitioned table AND the data

2015-12-18 Thread sim
See https://issues.apache.org/jira/browse/SPARK-7301 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ambiguous-references-to-a-field-set-in-a-partitioned-table-AND-the-data-tp22325p25740.html

Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Michael Armbrust
This is fixed in Spark 1.6. On Fri, Dec 18, 2015 at 3:06 PM, Prasad Ravilla wrote: > Changing equality check from “<=>”to “===“ solved the problem. > Performance skyrocketed. > > I am wondering why “<=>” cause a performance degrade? > > val dates = new RetailDates() > val

How to run multiple Spark jobs as a workflow that takes input from a Streaming job in Oozie

2015-12-18 Thread SRK
Hi, How do we run multiple Spark jobs that take Spark Streaming data as input, as a workflow in Oozie? We have to run our Streaming job first and then have a workflow of Spark batch jobs process the data. Any suggestions on this would be of great help. Thanks!

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Andrew Or
Hi Roy, I believe Spark just gets its application ID from YARN, so you can just do `sc.applicationId`. -Andrew 2015-12-18 0:14 GMT-08:00 Deepak Sharma : > I have never tried this but there is yarn client api's that you can use in > your spark program to get the

Re: which aws instance type for shuffle performance

2015-12-18 Thread Andrew Or
Hi Rastan, Unless you're using off-heap memory or starting multiple executors per machine, I would recommend the r3.2xlarge option, since you don't actually want gigantic heaps (100GB is more than enough). I've personally run Spark on a very large scale with r3.8xlarge instances, but I've been

Re: Configuring log4j

2015-12-18 Thread Afshartous, Nick
Found the issue: a conflict between setting Java options in both spark-defaults.conf and in the spark-submit command. -- Nick

Re: Question about Spark Streaming checkpoint interval

2015-12-18 Thread Shixiong Zhu
You are right. "checkpointInterval" is only for data checkpointing. "metadata checkpoint" is done for each batch. Feel free to send a PR to add the missing doc. Best Regards, Shixiong Zhu 2015-12-18 8:26 GMT-08:00 Lan Jiang : > Need some clarification about the documentation.
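For concreteness, a minimal sketch of setting the data checkpoint interval explicitly, assuming a 5-second batch interval and the 10-sliding-intervals guideline from the docs; the path and names are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("hdfs:///checkpoints/myapp")    // where checkpoint data is written

    val state = pairs.updateStateByKey(updateFunc) // stateful DStream requires checkpointing
    state.checkpoint(Seconds(50))                  // 10 x the 5s batch interval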

Re: Dynamic jar loading

2015-12-18 Thread Jim Lohse
I am going to say no, but have not actually tested this. Just going on this line in the docs: http://spark.apache.org/docs/latest/configuration.html

    spark.driver.extraClassPath    (none)
    Extra classpath entries to prepend to the classpath of the driver.
    Note: In client mode, this config

Re: Limit of application submission to cluster

2015-12-18 Thread Andrew Or
Hi Saif, have you verified that the cluster has enough resources for all 4 programs? -Andrew 2015-12-18 5:52 GMT-08:00 : > Hello everyone, > > I am testing some parallel program submission to a stand alone cluster. > Everything works alright, the problem is, for

Re: imposed dynamic resource allocation

2015-12-18 Thread Andrew Or
Hi Antony, The configuration to enable dynamic allocation is per-application. If you only wish to enable this for one of your applications, just set `spark.dynamicAllocation.enabled` to true for that application only. The way it works under the hood is that application will start sending
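A sketch of enabling it for a single application at submit time; the executor bounds and class name are placeholders, and note that dynamic allocation also requires the external shuffle service:

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=1 \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      --class com.example.MyApp myapp.jar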

Re: Is DataFrame.groupBy supposed to preserve order within groups?

2015-12-18 Thread Michael Armbrust
You need to use window functions to get this kind of behavior. Or use max and a struct ( http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record) On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol < timothee.cara...@gmail.com> wrote: > Hi all, > > I tried to do something
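A sketch of the max-and-struct trick in Scala: structs compare field by field, so putting the ordering column first makes max(...) pick the latest row per group. The column names are hypothetical:

    import org.apache.spark.sql.functions.{max, struct}

    val latest = df
      .groupBy($"id")
      .agg(max(struct($"ts", $"value")).as("latest"))  // latest (ts, value) per id
      .select($"id", $"latest.ts", $"latest.value")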

Re: which aws instance type for shuffle performance

2015-12-18 Thread Alexander Pivovarov
Andrew, it's going to be 4 executor JVMs on each r3.8xlarge. Rastan, you can run a quick test using an EMR Spark cluster on spot instances and see which configuration works better. Without the tests it is all speculation. On Dec 18, 2015 1:53 PM, "Andrew Or" wrote: > Hi Rastan,

Spark Streaming, PySpark 1.3, randomly losing connection

2015-12-18 Thread YaoPau
Hi - I'm running Spark Streaming using PySpark 1.3 in yarn-client mode on CDH 5.4.4. The job sometimes runs a full 24 hrs, but more often it fails sometime during the day. I'm getting several vague errors that I don't see much about when searching online: - py4j.Py4JException: Error while

how to fetch all of data from hbase table in spark java

2015-12-18 Thread Sateesh Karuturi
Hello experts... I am new to Spark. Can anyone please explain how to fetch data from an HBase table in Spark with Java? Thanks in Advance...