How do I parallelize Spark Jobs at Executor Level?

2015-10-28 Thread Vinoth Sankar
Hi, I'm reading and filtering a large number of files using Spark. It's getting parallelized only at the Spark driver level. How do I make it parallelize down to the executor (worker) level? Refer to the following sample. Is there any way to iterate the localIterator in parallel? Note: I use Java 1.7. JavaRDD

Re: Spark Core Transitive Dependencies

2015-10-28 Thread Furkan KAMACI
Hi Deng, Could you give an example of which libraries you include for your purpose? Kind Regards, Furkan KAMACI On Wed, Oct 28, 2015 at 4:07 AM, Deng Ching-Mallete wrote: > Hi, > > The spark assembly jar already includes the spark core libraries plus > their transitive

Why is no predicate pushdown performed when using Hive (HiveThriftServer2)?

2015-10-28 Thread Martin Senne
Hi all, # Program Sketch 1. I create a HiveContext `hiveContext` 2. With that context, I create a DataFrame `df` from a JDBC relational table. 3. I register the DataFrame `df` via df.registerTempTable("TESTTABLE") 4. I start a HiveThriftServer2 via

SparkR 1.5.1 ClassCastException when working with CSV files

2015-10-28 Thread rporcio
Hi, When I'm working with csv files in R using SparkR, I get a ClassCastException during the execution of SparkR methods. The process below works fine in 1.4.1, but it is broken from 1.5.0 onward. (I will use the flights csv file from the examples as a reference, but I can reproduce this with any csv

Prevent partitions from moving

2015-10-28 Thread t3l
I have a cluster with 2 nodes (32 CPU cores each). My data is distributed evenly, but the processing times for each partition can vary greatly. Now, sometimes Spark seems to conclude from the current workload on both nodes that it might be better to shift one partition from node1 to node2 (because

Re: Getting info from DecisionTreeClassificationModel

2015-10-28 Thread Yanbo Liang
AFAIK, you cannot traverse the tree from the rootNode of DecisionTreeClassificationModel, because the Node type does not carry information about its children. The InternalNode type has children information, but it is private, so users cannot access it. I think the best way to get the probability of each

nested select is not working in spark sql

2015-10-28 Thread Kishor Bachhav
Hi, I am trying to execute the query below in Spark SQL, but it throws an exception: select n_name from NATION where n_regionkey = (select r_regionkey from REGION where r_name='ASIA') Exception: Exception in thread "main" java.lang.RuntimeException: [1.55] failure: ``)'' expected but identifier r_regionkey
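[Editor's note] Spark SQL's parser at this point does not support scalar subqueries, so one workaround is to evaluate the inner query on its own and splice the result into the outer one. A minimal sketch, assuming a SQLContext named sqlContext and that NATION and REGION are registered tables; rewriting the query as a join between NATION and REGION would also work:

    // Run the subquery separately and capture the value it returns.
    val regionKey = sqlContext
      .sql("SELECT r_regionkey FROM REGION WHERE r_name = 'ASIA'")
      .first()
      .get(0)

    // Substitute the value into the outer query.
    val names = sqlContext.sql(s"SELECT n_name FROM NATION WHERE n_regionkey = $regionKey")
    names.show()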

Hive Version

2015-10-28 Thread Bryan Jeffrey
All, I am using a HiveContext to create persistent tables from Spark. I am using the Spark 1.4.1 (Scala 2.11) built-in Hive support. What version of Hive does the built-in Spark Hive support correspond to? I ask because AVRO format and Timestamps in Parquet do not appear to be supported. I have searched a lot

Re: Hive Version

2015-10-28 Thread Michael Armbrust
Documented here: http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#interacting-with-different-versions-of-hive-metastore In 1.4.1 we compile against 0.13.1 On Wed, Oct 28, 2015 at 2:26 PM, Bryan Jeffrey wrote: > All, > > I am using a HiveContext to create
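[Editor's note] For reference, the linked guide describes configuration keys for pointing Spark at a metastore version other than the one it was compiled against. A rough sketch of setting them programmatically; the values are illustrative, so check the guide for the supported range:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf()
      .setAppName("metastore-example")
      .set("spark.sql.hive.metastore.version", "0.13.1")  // metastore version to talk to
      .set("spark.sql.hive.metastore.jars", "builtin")    // use the Hive jars bundled with Spark
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)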

Apache Spark on Raspberry Pi Cluster with Docker

2015-10-28 Thread Mark Bonnekessel
Hi, we are trying to set up Apache Spark on a Raspberry Pi cluster for educational use. Spark is installed in a Docker container and all necessary ports are exposed. After we start the master and workers, all workers are listed as alive in the master web UI (http://master:8080

Building spark-1.5.x and MQTT

2015-10-28 Thread Bob Corsaro
Has anyone successfully built this? I'm trying to determine if there is a defect in the source package or something strange about my environment. I get a FileNotFound exception on MQTTUtils.class during the build of the MQTT module. The only workaround I've found is to remove the MQTT modules from

No way to supply hive-site.xml in yarn client mode?

2015-10-28 Thread Zoltan Fedor
Hi, We have a shared CDH 5.3.3 cluster and are trying to use Spark 1.5.1 on it in yarn client mode with Hive. I have compiled Spark 1.5.1 with SPARK_HIVE=true, but it seems I am not able to make Spark SQL pick up the hive-site.xml when running pyspark. hive-site.xml is located in

Re: SparkR 1.5.1 ClassCastException when working with CSV files

2015-10-28 Thread rporcio
It seems that the cause of this exception was the wrong version of the spark-csv package. After I upgraded it to the latest (1.2.0) version, the exception is gone and it works fine.

Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Bob Corsaro
Built from http://mirror.olnevhost.net/pub/apache/spark/spark-1.5.1/spark-1.5.1.tgz using the following command: build/mvn -DskipTests=true -Dhadoop.version=2.4.1 -P"hadoop-2.4,kinesis-asl,netlib-lgpl" package install build/mvn is from the packaged source. Tried on a couple of ubuntu boxen and

Re: How do I parallelize Spark Jobs at Executor Level?

2015-10-28 Thread Adrian Tanase
The first line distributes your fileList variable across the cluster as an RDD, partitioned using the default partitioner settings (e.g. the number of cores in your cluster). Each of your workers would get one or more slices of data (depending on how many cores each executor has), and the abstraction is
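[Editor's note] A minimal sketch of what that means in practice; fileList and the filter predicate are placeholders, and the file paths are assumed to be readable from every worker node:

    val filesRdd = sc.parallelize(fileList)              // fileList: Seq[String] of paths
    val matchingLines = filesRdd
      .flatMap { path =>
        // this closure runs on the executors, not on the driver
        scala.io.Source.fromFile(path).getLines().toList
      }
      .filter(_.contains("ERROR"))                       // placeholder criterion
    println(matchingLines.count())                       // only the count returns to the driver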

Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Ted Yu
MQTTUtils.class is generated from external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/MQTTUtils.scala What command did you use to build? Which release / branch were you building? Thanks On Wed, Oct 28, 2015 at 6:19 AM, Bob Corsaro wrote: > Has anyone successfully

Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
Hello. I am working to get a simple solution working using Spark SQL. I am writing streaming data to persistent tables using a HiveContext. Writing to a persistent non-partitioned table works well - I update the table using Spark streaming, and the output is available via Hive Thrift/JDBC. I

Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Colin Alstad
We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some inconsistent behavior in persisting DataFrames. df1 = sqlContext.read.parquet("df1.parquet") df1.count() > 161,100,982 df2 = sqlContext.read.parquet("df2.parquet") df2.count() > 67,498,706 join_df = df1.join(df2, 'id')

Re: Filter applied on merged Parquet schemas with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but

org.apache.spark.shuffle.FetchFailedException: Failed to connect to ..... on worker failure

2015-10-28 Thread kundan kumar
Hi, I am running a Spark Streaming job. I was testing fault tolerance by killing one of the workers using the kill -9 command. My understanding is that when I kill a worker, the job should not die and should resume execution. But I am getting the following error and my process is halted.

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-28 Thread Tathagata Das
Yeah, of course. Just create an RDD from jdbc, call cache()/persist(), then force it to be evaluated using something like count(). Once it is cached, you can use it in a StreamingContext. Because of the cache it should not access JDBC any more. On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru
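[Editor's note] A rough sketch of that recipe; the JDBC URL, table name and key column are placeholders, and `stream` stands for whatever DStream of key/value pairs the job already has:

    // Load the reference data once, key it, cache it and force evaluation.
    val props = new java.util.Properties()
    val referenceRdd = sqlContext.read
      .jdbc("jdbc:postgresql://dbhost/mydb", "reference_table", props)
      .rdd
      .map(row => (row.getLong(0), row))   // key by the first column for joining later
      .cache()
    referenceRdd.count()                   // materializes the cache; JDBC is read only here

    // Every batch reuses the cached RDD instead of going back to the database.
    val enriched = stream.transform(batchRdd => batchRdd.join(referenceRdd))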

Re: Spark-Testing-Base Q/A

2015-10-28 Thread Holden Karau
And now (before 1am California time :p) there is a new version of spark-testing-base which adds a Java base class for streaming tests. I noticed you were using 1.3, so I put in the effort to make this release work for Spark 1.3 to 1.5 (inclusive). On Wed, Oct 21, 2015 at 4:16 PM, Holden Karau

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-28 Thread Tathagata Das
However, if your executor dies, then it may reconnect to JDBC to reconstruct the RDD partitions that were lost. To prevent that, you can checkpoint the RDD to an HDFS-like filesystem (using rdd.checkpoint()). Then you are safe; it won't reconnect to JDBC. On Tue, Oct 27, 2015 at 11:17 PM, Tathagata
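[Editor's note] Continuing the same sketch, the checkpoint step would be wired in before the RDD's first action; the checkpoint directory is a placeholder:

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // any HDFS-like path works
    referenceRdd.checkpoint()                              // mark for checkpointing before the first action
    referenceRdd.count()                                   // action that writes the checkpoint files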

RE: SPARKONHBase checkpointing issue

2015-10-28 Thread Amit Hora
Thanks for sharing the link. Yes, I understand that accumulator and broadcast variable state is not recovered from the checkpoint, but is there any way by which I can say that the HBaseContext in this context should not be recovered from the checkpoint but rather must be reinitialized? -Original

RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-28 Thread Cheng, Hao
Hi Jerry, I've filed a bug in JIRA along with the fix: https://issues.apache.org/jira/browse/SPARK-11364 It would be greatly appreciated if you can verify the PR with your case. Thanks, Hao From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, October 28, 2015 8:51 AM To: Jerry Lam;

Re: SPARKONHBase checkpointing issue

2015-10-28 Thread Tathagata Das
Yes, the workaround is the same as the one suggested in the JIRA for accumulators and broadcast variables. Basically, make a singleton object which lazily initializes the HBaseContext. Because it is a singleton, it won't get serialized through the checkpoint. After recovering, it will be reinitialized
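[Editor's note] A minimal sketch of that singleton pattern, assuming the HBaseContext class from the SparkOnHBase project and its (SparkContext, Configuration) constructor:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext

    // Nothing in this object is written into the streaming checkpoint; after recovery
    // the first call to get() simply builds a fresh HBaseContext.
    object HBaseContextHolder {
      private var instance: HBaseContext = _

      def get(sc: SparkContext, conf: Configuration): HBaseContext = synchronized {
        if (instance == null) {
          instance = new HBaseContext(sc, conf)
        }
        instance
      }
    }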

Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Ted Yu
Using your command, I did get: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (test-jar-with-dependencies) on project spark-streaming-mqtt_2.10: Failed to create assembly: Error creating assembly archive test-jar-with-dependencies: Problem creating jar:

Spark/Kafka Streaming Job Gets Stuck

2015-10-28 Thread Afshartous, Nick
Hi, we are load testing our Spark 1.3 streaming job (reading from Kafka) and seeing a problem. This is running in AWS/YARN; the streaming batch interval is set to 3 minutes, and this is a ten-node cluster. Testing at 30,000 events per second, we are seeing the streaming job get stuck

Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Steve Loughran
> On 28 Oct 2015, at 13:19, Bob Corsaro wrote: > > Has anyone successfully built this? I'm trying to determine if there is a > defect in the source package or something strange about my environment. I get > a FileNotFound exception on MQTTUtils.class during the build of the

RE: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-28 Thread Saif.A.Ellafi
Hi, just a couple of cents: are your joining columns StringTypes (the id field)? I have recently reported a bug in which I was getting inconsistent results when filtering String fields in group operations. Saif From: Colin Alstad [mailto:colin.als...@pokitdok.com] Sent: Wednesday, October 28, 2015 12:39 PM

SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing the same DataFrame?

2015-10-28 Thread Anfernee Xu
Hi, I just want to understand the cost of DataFrame.registerTempTable(String) - is it just a trivial operation (like creating an object reference) in the master (Driver) JVM? And can I have multiple tables with different names referencing the same DataFrame? Thanks -- --Anfernee
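[Editor's note] For what it's worth, registerTempTable only records a name-to-logical-plan mapping in the driver's catalog, so it is cheap and no data is copied. A small sketch showing two names backed by the same DataFrame; the path and table names are placeholders:

    val df = sqlContext.read.parquet("some/path")   // placeholder path
    df.registerTempTable("events")                  // just a catalog entry in the driver
    df.registerTempTable("events_alias")            // a second name for the same logical plan
    sqlContext.sql("SELECT COUNT(*) FROM events_alias").show()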

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Susan Zhang
Have you tried partitionBy? Something like: hiveWindowsEvents.foreachRDD( rdd => { val eventsDataFrame = rdd.toDF() eventsDataFrame.write.mode(SaveMode.Append).partitionBy("windows_event_time_bin").saveAsTable("windows_event") }) On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
Susan, I did give that a shot -- I'm seeing a number of oddities: (1) 'Partition By' appears to only accept alphanumeric lower-case field names. It will work for 'machinename', but not 'machineName' or 'machine_name'. (2) When partitioning with maps included in the data, I get odd string conversion

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
All, One issue I'm seeing is that I start the Thrift server (for JDBC access) via the following: /spark/spark-1.4.1/sbin/start-thriftserver.sh --master spark://master:7077 --hiveconf "spark.cores.max=2" After about 40 seconds the Thrift server is started and available on default port 1. I

Re: Spark Core Transitive Dependencies

2015-10-28 Thread Deng Ching-Mallete
Hi Furkan, A few examples of libraries that we include are joda time, hbase libraries and spark-kafka (for streaming). We use the maven-assembly-plugin to build our assembly jar, btw. Thanks, Deng On Wed, Oct 28, 2015 at 9:10 PM, Furkan KAMACI wrote: > Hi Deng, > >

NullPointerException when caching DataFrame in Java (Spark 1.5.1)

2015-10-28 Thread Zhang, Jingyu
It is not a problem to use JavaRDD.cache() for 200M of data (all objects read from JSON format). But when I try to use DataFrame.cache(), it throws the exception below. My machine can cache 1 G of data in Avro format without any problem. 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in

Re: How to implement zipWithIndex as a UDF?

2015-10-28 Thread Benyi Wang
Thanks Michael. I should make my question more clear. This is the data type: StructType(Seq( StructField("uid", LongType), StructField("infos", ArrayType( StructType(Seq( StructField("cid", LongType), StructField("cnt", LongType) )) )) )) I want to explode
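[Editor's note] Depending on what is needed downstream, the built-in explode function may already cover the flattening part of this. A sketch against the schema above, where `df` is the DataFrame holding it:

    import org.apache.spark.sql.functions.explode

    // One output row per element of the infos array, with the struct fields pulled out.
    val flattened = df
      .select(df("uid"), explode(df("infos")).as("info"))
      .select("uid", "info.cid", "info.cnt")
    flattened.show()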

newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-28 Thread Andy Davidson
Hi, I just created a new cluster using the spark-ec2 script from the spark-1.5.1-bin-hadoop2.6 distribution. The master and slaves seem to be up and running. I am having a heck of a time figuring out how to submit apps. As a test I compiled the sample JavaSparkPi example. I have copied my jar file to

Re: newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-28 Thread Andy Davidson
I forgot to mention, I do not have a preference for the cluster manager. I chose the spark-1.5.1-bin-hadoop2.6 distribution because I want to use HDFS. I assumed this distribution would use YARN. Thanks Andy From: Andrew Davidson Date: Wednesday, October 28,

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
The second issue I'm seeing is an OOM issue when writing partitioned data. I am running Spark 1.4.1, Scala 2.11, Hadoop 2.6.1 & using the Hive libraries packaged with Spark. Spark was compiled using the following: mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-28 Thread Adrian Tanase
Does it work as expected with smaller batch or smaller load? Could it be that it's accumulating too many events over 3 minutes? You could also try increasing the parallelism via repartition to ensure smaller tasks that can safely fit in working memory. Sent from my iPhone > On 28 Oct 2015, at

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, Did you read the email I sent a few days ago? There are more issues with partitionBy down the road: https://www.mail-archive.com/user@spark.apache.org/msg39512.html Best Regards, Jerry > On Oct 28, 2015, at 4:52 PM,

RE: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan
Jerry, Thank you for the note. It sounds like you were able to get further than I have - any insight? Is it just Spark 1.4.1 vs. Spark 1.5? Regards, Bryan Jeffrey -Original Message- From: "Jerry Lam" Sent: 10/28/2015 6:29 PM To: "Bryan Jeffrey"

How to check whether my Spark jobs are parallelized or not

2015-10-28 Thread Vinoth Sankar
Hi, I'm reading a large number of files (mostly in the thousands) and filtering them through Spark based on some criteria, running the Spark application with two workers (4 cores each). I enforced parallelism by calling *sparkContext.parallelize(fileList)* in my Java code, but didn't see any performance improvement. And
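[Editor's note] One knob worth checking in this setup: parallelize also takes an explicit slice count, and with a small collection the default can yield fewer partitions than available cores, so only some of the 8 cores get tasks. A sketch with an illustrative partition count:

    // 2 workers x 4 cores = 8 cores; ask for at least that many partitions.
    val filesRdd = sparkContext.parallelize(fileList, 8)
    println(s"partitions = ${filesRdd.partitions.length}")  // then confirm in the UI that tasks land on both workers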

Re: [Spark Streaming] Why are some uncached RDDs growing?

2015-10-28 Thread Tathagata Das
UpdateStateByKey automatically caches its RDDs. On Tue, Oct 27, 2015 at 8:05 AM, diplomatic Guru wrote: > > Hello All, > > When I checked my running Stream job on WebUI, I can see that some RDDs > are being listed that were not requested to be cached. What more is that

Mllib explain feature for tree ensembles

2015-10-28 Thread Eugen Cepoi
Hey, Is there some kind of "explain" feature implemented in MLlib for the algorithms based on tree ensembles? Some method to which you would feed a single feature vector and which would return/print which features contributed to the decision, or how much each feature contributed "negatively" and

Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Yanbo Liang
Spark ML/MLlib provides featureImportances to estimate the importance of each feature. 2015-10-28 18:29 GMT+08:00 Eugen Cepoi : >
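[Editor's note] A short sketch of that API as it appears in the 1.5 ML pipeline; the input DataFrame and its column names are placeholders:

    import org.apache.spark.ml.classification.RandomForestClassifier

    // `training` is assumed to be a DataFrame with "label" and "features" columns.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
    val model = rf.fit(training)
    println(model.featureImportances)   // one normalized weight per feature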

Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Eugen Cepoi
I guess I will have to upgrade to spark 1.5, thanks! 2015-10-28 11:50 GMT+01:00 Yanbo Liang : > Spark ML/MLlib has provided featureImportances >

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, I think they fixed some memory issues in 1.4 for the partitioned table implementation. 1.5 does much better in terms of executor memory usage for generating partitioned tables. However, if your table has over some thousands of partitions, reading the partitions could be challenging. It

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Yana Kadiyska
For this issue in particular ( ERROR XSDB6: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db) -- I think it depends on where you start your application and HiveThriftserver from. I've run into a similar issue running a driver app first, which would

RE: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan
Yana, My basic use case is that I want to process streaming data and publish it to a persistent Spark table. After that, I want to make the published data (results) available via JDBC and Spark SQL to drive a web API. That would seem to require two drivers starting separate HiveContexts (one
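[Editor's note] One way to avoid the two-context problem is to start the Thrift server from inside the streaming driver itself, so JDBC clients and the streaming job share a single HiveContext. A sketch, assuming the hive-thriftserver module is on the classpath:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    // Tables registered or written through this context become visible to JDBC/ODBC clients.
    HiveThriftServer2.startWithContext(hiveContext)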

Collect Column as Array in Grouped DataFrame

2015-10-28 Thread saurfang
Sorry if this functionality already exists or has been asked about before, but I'm looking for an aggregate function in Spark SQL that allows me to collect a column into an array per group in a grouped DataFrame. For example, if I have the following table:
user, score
user1, 1
user2, 2
user1, 3
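[Editor's note] In the 1.5 line, one way to get this is the Hive collect_list (or collect_set) UDAF through a HiveContext. A sketch using the example table above, with `df` holding the user/score rows and `hiveContext` assumed:

    // Assumes a HiveContext named hiveContext and a DataFrame df with columns user, score.
    df.registerTempTable("scores")
    val grouped = hiveContext.sql(
      "SELECT `user`, collect_list(score) AS scores FROM scores GROUP BY `user`")
    grouped.show()   // e.g. user1 -> [1, 3], user2 -> [2]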