Re: Enforcing shuffle hash join

2016-07-04 Thread Sun Rui
You can try setting the “spark.sql.join.preferSortMergeJoin” option to false. For detailed join strategies, take a look at the source code of SparkStrategies.scala: /** * Select the proper physical plan for join based on joining keys and size of logical plan. * * At first, uses the
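
A minimal sketch of the two knobs mentioned in this thread, assuming a Spark 2.0-era SparkSession named spark (the DataFrames largeDf and smallDf and the join key "id" are placeholders); with broadcast disabled and sort-merge de-preferred, the planner may fall back to a shuffled hash join for suitable table sizes:

    spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val joined = largeDf.join(smallDf, Seq("id"))
    joined.explain()  // look for ShuffledHashJoin in the physical plan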

Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
What's the query? On Tue, Jul 5, 2016 at 2:28 PM, Lalitha MV wrote: > It picks sort merge join, when spark.sql.autoBroadcastJoinThreshold is > set to -1, or when the size of the small table is more than > spark.sql.autoBroadcastJoinThreshold. > > On Mon, Jul 4,

Re: Enforcing shuffle hash join

2016-07-04 Thread Lalitha MV
It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to -1, or when the size of the small table is more than spark.sql.autoBroadcastJoinThreshold. On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro wrote: > The join selection can be

Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
The join selection logic is described in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92 . If you have join keys, you can set `spark.sql.autoBroadcastJoinThreshold` to -1 to disable broadcast joins. Then, hash joins are

Spark Dataframe validating column names

2016-07-04 Thread Scott W
Hello, I'm processing events using Dataframes converted from a stream of JSON events (Spark Streaming), which eventually get written out in Parquet format. There are different JSON events coming in, so we use the schema inference feature of Spark SQL. The problem is that some of the JSON events contain
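
One common way around inference surprises is to declare the schema up front; a rough sketch assuming a 1.6-era SQLContext named sqlContext, with field names and paths purely illustrative:

    import org.apache.spark.sql.types._
    val eventSchema = StructType(Seq(
      StructField("eventType", StringType, nullable = true),
      StructField("timestamp", LongType, nullable = true),
      StructField("payload", StringType, nullable = true)))
    val events = sqlContext.read.schema(eventSchema).json("/path/to/events")  // skips inference
    events.write.mode("append").parquet("/path/to/parquet")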

Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when I have time. 2016-07-04 14:09 GMT-07:00 mshiryae : > Hi, > > I am trying to train a model with MultilayerPerceptronClassifier. > > It works on sample data from >

Re: How to handle update/deletion in Structured Streaming?

2016-07-04 Thread Tathagata Das
Input datasets which represent an input data stream only support appending of new rows, as the stream is modeled as an unbounded table where new data in the stream are new rows being appended to the table. For transformed datasets generated from the input dataset, rows can be updated and removed
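
A hedged sketch of that model using the 2.0-era API (the socket source and console sink are placeholders, and the API was still in flux at the time of this thread): the input stream is an append-only unbounded table, while an aggregation over it is a result table whose rows can change, so it is written with an output mode that re-emits updated results:

    val input = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
    val counts = input.groupBy("value").count()  // result-table rows are updated as data arrives
    val query = counts.writeStream
      .outputMode("complete")                    // re-emit the full result table on each trigger
      .format("console")
      .start()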

Pregel algorithm edge direction docs

2016-07-04 Thread svjk24
Hi, I'm looking through the Pregel algorithm Scaladocs and, at least in 1.5.1, there seems to be some contradiction in the specification for activeDirection: the direction of edges incident to a vertex that received a message in the previous round on which to run sendMsg. For
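
For context, a rough sketch of the call in question (GraphX 1.x API; graph here is a hypothetical Graph[Int, Int]), where activeDirection controls on which edges of message-receiving vertices sendMsg runs in the next round:

    import org.apache.spark.graphx._
    val result = Pregel(graph, initialMsg = 0, activeDirection = EdgeDirection.Out)(
      vprog = (id, attr, msg) => math.max(attr, msg),  // update vertex state from incoming message
      sendMsg = t => Iterator((t.dstId, t.srcAttr)),   // invoked only on the "active" edges
      mergeMsg = (a, b) => math.max(a, b))             // combine messages bound for one vertex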

Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
Found some warn and error messages in the driver log file: 2016-07-04 04:49:50,106 [main] WARN DataNucleus.General - Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you don't have multiple JAR versions of the same plugin in the classpath. The URL

Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
Please find the log from the RM below; before the FNFE there are no earlier errors in the driver log. 16/07/04 00:27:56 INFO mapreduce.TableInputFormatBase: Input split length: 0 bytes. 16/07/04 00:27:56 INFO executor.Executor: Executor is trying to kill task 56.0 in stage 2437.0 (TID 328047) 16/07/04 00:27:56 INFO

Re: log traces

2016-07-04 Thread Jean Georges Perrin
Hi Luis, Right... I manage all my Spark "things" through Maven; by that I mean I have a pom.xml with all the dependencies in it. Here it is: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0

Re: log traces

2016-07-04 Thread Jean Georges Perrin
Hey Anupam, Thanks... but no: I tried: SparkConf conf = new SparkConf().setAppName("my app").setMaster("local"); JavaSparkContext javaSparkContext = new JavaSparkContext(conf); javaSparkContext.setLogLevel("WARN"); SQLContext

Re: log traces

2016-07-04 Thread Anupam Bhatnagar
Hi Jean, How about using sc.setLogLevel("WARN")? You may add this statement after initializing the Spark Context. From the Spark API: "Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN". Here's the link in the Spark API.
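
A minimal Scala sketch of this suggestion (the thread's code is Java, but the call is the same on either context):

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf().setAppName("my app").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")  // valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN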

Re: log traces

2016-07-04 Thread Jean Georges Perrin
Thanks Mich, but what is SPARK_HOME when you run everything through Maven? > On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh wrote: > > check $SPARK_HOME/conf > > copy file log4j.properties.template to log4j.properties > > edit log4j.properties and set the log levels to

Re: log traces

2016-07-04 Thread Mich Talebzadeh
check $SPARK_HOME/conf, copy the file log4j.properties.template to log4j.properties, then edit log4j.properties and set the log levels to your needs. cat log4j.properties # Set everything to be logged to the console log4j.rootCategory=ERROR, console log4j.appender.console=org.apache.log4j.ConsoleAppender

Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread mshiryae
Hi, I am trying to train a model with MultilayerPerceptronClassifier. It works on the sample data from data/mllib/sample_multiclass_classification_data.txt with 4 features, 3 classes and layers [4, 4, 3]. But when I try to use other input files with other features and classes (from here for example:
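
For reference, a minimal sketch of the classifier setup being described (Spark ML API; trainingData is a placeholder DataFrame with "label" and "features" columns). The layers array must match the data: the first element is the number of features and the last is the number of classes:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(Array(4, 4, 3))  // 4 input features, one hidden layer of 4, 3 output classes
      .setMaxIter(100)
    val model = mlp.fit(trainingData)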

log traces

2016-07-04 Thread Jean Georges Perrin
Hi, I have installed Apache Spark via Maven. How can I control the volume of logs it displays on my system? I tried different locations for a log4j.properties, but none seems to work for me. Thanks for the help...

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Well, this will be apparent from the Environment tab of the GUI. It will show how the job is actually running. Jacek's point is correct. I suspect this is actually running in local mode, as it looks like it is consuming everything from the master node. HTH Dr Mich Talebzadeh LinkedIn *

pyspark aggregate vectors from onehotencoder

2016-07-04 Thread Sebastian Kuepers
hey, what is best practice to aggregate the vectors from onehotencoders in pyspark? udafs are still not available in python. is there any way to do it with spark sql? or do you have to switch to rdds and do it with a reduceByKey for example? thanks, sebastian

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jacek Laskowski
On Mon, Jul 4, 2016 at 8:36 PM, Mathieu Longtin wrote: > Are you using a --master argument, or equivalent config, when calling > spark-submit? > > If you don't, it runs in standalone mode. > s/standalone/local[*] Jacek

Re: Enforcing shuffle hash join

2016-07-04 Thread Lalitha MV
Hi maropu, Thanks for your reply. Would it be possible to write a rule for this, to make it always pick shuffle hash join over the other join implementations (i.e., sort merge and broadcast)? Is there any documentation demonstrating rule-based transformation for physical plan trees? Thanks, Lalitha

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
Are you using a --master argument, or equivalent config, when calling spark-submit? If you don't, it runs in standalone mode. On Mon, Jul 4, 2016 at 2:27 PM Jakub Stransky wrote: > Hi Mich, > > sure that workers are mentioned in slaves file. I can see them in spark >
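
For reference, the master can also be pinned in code; a small sketch using the standalone master URL that appears later in this thread (the values are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("DemoApp")
      .setMaster("spark://spark.master:7077")  // or pass --master to spark-submit,
                                               // or set spark.master in conf/spark-defaults.conf
    val sc = new SparkContext(conf)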

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
OK spark-submit by default starts its GUI at port :4040. You can change that using --conf "spark.ui.port=" or any other port. In GUI what do you see under Environment and Executors tabs. Can you send the snapshot? HTH Dr Mich Talebzadeh LinkedIn *

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hi Mich, sure, the workers are mentioned in the slaves file. I can see them in the Spark master UI, and even after the start they are "blocked" for this application, but the CPU and memory consumption is close to nothing. Thanks Jakub On 4 July 2016 at 18:36, Mich Talebzadeh

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
I've found a workaround. I set up an http server serving the jar, and pointed to the http url in spark-submit. Which brings me to ask: would it be a good option to allow spark-submit to upload a local jar to the master, which the master could then serve via an http interface? The master

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Silly question. Have you added your workers to the conf/slaves file, and have you started them with start-slaves.sh? On the master node, when you type jps, what do you see? The problem seems to be that the workers are ignored and Spark is essentially running in local mode. HTH Dr Mich Talebzadeh LinkedIn *

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Mathieu, there is no rocket science there. Essentially it creates a dataframe and then calls fit from the ML pipeline. The thing which I do not understand is how the parallelization is done in terms of the ML algorithm. Is it based on the parallelism of the dataframe? Because the ML algorithm doesn't offer such
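
A rough sketch of the kind of pipeline being described (Spark 1.6 ML API; the column names and the trainingDf DataFrame are placeholders). Broadly, the distributed work follows the partitioning of the input DataFrame:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.VectorAssembler
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
    val model = new Pipeline().setStages(Array(assembler, rf)).fit(trainingDf)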

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hi Mich, I have set up the Spark default configuration in the conf directory, spark-defaults.conf, where I specify the master, hence no need to put it on the command line: spark.master spark://spark.master:7077. The same applies to driver memory, which has been increased to 4GB, and the same is for

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
When the driver is running out of memory, it usually means you're loading data in a non-parallel way (without using an RDD). Make sure anything that requires a non-trivial amount of memory is done by an RDD. Also, the default memory for everything is 1GB, which may not be enough for you. On Mon, Jul

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
Try to figure out what the env vars and arguments of the worker JVM and Python process are. Maybe you'll get a clue. On Mon, Jul 4, 2016 at 11:42 AM Mathieu Longtin wrote: > I started with a download of 1.6.0. These days, we use a self compiled > 1.6.2. > > On Mon, Jul

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Hi Jakub, In standalone mode Spark does the resource management. Which version of Spark are you running? How do you define your SparkConf() parameters for example setMaster etc. From spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp SparkPOC.jar 10 4.3 I did not see any

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Thanks. I'll try that. Hopefully that should work. On Mon, Jul 4, 2016 at 9:12 PM, Mathieu Longtin wrote: > I started with a download of 1.6.0. These days, we use a self compiled > 1.6.2. > > On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav > wrote:

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
I started with a download of 1.6.0. These days, we use a self compiled 1.6.2. On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav wrote: > I am thinking of any possibilities as to why this could be happening. If > the cores are multi-threaded, should that affect the daemons?

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
I am thinking of any possibilities as to why this could be happening. If the cores are multi-threaded, should that affect the daemons? Was your Spark built from source or downloaded as a binary? Though that should not technically change anything. On Mon, Jul 4, 2016 at 9:03 PM, Mathieu

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
1.6.1. I have no idea. SPARK_WORKER_CORES should do the same. On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav wrote: > Which version of Spark are you using? 1.6.1? > > Any ideas as to why it is not working in ours? > > On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin

Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hello, I have a Spark cluster consisting of 4 nodes in standalone mode, a master + 3 worker nodes, with configured available memory and cpus etc. I have a Spark application which is essentially an MLlib pipeline for training a classifier, in this case RandomForest but it could be a DecisionTree

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Which version of Spark are you using? 1.6.1? Any ideas as to why it is not working in ours? On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin wrote: > 16. > > On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav > wrote: > >> Hi, >> >> I tried what you

How to handle update/deletion in Structured Streaming?

2016-07-04 Thread Arnaud Bailly
Hello, I am interested in using the new Structured Streaming feature of Spark SQL and am currently doing some experiments on code at HEAD. I would like to have a better understanding of how deletion should be handled in a structured streaming setting. Given some incremental query computing an

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
16. On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav wrote: > Hi, > > I tried what you suggested and started the slave using the following > command: > > start-slave.sh --cores 1 > > But it still seems to start as many pyspark daemons as the number of cores > in the node (1

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Hi, I tried what you suggested and started the slave using the following command: start-slave.sh --cores 1 But it still seems to start as many pyspark daemons as the number of cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh file by giving SPARK_WORKER_CORES=1 also

Re: Spark driver assigning splits to incorrect workers

2016-07-04 Thread Raajen Patel
Hi Ted, Perhaps this might help? Thanks for your response. I am trying to access/read binary files stored over a series of servers. Line used to build RDD: val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] = spark.newAPIHadoopFile("not.used", classOf[BIN_InputFormat], classOf[BIN_Key],
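
For illustration only, a hedged reconstruction of a complete call of that shape (the BIN_* classes are the poster's own, and spark here is the SparkContext as in the quoted line); the trailing arguments follow SparkContext.newAPIHadoopFile's signature:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.rdd.RDD
    val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] = spark.newAPIHadoopFile(
      "not.used",                // path placeholder; the custom InputFormat decides the splits
      classOf[BIN_InputFormat],
      classOf[BIN_Key],
      classOf[BIN_Value],
      new Configuration())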

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
It depends on what you want to do: If, on any given server, you don't want Spark to use more than one core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1 If you have a bunch of servers dedicated to Spark, but you don't want a driver to use more than one core per server,
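
A small sketch of the per-application variant discussed in this thread (standalone mode; the limit values are illustrative):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .setAppName("limited-cores app")
      .set("spark.executor.cores", "1")  // cores each executor may use on a worker
      .set("spark.cores.max", "4")       // total cores this application may take across the cluster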

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Hi Mathieu, Isn't that the same as setting "spark.executor.cores" to 1? And how can I specify "--cores=1" from the application? On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin wrote: > When running the executor, put --cores=1. We use this and I only see 2 > pyspark

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
When running the executor, put --cores=1. We use this and I only see 2 pyspark processes; one seems to be the parent of the other and is idle. In your case, are all pyspark processes working? On Mon, Jul 4, 2016 at 3:15 AM ar7 wrote: > Hi, > > I am currently using PySpark

Re: Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-04 Thread Pedro Rodriguez
Just realized I had been replying back to only Takeshi. Thanks for the tip as it got me on the right track. Running into an issue with private[spark] methods though. It looks like the input metrics start out as None and are not initialized (verified by throwing new Exception on pattern match

Specifying Fixed Duration (Spot Block) for AWS Spark EC2 Cluster

2016-07-04 Thread nsharkey
When I spin up an AWS Spark cluster per the Spark EC2 script: According to AWS: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#fixed-duration-spot-instances there is a way of reserving a fixed-duration Spot cluster through the AWS CLI and the web portal, but I can't find

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-04 Thread Lars Albertsson
I created such a setup for a client a few months ago. It is pretty straightforward, but it can take some work to get all the wires connected. I suggest that you start with the spotify/kafka (https://github.com/spotify/docker-kafka) Docker image, since it includes a bundled zookeeper. The

Re: java.io.FileNotFoundException

2016-07-04 Thread Jacek Laskowski
Can you share some stats from Web UI just before the failure? Any earlier errors before FNFE? Jacek On 4 Jul 2016 12:34 p.m., "kishore kumar" wrote: > @jacek: It is running on yarn-client mode, our code don't support running > in yarn-cluster mode and the job is running

Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
@jacek: It is running in yarn-client mode; our code doesn't support running in yarn-cluster mode, and the job runs for around an hour before giving the exception. @karhi: the yarn application status is successful, and the resourcemanager logs did not give any failure info except 16/07/04 00:27:57 INFO

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Hi Lohith, Thanks for the response. The S3 bucket does have access restrictions, but the instances in which the Spark master and workers run have an IAM role policy that allows them access to it. As such, we don't really configure the CLI with credentials... the IAM roles take care of that. Is

Re: java.io.FileNotFoundException

2016-07-04 Thread Jacek Laskowski
Hi, You seem to be using yarn. Is this cluster or client deploy mode? Have you seen any other exceptions before? How long did the application run before the exception? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Lohith Samaga M
Hi, The aws CLI already has your access key id and secret access key from when you initially configured it. Is your s3 bucket without any access restrictions? Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Sorry to do this... but... *bump* From: as...@live.com To: user@spark.apache.org Subject: Cluster mode deployment from jar in S3 Date: Fri, 1 Jul 2016 17:45:12 +0100 Hello, I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs using "--deploy-mode client", however using

Re: Graphframe Error

2016-07-04 Thread Felix Cheung
It looks like either the extracted Python code is corrupted or there is a Python version mismatch. Are you using Python 3? stackoverflow.com/questions/514371/whats-the-bad-magic-number-error On Mon, Jul 4, 2016 at

Re: java.io.FileNotFoundException

2016-07-04 Thread karthi keyan
kishore, Could you please post the application master logs? That will give us more details. Best, Karthik On Mon, Jul 4, 2016 at 2:27 PM, kishore kumar wrote: > We've upgraded spark version from 1.2 to 1.6 still the same problem, > > Exception in thread "main"

Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
We've upgraded the spark version from 1.2 to 1.6, still the same problem: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 286 in stage 2397.0 failed 4 times, most recent failure: Lost task 286.3 in stage 2397.0 (TID 314416, salve-06.domain.com):

Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
Hi Arun, The command bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6 will automatically load the required graphframes jar file from the Maven repository; it is not affected by the location where the jar file was placed. Your example works well on my laptop. Or you can try with

Limiting Pyspark.daemons

2016-07-04 Thread ar7
Hi, I am currently using PySpark 1.6.1 in my cluster. When a pyspark application is run, the load on the workers seems to go higher than what was allocated. When I ran top, I noticed that there were too many Pyspark.daemons processes running. There was another mail thread regarding the same:

ORC or parquet with Spark

2016-07-04 Thread Ashok Kumar
With Spark caching, which file format is better to use, Parquet or ORC? Obviously ORC can be used with Hive. My question is whether Spark can use the various file, stripe, and row-group statistics stored in an ORC file. Otherwise, to me both Parquet and ORC are simply files kept on HDFS. They do not offer any
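
On the statistics question, a small sketch (1.6-era API; the paths are placeholders, and reading ORC in 1.x requires a HiveContext): Spark can push filters down into both formats, with ORC pushdown gated behind a flag:

    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")      // off by default in 1.x
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    val orcDf = sqlContext.read.orc("/data/events_orc")
    val pqDf  = sqlContext.read.parquet("/data/events_parquet")
    orcDf.filter(orcDf("id") > 100).explain()  // check the plan for pushed filters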