Spark runs only on Mesos v0.21?

2016-02-12 Thread Petr Novak
Hi all, based on the documentation: "Spark 1.6.0 is designed for use with Mesos 0.21.0 and does not require any special patches of Mesos." We are considering Mesos for our use, but this concerns me a lot. Mesos is currently on v0.27, which we need for its Volumes feature. But Spark locks us to 0.21

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Arkadiusz Bicz
Hi, You need good monitoring tools to send you alarms about disk, network or application errors, but I think that is general DevOps work, not very specific to Spark or Hadoop. BR, Arkadiusz Bicz https://www.linkedin.com/in/arkadiuszbicz On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson

Re: Spark runs only on Mesos v0.21?

2016-02-12 Thread Tamas Szuromi
Hello Petr, We're running Spark 1.5.2 and 1.6.0 on Mesos 0.25.0 without any problem. We upgraded from 0.21.0 originally. cheers, Tamas On 12 February 2016 at 09:31, Petr Novak wrote: > Hi all, > based on the documentation: > > "Spark 1.6.0 is designed for use with Mesos

Re: How to parallel read files in a directory

2016-02-12 Thread Arkadiusz Bicz
Hi Junjie, From my experience HDFS is slow at reading large numbers of small files, as every file comes with a lot of metadata from the namenode and data nodes. When the file size is below the HDFS default block size (usually 64MB or 128MB), you cannot fully use Hadoop's optimizations to read in a streamed way
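One common mitigation, sketched below (not necessarily what this thread settles on): read the directory with wholeTextFiles so that many small files are packed into a handful of partitions. The path and partition count are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallFilesRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-files-read"))

    // wholeTextFiles yields (path, content) pairs and packs many small files
    // into few partitions instead of scheduling one task per tiny file.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/*", minPartitions = 32)

    val totalLines = files.map { case (_, content) => content.count(_ == '\n').toLong }.sum()
    println(s"files: ${files.count()}, lines: $totalLines")

    sc.stop()
  }
}
```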

Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
Sure. I ran the same job with fewer columns, the exception: java.lang.IllegalArgumentException: requirement failed: DataFrame must have the same schema as the relation to which is inserted. DataFrame schema: StructType(StructField(pixel0,ByteType,true), StructField(pixel1,ByteType,true),

Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
Hi, thanks for the answers. If joining the DataFrames is the solution, then why does the simple withColumn() succeed for some datasets and fail for others? 2016-02-11 11:53 GMT+01:00 Michał Zieliński : > I think a good idea would be to do a join: > > outputDF =
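A rough sketch of the join-based alternative from the quoted reply; the key column "id" and the two registered tables are assumptions for illustration, not from the original thread.

```scala
import org.apache.spark.sql.SQLContext

// Assumes inputDF carries the original columns and newColDF carries the
// extra column, both keyed by a shared unique "id" column.
def addColumnByJoin(sqlContext: SQLContext) = {
  val inputDF = sqlContext.table("input")        // hypothetical registered table
  val newColDF = sqlContext.table("new_column")  // hypothetical registered table

  // Single-column equi-join; every matching row gains the new column.
  inputDF.join(newColDF, "id")
}
```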

Using SPARK packages in Spark Cluster

2016-02-12 Thread Gourav Sengupta
Hi, I am creating a SparkContext in a Spark standalone cluster as mentioned here: http://spark.apache.org/docs/latest/spark-standalone.html using the following code: -- sc.stop()

Connection via JDBC to Oracle hangs after count call

2016-02-12 Thread Mich Talebzadeh
Hi, I use the following to connect to Oracle DB from Spark shell 1.5.2 spark-shell --master spark://50.140.197.217:7077 --driver-class-path /home/hduser/jars/ojdbc6.jar in Scala I do scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext:

Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Igor Berman
String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/*json" Not sure, but can you try to remove s3-us-west-1.amazonaws.com from the path? On 11 February 2016 at 23:15, Andy Davidson wrote: > I am

Re: How to parallel read files in a directory

2016-02-12 Thread Jörn Franke
Put many small files in Hadoop Archives (HAR) to improve performance of reading small files. Alternatively have a batch job concatenating them. > On 11 Feb 2016, at 18:33, Junjie Qian wrote: > > Hi all, > > I am working with Spark 1.6, scala and have a big dataset
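A hedged sketch of the batch-compaction alternative mentioned above; the input and output paths and the target file count are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Periodic compaction job: read the small files and rewrite them as a few
// large files so downstream jobs open far fewer HDFS blocks.
object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("compact-small-files"))

    sc.textFile("hdfs:///data/incoming/*")
      .coalesce(8)                                        // target roughly 8 output files
      .saveAsTextFile("hdfs:///data/compacted/2016-02-12")

    sc.stop()
  }
}
```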

Re: off-heap certain operations

2016-02-12 Thread Ted Yu
Ovidiu-Cristian: Please see the following JIRA / PR : [SPARK-12251] Document and improve off-heap memory configurations Cheers On Thu, Feb 11, 2016 at 11:06 PM, Sea <261810...@qq.com> wrote: > spark.memory.offHeap.enabled (default is false) , it is wrong in spark > docs. Spark1.6 do not

Re: Inserting column to DataFrame

2016-02-12 Thread Ted Yu
Seems like a bug. Suggest filing an issue with code snippet if this can be reproduced on 1.6 branch. Cheers On Fri, Feb 12, 2016 at 4:25 AM, Zsolt Tóth wrote: > Sure. I ran the same job with fewer columns, the exception: > > java.lang.IllegalArgumentException:

Re: Convert Iterable to RDD

2016-02-12 Thread seb.arzt
I have an Iterator of several million elements, which unfortunately won't fit into the driver memory at the same time. I would like to save them as an object file in HDFS. Doing so, I am running out of memory on the driver. Using a stream also won't work. I cannot further increase the driver

[SparkML] RandomForestModel save on disk.

2016-02-12 Thread Eugene Morozov
Hello, I'm building a simple web service that works with Spark and allows users to train a random forest model (MLlib API) and use it for prediction. Trained models are stored on the local file system (the web service and Spark with just one worker run on the same machine). I'm concerned about
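For reference, the MLlib save/load calls the post relies on look roughly like this; the path is a placeholder and could equally be an HDFS location.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel

// Persist a trained model to disk and load it back for prediction.
def persistAndReload(sc: SparkContext, model: RandomForestModel): RandomForestModel = {
  val path = "file:///models/random-forest-v1"   // placeholder path
  model.save(sc, path)
  RandomForestModel.load(sc, path)
}
```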

Re: Convert Iterable to RDD

2016-02-12 Thread Jerry Lam
Not sure if I understand your problem well but why don't you create the file locally and then upload to hdfs? Sent from my iPhone > On 12 Feb, 2016, at 9:09 am, "seb.arzt" wrote: > > I have an Iterator of several million elements, which unfortunately won't fit > into the
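A minimal sketch of that suggestion: stream the iterator to a local file (one element in memory at a time), then copy the finished file into HDFS. Paths are placeholders.

```scala
import java.io.{BufferedWriter, FileWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def spillIteratorToHdfs(it: Iterator[String]): Unit = {
  // Write to local disk without materializing the whole iterator.
  val localFile = "/tmp/elements.txt"
  val writer = new BufferedWriter(new FileWriter(localFile))
  try it.foreach { e => writer.write(e); writer.newLine() }
  finally writer.close()

  // Then push the finished file into HDFS.
  val target = new Path("hdfs:///user/seb/elements.txt")
  val fs = target.getFileSystem(new Configuration())
  fs.copyFromLocalFile(new Path(localFile), target)
}
```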

Python3 does not have Module 'UserString'

2016-02-12 Thread Sisyphuss
When trying the `reduceByKey` transformation on Python3.4, I got the following error: ImportError: No module named 'UserString' -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python3-does-not-have-Module-UserString-tp26212.html Sent from the Apache Spark

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-12 Thread p pathiyil
Thanks Sebastian. I was indeed trying out FAIR scheduling with a high value for concurrentJobs today. It does improve the latency seen by the non-hot partitions, even if it does not provide complete isolation. So it might be an acceptable middle ground. On 12 Feb 2016 12:18, "Sebastian Piu"

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Arkadiusz Bicz
Hi Andy, I suggest monitoring disk usage and, when it reaches 90% occupancy, sending an alarm to your support team to solve the problem; you should not allow your production system to go down. Regarding tools, you can try a set of software such as collectd and Spark -> Graphite -> Grafana ->

Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Ted Yu
See this thread for discussion on related subject: http://search-hadoop.com/m/q3RTtjkIOr1gHqFb1/dropping+spark+python+2.6=+discuss+dropping+Python+2+6+support especially comments from Juliet. On Fri, Feb 12, 2016 at 9:01 AM, Zheng Wendell wrote: > I think this may be

Seperate Log4j.xml for Spark and Application JAR ( Application vs Spark )

2016-02-12 Thread Ashish Soni
Hi All, As per my best understanding we can have only one log4j configuration for both Spark and the application, as whichever comes first in the classpath takes precedence. Is there any way we can keep one in the application and one in the Spark conf folder? Is it possible? Thanks

Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Zheng Wendell
Sorry, I can no longer reproduce the error. After upgrading Python3.4.2 to Python 3.4.4, the error disappears. Spark release: spark-1.6.0-bin-hadoop2.6 code snippet: ``` lines = sc.parallelize([5,6,2,8,5,2,4,9,2,1,7,3,4,1,5,8,7,6]) pairs = lines.map(lambda x: (x, 1)) counts =

spark-submit: remote protocol vs --py-files

2016-02-12 Thread Jeff Henrikson
Spark users, I am testing different cluster spinup and batch submission jobs. Using the sequenceiq/spark docker package, I have succeeded in submitting "fat egg" (analogous to "fat jar") style python code remotely over YARN. spark-submit --py-files is able to transmit the packaged code to

Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Zheng Wendell
I think this may be also due to the fact that I have multiple copies of Python. My driver program was using Python3.4.2 My local slave nodes are using Python3.4.4 (System administrator's version) On Fri, Feb 12, 2016 at 5:51 PM, Zheng Wendell wrote: > Sorry, I can no

Re: Python3 does not have Module 'UserString'

2016-02-12 Thread Ted Yu
Can you give a bit more information? The release of Spark you use, the full error trace, and your code snippet. Thanks On Fri, Feb 12, 2016 at 7:22 AM, Sisyphuss wrote: > When trying the `reduceByKey` transformation on Python3.4, I got the > following error: > > ImportError: No

spark slave IP

2016-02-12 Thread Christopher Bourez
Dears, is there a way to bind a slave to the public IP (instead of the private IP) 16/02/12 14:54:03 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160212135403-0009/0 on hostPort 172.31.19.203:39841 with 2 cores, 1024.0 MB RAM thanks, C

Re: Spark Submit

2016-02-12 Thread Ashish Soni
it works as below spark-submit --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml" --conf spark.executor.memory=512m Thanks all for the quick help. On Fri, Feb 12, 2016 at 10:59 AM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Try > spark-submit --conf

Spark Submit

2016-02-12 Thread Ashish Soni
Hi All, How do I pass multiple configuration parameters with spark-submit? Please help, I am trying as below spark-submit --conf "spark.executor.memory=512m spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml" Thanks,

Re: Spark Submit

2016-02-12 Thread Ted Yu
Have you tried specifying multiple '--conf key=value' ? Cheers On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni wrote: > Hi All , > > How do i pass multiple configuration parameter while spark submit > > Please help i am trying as below > > spark-submit --conf

spark-shell throws JDBC error after load

2016-02-12 Thread Mich Talebzadeh
I have resolved the hanging issue below by using yarn-client as follows spark-shell --master yarn --deploy-mode client --driver-class-path /home/hduser/jars/ojdbc6.jar val channels = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
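For readers following along, a fuller sketch of that JDBC read with the standard Spark JDBC options filled in; the table name, credentials, and driver class below are placeholders rather than values from the original message.

```scala
val channels = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@rhes564:1521:mydb",
  "dbtable"  -> "SCOTT.CHANNELS",              // placeholder table
  "user"     -> "scott",                       // placeholder credentials
  "password" -> "tiger",
  "driver"   -> "oracle.jdbc.OracleDriver"     // assumed Oracle thin driver class
)).load()

channels.count()
```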

Re: best practices? spark streaming writing output detecting disk full error

2016-02-12 Thread Andy Davidson
Hi Arkadiusz Do you have any suggestions? As an engineer I think when I get disk full errors I want the application to terminate. It's a lot easier for ops to realize there is a problem. Andy From: Arkadiusz Bicz Date: Friday, February 12, 2016 at 1:57 AM To:

Re: Spark Submit

2016-02-12 Thread Jacek Laskowski
Or simply multiple -c. Jacek 12.02.2016 4:54 PM "Ted Yu" wrote: > Have you tried specifying multiple '--conf key=value' ? > > Cheers > > On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni > wrote: > >> Hi All , >> >> How do i pass multiple

Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Andy Davidson
Hi Igor So I assume you are able to use S3 from Spark? Do you use rdd.saveAsTextFile()? How did you create your cluster? I.e. did you use the spark-1.6.0/spark-ec2 script, EMR, or something else? I tried several versions of the URL; no luck :-( The bucket name is 'com.ps.twitter'.
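A hedged sketch of one way to write to S3 over s3n, with credentials set on the Hadoop configuration; the bucket, keys, and paths are placeholders, and the bucket name goes straight after s3n:// without a region endpoint.

```scala
// Placeholders throughout; run in spark-shell where sc already exists.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val tweets = sc.textFile("hdfs:///data/tweets")      // hypothetical source
tweets.saveAsTextFile("s3n://my-bucket/twitter/json")
```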

Re: Spark Submit

2016-02-12 Thread Diwakar Dhanuskodi
Try  spark-submit  --conf "spark.executor.memory=512m" --conf "spark.executor.extraJavaOptions=x" --conf "Dlog4j.configuration=log4j.xml" Sent from Samsung Mobile. Original message From: Ted Yu Date:12/02/2016 21:24 (GMT+05:30) To: Ashish Soni

Re: [SparkML] RandomForestModel save on disk.

2016-02-12 Thread Eugene Morozov
Here is the exception I discover. java.lang.RuntimeException: error reading Scala signature of org.apache.spark.mllib.tree.model.DecisionTreeModel: scala.reflect.internal.Symbols$PackageClassSymbol cannot be cast to scala.reflect.internal.Constants$Constant at

coalesce and executor memory

2016-02-12 Thread Christopher Brady
Can anyone help me understand why using coalesce causes my executors to crash with out of memory? What happens during coalesce that increases memory usage so much? If I do: hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile everything works fine, but if I do: hadoopFile -> sample

SSE in s3

2016-02-12 Thread Lin, Hao
Hi, Can we configure Spark to enable SSE (Server Side Encryption) for saving files to S3? Much appreciated! Thanks

Re: off-heap certain operations

2016-02-12 Thread Ovidiu-Cristian MARCU
I found nothing about which operations these are. It is still not clear; "certain operations" is poor documentation. Can someone give an answer so I can consider using this new release? spark.memory.offHeap.enabled: If true, Spark will attempt to use off-heap memory for certain operations. > On 12 Feb 2016, at 13:21,

Allowing parallelism in spark local mode

2016-02-12 Thread yael aharon
Hello, I have an application that receives requests over HTTP and uses spark in local mode to process the requests. Each request is running in its own thread. It seems that spark is queueing the jobs, processing them one at a time. When 2 requests arrive simultaneously, the processing time for

Re: Using SPARK packages in Spark Cluster

2016-02-12 Thread Burak Yavuz
Hello Gourav, The packages need to be loaded BEFORE you start the JVM, therefore you won't be able to add packages dynamically in code. You should use --packages with pyspark before you start your application. One option is to add a `conf` that will load some packages if you are constantly

RE: Question on Spark architecture and DAG

2016-02-12 Thread Mich Talebzadeh
Thanks Andy much appreciated Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr V8Pw

Re: Allowing parallelism in spark local mode

2016-02-12 Thread Chris Fregly
sounds like the first job is occupying all resources. you should limit the resources that a single job can acquire. fair scheduler is one way to do that. a possibly simpler way is to configure spark.deploy.defaultCores or spark.cores.max. the defaults for these values - for the Spark default

GroupedDataset flatMapGroups with sorting (aka secondary sort redux)

2016-02-12 Thread Koert Kuipers
is there a way to leverage the shuffle in Dataset/GroupedDataset so that the Iterator[V] in flatMapGroups has a well-defined ordering? it is hard for me to see many good use cases for flatMapGroups and mapGroups if you do not have sorting. since spark has a sort-based shuffle, not exposing this would be
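A workaround sketch (spark-shell style, Spark 1.6 API): sort each group's values in memory inside flatMapGroups. This gives a defined ordering, but each group must fit in memory, which is the very limitation the post is about. The Event type and its fields are hypothetical.

```scala
import sqlContext.implicits._

case class Event(key: String, ts: Long, payload: String)

val events = Seq(
  Event("a", 2L, "second"), Event("a", 1L, "first"), Event("b", 3L, "only")
).toDS()

// Buffer and sort each group before emitting its values.
val ordered = events
  .groupBy(_.key)
  .flatMapGroups { (key, it) =>
    it.toSeq.sortBy(_.ts).map(e => (key, e.payload))
  }

ordered.show()
```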

Dataset takes more memory compared to RDD

2016-02-12 Thread Raghava Mutharaju
Hello All, I implemented an algorithm using both the RDD and the Dataset API (in Spark 1.6). The Dataset version takes a lot more memory than the RDDs. Is this normal? Even for very small input data, it is running out of memory and I get a Java heap exception. I tried the Kryo serializer by

_metada file throwing an "GC overhead limit exceeded" after a write

2016-02-12 Thread Maurin Lenglart
Hi, I am currently using spark in python. I have my master, worker and driver on the same machine in different dockers. I am using spark 1.6. The configuration that I am using look like this : CONFIG["spark.executor.memory"] = "100g" CONFIG["spark.executor.cores"] = "11"

pyspark.DataFrame.dropDuplicates

2016-02-12 Thread James Barney
Hi all, Just wondering what the actual logic governing DataFrame.dropDuplicates() is? For example: >>> from pyspark.sql import Row >>> df = sc.parallelize([ \ Row(name='Alice', age=5, height=80, itemsInPocket=['pen', 'pencil', 'paper']), \ Row(name='Alice', age=5, height=80, itemsInPocket=['pen',
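Roughly, dropDuplicates() compares whole rows, while dropDuplicates on a subset keeps one (arbitrary) row per distinct combination of the named columns. A spark-shell style sketch using the Scala API with simplified columns:

```scala
val people = sqlContext.createDataFrame(Seq(
  ("Alice", 5, 80),
  ("Alice", 5, 80),    // exact duplicate of the first row
  ("Alice", 10, 80)
)).toDF("name", "age", "height")

people.dropDuplicates().show()                       // drops only the exact duplicate
people.dropDuplicates(Seq("name", "height")).show()  // one row per (name, height) pair
```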

Re: Allowing parallelism in spark local mode

2016-02-12 Thread Silvio Fiorito
You’ll want to setup the FAIR scheduler as described here: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application From: yael aharon > Date: Friday, February 12, 2016 at 2:00 PM To:
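A minimal sketch of that setup for local mode: enable the FAIR scheduler on the shared SparkContext and submit each HTTP request's job from its own thread. The pool names, core count, and toy job are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("http-backed-spark")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

def handleRequest(requestId: Int): Unit = {
  // Optionally isolate each request in its own fair-scheduler pool.
  sc.setLocalProperty("spark.scheduler.pool", s"request-$requestId")
  val result = sc.parallelize(1 to 1000000).map(_ * 2L).sum()
  println(s"request $requestId -> $result")
}

// Two simultaneous requests, each driving Spark from its own thread.
val threads = (1 to 2).map { i =>
  new Thread(new Runnable { override def run(): Unit = handleRequest(i) })
}
threads.foreach(_.start())
threads.foreach(_.join())
```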

Re: coalesce and executor memory

2016-02-12 Thread Silvio Fiorito
Coalesce essentially reduces parallelism, so fewer cores are getting more records. Be aware that it could also lead to loss of data locality, depending on how far you reduce. Depending on what you’re doing in the map operation, it could lead to OOM errors. Can you give more details as to what

Re: coalesce and executor memory

2016-02-12 Thread Koert Kuipers
in spark, every partition needs to fit in the memory available to the core processing it. as you coalesce you reduce number of partitions, increasing partition size. at some point the partition no longer fits in memory. On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
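To make that concrete, a small sketch contrasting coalesce with repartition; the partition counts and path are arbitrary.

```scala
val big = sc.textFile("hdfs:///data/big")   // hypothetical input with many partitions

// coalesce: fewer, larger partitions, usually without a shuffle --
// each one must now fit in the memory available to its core.
val fewer = big.coalesce(10)

// repartition: full shuffle, but partitions stay smaller and more even.
val rebalanced = big.repartition(50)

println(s"coalesced:     ${fewer.partitions.length} partitions")
println(s"repartitioned: ${rebalanced.partitions.length} partitions")
```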

Re: Question on Spark architecture and DAG

2016-02-12 Thread Andy Davidson
From: Mich Talebzadeh Date: Thursday, February 11, 2016 at 2:30 PM To: "user @spark" Subject: Question on Spark architecture and DAG > Hi, > > I have used Hive on Spark engine and of course Hive tables and its pretty > impressive comparing Hive

Spark with DF throws No suitable driver found for jdbc:oracle: after first call

2016-02-12 Thread Mich Talebzadeh
First I put the Oracle JAR file in the spark-shell start-up and also in the CLASSPATH spark-shell --master yarn --deploy-mode client --driver-class-path /home/hduser/jars/ojdbc6.jar Now it shows clearly that the load call is successful, as shown in bold, so it can use the driver. However, the next

new to Spark - trying to get a basic example to run - could use some help

2016-02-12 Thread Taylor, Ronald C
Hello folks, This is my first msg to the list. New to Spark, and trying to run the SparkPi example shown in the Cloudera documentation. We have Cloudera 5.5.1 running on a small cluster at our lab, with Spark 1.5. My trial invocation is given below. The output that I get *says* that I

Re: coalesce and executor memory

2016-02-12 Thread Christopher Brady
Thank you for the responses. The map function just changes the format of the record slightly, so I don't think that would be the cause of the memory problem. So if I have 3 cores per executor, I need to be able to fit 3 partitions per executor within whatever I specify for the executor

org.apache.spark.sql.AnalysisException: undefined function lit;

2016-02-12 Thread Andy Davidson
I am trying to add a column with a constant value to my data frame. Any idea what I am doing wrong? Kind regards Andy DataFrame result = … String exprStr = "lit(" + time.milliseconds()+ ") as ms"; logger.warn("AEDWIP expr: {}", exprStr); result.selectExpr("*", exprStr).show(false);
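A possible workaround sketch (not an answer from this thread): lit() is a DataFrame API function rather than a SQL expression, so use it via withColumn, or put a plain numeric literal into the selectExpr string. The column name "ms" mirrors the snippet above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def addMillis(result: DataFrame, ms: Long): DataFrame = {
  // Option 1: add the constant column through the DataFrame API.
  val viaWithColumn = result.withColumn("ms", lit(ms))

  // Option 2: a plain literal inside selectExpr also works.
  val viaSelectExpr = result.selectExpr("*", s"$ms as ms")

  viaWithColumn   // either form yields the same extra constant column
}
```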

Re: Computing hamming distance over large data set

2016-02-12 Thread Charlie Hack
I ran across DIMSUM a while ago but never used it. https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html Annoy is wonderful if you want to make queries. If you want to do the "self similarity join" you might look at DIMSUM or preferably if at all

How to write Array[Byte] as JPG file in Spark?

2016-02-12 Thread Liangzhao Zeng
Hello All, I have an RDD[(id:String, image:Array[Byte])] and would like to write the image attribute as a JPG file into HDFS. Any suggestions? Cheers, LZ
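One possible approach (a sketch, not a confirmed answer from the list): write the raw bytes from the executors with the Hadoop FileSystem API, assuming the byte arrays are already JPEG-encoded. The output directory is a placeholder.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

def saveImages(images: RDD[(String, Array[Byte])]): Unit = {
  images.foreachPartition { part =>
    val conf = new Configuration()
    part.foreach { case (id, bytes) =>
      // One output file per record, named after the record id.
      val path = new Path(s"hdfs:///images/$id.jpg")
      val out = path.getFileSystem(conf).create(path)
      try out.write(bytes) finally out.close()
    }
  }
}
```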

Re: Computing hamming distance over large data set

2016-02-12 Thread Maciej Szymkiewicz
There is also this: https://github.com/soundcloud/cosine-lsh-join-spark On 02/11/2016 10:12 PM, Brian Morton wrote: > Karl, > > This is tremendously useful. Thanks very much for your insight. > > Brian > > On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley

Sharing temporary table

2016-02-12 Thread max.tenerowicz
This video suggests that registerTempTable can be used to share table between sessions. Is it Databricks platform specific feature or can do something like this in general? Best, Max -- View this message in context:

Dataset GroupedDataset.reduce

2016-02-12 Thread Koert Kuipers
i see that currently GroupedDataset.reduce simply calls flatMapGroups. does this mean that there is currently no partial aggregation for reduce?

Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-12 Thread pdesai
Hi there, I am doing a POC with Spark and I have noticed that if I run my job on a standalone Spark installation, it finishes in a second (it's a small sample job). But when I run the same job on a Spark cluster with YARN, it takes 4-5 min for a simple execution. Are there any best practices that I need to

support vector machine does not classify properly?

2016-02-12 Thread prem09
Hi, I created a dataset of 100 points, ranging from X=1.0 to X=100.0. I let the y variable be 0.0 if X < 51.0 and 1.0 otherwise. I then fit an SVMWithSGD. When I predict the y values for the same values of X as in the sample, I get back 1.0 for each predicted y! Incidentally, I don't get
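A debugging sketch along the same lines (not a fix from this thread): rebuild the 100-point dataset, train SVMWithSGD, and call clearThreshold() so predict() returns raw margins instead of 0/1 labels, which makes it easier to see what the model actually learned. The iteration count is arbitrary.

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Reconstruct the dataset described above: y = 0.0 for X < 51.0, else 1.0.
val data = sc.parallelize((1 to 100).map { i =>
  LabeledPoint(if (i < 51) 0.0 else 1.0, Vectors.dense(i.toDouble))
})

val model = SVMWithSGD.train(data, 200)   // 200 iterations, arbitrary

model.clearThreshold()   // predict() now returns raw scores rather than 0/1
data.collect().foreach { p =>
  println(f"x=${p.features(0)}%5.1f label=${p.label} score=${model.predict(p.features)}%.4f")
}
```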