RE: Spark streaming on spark-standalone/ yarn inside Spring XD

2015-09-17 Thread Vignesh Radhakrishnan
Okay, thanks anyway. Will keep looking into it and will report back on this forum if I come across any solution. From: Tathagata Das [mailto:t...@databricks.com] Sent: 17 September 2015 02:36 To: Vignesh Radhakrishnan Cc: user@spark.apache.org Subject: Re: Spark streaming on

RE: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-17 Thread Sun, Rui
The existing algorithms operating on R data.frame can't simply operate on SparkR DataFrame. They have to be re-implemented to be based on SparkR DataFrame API. -Original Message- From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com] Sent: Thursday, September 17, 2015 3:30 AM To:

Re: Spark Thrift Server JDBC Drivers

2015-09-17 Thread Daniel Haviv
Thank you! On Wed, Sep 16, 2015 at 10:29 PM, Dan LaBar wrote: > I'm running Spark in EMR, and using the JDBC driver provided by AWS. > Don't know if it will work outside of EMR, but

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Akhil Das
Any kind of change to the JVM classes will make it fail. By checkpointing the data, do you mean using checkpoint with updateStateByKey? Here's a similar discussion that happened earlier which will clear your doubts, I guess

Re: Support of other languages?

2015-09-17 Thread Rahul Palamuttam
Hi, Thank you for both responses. Sun, you pointed out the exact issue I was referring to, which is copying, serializing, and deserializing the byte-array between the JVM heap and the worker memory. It also isn't clear why the byte-array should be kept on-heap, since the data of the parent

Input parsing time

2015-09-17 Thread Carlos Eduardo Santos
Hi, I am only loading a JSON file and running one query. I would like to know how much time is spent on reading, decompressing (e.g. a bz2 file) and parsing the file before the query begins to execute. I have the impression that all processing time (parsing the input and running the query) is included in

Re: How to convert dataframe to a nested StructType schema

2015-09-17 Thread Hao Wang
Thanks, Terry. This is exactly what I need :) Hao On Tue, Sep 15, 2015 at 8:47 PM, Terry Hole wrote: > Hao, > > For spark 1.4.1, you can try this: > val rowrdd = df.rdd.map(r => Row(Row(r(3)), Row(r(0), r(1), r(2)))) > val newDF = sqlContext.createDataFrame(rowrdd,
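A minimal sketch of the approach Terry describes, for Spark 1.4.1; the nested field names and string types below are illustrative assumptions, and df / sqlContext are assumed to be in scope:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Target schema: column 3 nested under "key", columns 0-2 nested under "value"
val nestedSchema = StructType(Seq(
  StructField("key", StructType(Seq(
    StructField("d", StringType)))),
  StructField("value", StructType(Seq(
    StructField("a", StringType),
    StructField("b", StringType),
    StructField("c", StringType))))))

// Rebuild each row with nested Rows, then apply the new schema
val rowRdd = df.rdd.map(r => Row(Row(r(3)), Row(r(0), r(1), r(2))))
val newDF = sqlContext.createDataFrame(rowRdd, nestedSchema)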

Re: Spark Streaming application code change and stateful transformations

2015-09-17 Thread Adrian Tanase
This section in the streaming guide also outlines a new option – use 2 versions in parallel for a period of time, controlling the draining / transition in the application level. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code Also – I would not

Spark Web UI + NGINX

2015-09-17 Thread Renato Perini
Hello! I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI. I have 2 machines: 1) Machine A, with a public IP. This machine will be used to access Spark Web UI on the Machine B through its private IP address. 2) Machine B, where Spark is installed (standalone master

Re: Saprk.frame.Akkasize

2015-09-17 Thread Adrian Tanase
Have you reviewed this section of the guide? http://spark.apache.org/docs/latest/programming-guide.html#shared-variables If the dataset is static and you need a copy on all the nodes, you should look at broadcast variables. SQL specific, have you tried loading the dataset using the DataFrame
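For reference, a minimal sketch of the broadcast-variable suggestion (the lookup map and the data RDD are stand-ins):

// Small static dataset that every task needs read-only access to
val lookupTable = Map("a" -> 1, "b" -> 2)
val lookupBroadcast = sc.broadcast(lookupTable)

// Each executor reads its local copy of the broadcast value instead of
// having the map shipped with every task
val enriched = data.map(key => (key, lookupBroadcast.value.getOrElse(key, 0)))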

Re: How to speed up MLlib LDA?

2015-09-17 Thread Marko Asplund
Hi Feynman, I just tried that, but there wasn't a noticeable change in training performance. On the other hand model loading time was reduced to ~ 5 seconds from ~ 2 minutes (now persisted as LocalLDAModel). However, query / prediction time was unchanged. Unfortunately, this is the critical

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
In my understanding, I have only three options to keep the DStream state between redeploys (yes, I'm using updateStateByKey): 1. Use checkpointing. 2. Use my own database. 3. Use both. But none of these options is great: 1. Use checkpointing: I cannot load it after a code change. Or I need to keep

Re: Input parsing time

2015-09-17 Thread Adrian Tanase
You’re right – everything is captured under Executor Computing Time if it’s your app code. I know that some people have used custom builds of Spark that add more timers – they will show up nicely in the Spark UI. A more lightweight approach is to time it yourself via some counters /
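A lightweight sketch of that counter idea, assuming a parseJson function and the Spark 1.x accumulator API:

// Accumulate total time spent in the parse step, in nanoseconds
val parseNanos = sc.accumulator(0L, "json parse time (ns)")

val parsed = rawData.map { line =>
  val start = System.nanoTime()
  val record = parseJson(line)              // your existing parsing logic
  parseNanos += System.nanoTime() - start
  record
}
parsed.count()                              // force evaluation
println(s"Total parse time: ${parseNanos.value / 1e9} s")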

Saprk.frame.Akkasize

2015-09-17 Thread Angel Angel
Hi, I am running a deep learning algorithm on Spark. Example: https://github.com/deeplearning4j/dl4j-spark-ml-examples I am trying to run this example in local mode and it's working fine, but when I try to run it in cluster mode I get the following error. Loaded Mnist dataframe:

Re: spark performance - executor computing time

2015-09-17 Thread Adrian Tanase
Something similar happened to our job as well - spark streaming, YARN deployed on AWS. One of the jobs was consistently taking 10–15X longer on one machine. Same data volume, data partitioned really well, etc. Are you running on AWS or on prem? We were assuming that one of the VMs in Amazon

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Adrian Tanase
This section in the streaming guide makes your options pretty clear http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code 1. Use 2 versions in parallel, drain the queue up to a point and start fresh in the new version, only processing events from

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
Thanks Adrian, the hint of using updateStateByKey with an initialRdd helps a lot! Adrian Tanase wrote on Thursday, September 17, 2015 at 4:50 PM: > This section in the streaming guide makes your options pretty clear > >
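For the archive, a minimal sketch of updateStateByKey with an initial RDD; the DStream events and the per-key Long counter state are illustrative assumptions:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// State seeded from your own store before (re)starting the job
val initialState: RDD[(String, Long)] = ssc.sparkContext.parallelize(Seq(("key1", 10L)))

val updateFunc = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

// events is assumed to be a DStream[(String, Long)]
val counts = events.updateStateByKey[Long](
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialState)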

a document for JDK version testing status

2015-09-17 Thread luohui20001
Hi there, I remember there was a document showing which JDK versions are in use, and their testing status, in companies' Spark clusters worldwide. However, I couldn't find it on the Spark website or at Databricks. Does anyone still remember that document and wouldn't mind providing a link? Thanks.

Error with twitter streaming

2015-09-17 Thread Deepak Subhramanian
I am getting an error with Twitter streaming with Spark 1.4 and twitter4j 3.0.6. There is another thread which also pointed out the error. The error happened after the streaming job ran for more than 12 hours. Here is the error log. I will try using 3.0.3 as per the link below.

Re: [Spark Streaming] Distribute custom receivers evenly across excecutors

2015-09-17 Thread patrizio.munzi
Hi, did you manage to get it working? And do you know how this works on Spark 1.3? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Distribute-custom-receivers-evenly-across-excecutors-tp6671p24724.html Sent from the Apache Spark User List

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
@Adrian, I am doing collect for debugging purposes. But I have to use foreachRDD so that I can operate on top of this RDD and eventually save to a DB. My actual problem here is to properly convert Array[Byte] to my custom object. On Thu, Sep 17, 2015 at 7:04 PM, Adrian Tanase

Re: NGINX + Spark Web UI

2015-09-17 Thread Ruslan Dautkhanov
Similar setup for Hue http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ Might give you an idea. -- Ruslan Dautkhanov On Thu, Sep 17, 2015 at 9:50 AM, mjordan79 wrote: > Hello! > I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI. > I have 2

Checkpointing with Kinesis

2015-09-17 Thread Alan Dipert
Hello, We are using Spark Streaming 1.4.1 in AWS EMR to process records from Kinesis. Our Spark program saves RDDs to S3, after which the records are picked up by a Lambda function that loads them into Redshift. That no data is lost during processing is important to us. We have set our Kinesis

Can we do dataframe.query like Pandas dataframe in spark?

2015-09-17 Thread Rex X
With a Pandas dataframe, we can run a query:
>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
This SQL-select-like query

Re: How to calculate average from multiple values

2015-09-17 Thread diplomatic Guru
Hi Robin, You are a star! Thank you for the explanation and example. I converted your code into Java without any hassle. It is working as I expected. I carried out the final calculation (5th/6th) using mapValues and it is working nicely. But I was wondering whether there is a better way to do it other
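The usual alternative is to keep a (sum, count) pair per key and divide once in mapValues; a minimal Scala sketch, with illustrative input pairs:

import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

val averages = pairs
  .aggregateByKey((0.0, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),        // fold a value into (sum, count)
    (x, y) => (x._1 + y._1, x._2 + y._2))        // merge partial (sum, count) pairs
  .mapValues { case (sum, count) => sum / count }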

Re: Checkpointing with Kinesis

2015-09-17 Thread Aniket Bhatnagar
You can perhaps set up a WAL that logs to S3? The new cluster should pick up the records that weren't processed due to the previous cluster's termination. Thanks, Aniket On Thu, Sep 17, 2015, 9:19 PM Alan Dipert wrote: > Hello, > We are using Spark Streaming 1.4.1 in AWS EMR to process records

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
I guess what I'm asking is why not start with a byte array like in the example that works (using the DefaultDecoder), then map over it and do the decoding manually, like I'm suggesting below. Have you tried this approach? We have the same workflow (kafka => protobuf => custom class) and it

NGINX + Spark Web UI

2015-09-17 Thread mjordan79
Hello! I'm trying to set up a reverse proxy (using nginx) for the Spark Web UI. I have 2 machines: 1) Machine A, with a public IP. This machine will be used to access Spark Web UI on the Machine B through its private IP address. 2) Machine B, where Spark is installed (standalone master cluster, 1

Re: Spark Streaming application code change and stateful transformations

2015-09-17 Thread Cody Koeninger
The reason I'm dismissing the graceful shutdown approach is that if your app crashes, and can't be restarted without code changes (e.g. a bug needs to be fixed), you're screwed. On Thu, Sep 17, 2015 at 3:56 AM, Adrian Tanase wrote: > This section in the streaming guide also

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Why are you calling foreachRDD / collect in the first place? Instead of using a custom decoder, you should simply do the following – this is code executed on the workers and allows the computation to continue. foreachRDD and collect are output operations and force the data to be collected on the driver

Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-17 Thread Ali Hadian
Thanks, but as far as I know, checkpointing is specific to streaming RDDs and is not implemented in regular RDDs (just inherited from the superclass, but not implemented). How can I checkpoint the intermediate JavaRDDs?? -Original Message- From: Alexis Gillain

Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
I am using KafkaUtils.createDirectStream to read the data from the Kafka bus. On the producer end, I am generating the data in the following way: props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers) props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Saisai Shao
Is your "KafkaGenericEvent" serializable? Since you call rdd.collect() to fetch the data to the local driver, this KafkaGenericEvent needs to be serialized and deserialized through the Java or Kryo serializer (depending on your configuration); not sure if that is why you always get a default object.

How to add sparkSQL into a standalone application

2015-09-17 Thread Cui Lin
Hello, I got stuck adding Spark SQL to my standalone application. The build.sbt is defined as: libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" I got the following error when building the package: [error] /data/workspace/test/src/main/scala/TestMain.scala:6: object

Spark data type guesser UDAF

2015-09-17 Thread Ruslan Dautkhanov
Wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF aggregate function that returns a data type guess. Am I reinventing the wheel? Does Spark have something like this already built in? Would be very useful

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
@Saisai Shao, Thanks for the pointer. It turned out to be the serialization issue. I was using scalabuff to generate my "KafkaGenericEvent" class, but when I went through the generated class code, I figured out that it is not serializable. Now I am generating my classes using scalapb (

Re: Spark monitoring

2015-09-17 Thread Pratham Khanna
Thanks, that worked On Mon, Sep 14, 2015 at 4:54 PM, Akhil Das wrote: > You can write a script to hit the MasterURL:8080/json endpoint to > retrieve the information. It gives you a response like this: > > > { > "url" : "spark://akhldz:7077", > "workers" : [ { >
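A quick sketch of hitting that endpoint from Scala; the host and port are placeholders for your master URL:

// Pull the standalone master's status page as JSON
val json = scala.io.Source.fromURL("http://spark-master-host:8080/json").mkString
println(json)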

SPARK-SQL parameter tuning for performance

2015-09-17 Thread Sadhan Sood
Hi Spark users, We are running Spark on YARN and often query table partitions as big as 100–200 GB from HDFS. HDFS is co-located on the same cluster on which Spark and YARN run. I've noticed much higher I/O read rates when I increase the number of executor cores from 2 to 8 (most tasks run in

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" Though, I would consider using spark-hive and HiveContext, as the query parser is more powerful and you'll have access to window functions and other features. On Thu, Sep 17, 2015 at 10:59 AM, Cui Lin
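A build.sbt sketch of that suggestion, using the same versions as in the thread:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1",
  "org.apache.spark" %% "spark-hive" % "1.4.1"   // brings in spark-sql transitively
)

In the application, the HiveContext is then created with new org.apache.spark.sql.hive.HiveContext(sc).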

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
You don't need to set anything up, it'll create a local hive metastore by default if you don't explicitly configure one. On Thu, Sep 17, 2015 at 11:45 AM, Cui Lin wrote: > Hi, Michael, > > It works to me! Thanks a lot! > If I use spark-hive or HiveContext, do I have to

Re: Can we do dataframe.query like Pandas dataframe in spark?

2015-09-17 Thread Michael Armbrust
from pyspark.sql.functions import *

df = sqlContext.range(10).select(rand().alias("a"), rand().alias("b"))
df.where("a > b").show()

+------------------+-------------------+
|                 a|                  b|
+------------------+-------------------+
|0.6697439215581628|0.23420961030968923|

in joins, does one side stream?

2015-09-17 Thread Koert Kuipers
In Scalding we join with the smaller side on the left, since the smaller side gets buffered while the bigger side streams through the join. Looking at CoGroupedRDD, I do not get the impression such a distinction is made. It seems both sides are put into a map that can spill to disk. Is this

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Cui Lin
Hi, Michael, It works for me! Thanks a lot! If I use spark-hive or HiveContext, do I have to set up Hive on a server? Can I run this on my local laptop? On Thu, Sep 17, 2015 at 11:02 AM, Michael Armbrust wrote: > libraryDependencies += "org.apache.spark" %% "spark-sql" %

Has anyone used the Twitter API for location filtering?

2015-09-17 Thread Jo Sunad
I've been trying to filter for GeoLocation, Place or even Time Zone and I keep getting null values. I think I got one Place in 20 minutes of the app running (without any filters on tweets). Is this normal? Do I have to try querying rather than filtering? My code follows TD's example... val

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-17 Thread Alan Braithwaite
One other piece of information: We're using zookeeper for persistence and when we brought the dispatcher back online, it crashed on the same exception after loading the config from zookeeper. Cheers, - Alan On Thu, Sep 17, 2015 at 12:29 PM, Alan Braithwaite wrote: > Hey

Re: WAL on S3

2015-09-17 Thread Ted Yu
I assume you don't use Kinesis. Are you running Spark 1.5.0? If you must use S3, is switching to Kinesis possible? Cheers On Thu, Sep 17, 2015 at 1:09 PM, Michal Čizmazia wrote: > How to make Write Ahead Logs to work with S3? Any pointers welcome! > > It seems as a known

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Good catch! BTW, great choice with ScalaPB, we moved from scalabuff as well, in order to generate the classes at compile time from sbt. Sent from my iPhone On 17 Sep 2015, at 22:00, srungarapu vamsi wrote: @Saisai Shao, Thanks for

Stopping criteria for gradient descent

2015-09-17 Thread nishanthps
Hi, I am running LogisticRegressionWithSGD in Spark 1.4.1 and it always takes 100 iterations to train (which is the default). It never meets the convergence criteria. Shouldn't the convergence criteria for SGD be based on the difference in log loss, or the difference in accuracy on a held-out test set

Re: WAL on S3

2015-09-17 Thread Tathagata Das
Actually, the current WAL implementation (as of Spark 1.5) does not work with S3 because S3 does not support flushing. Basically, the current implementation assumes that after write + flush, the data is immediately durable, and readable if the system crashes without closing the WAL file. This does

KafkaDirectStream can't be recovered from checkpoint

2015-09-17 Thread Petr Novak
Hi all, it throws:
FileBasedWriteAheadLogReader: Error reading next item, EOF reached
java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.streaming.util.FileBasedWriteAheadLogReader.hasNext(FileBasedWriteAheadLogReader.scala:47)
WAL is not

Creating BlockMatrix with java API

2015-09-17 Thread Pulasthi Supun Wickramasinghe
Hi All, I am new to Spark and I am trying to do some BlockMatrix operations with the MLlib APIs, but I can't seem to create a BlockMatrix with the Java API. I tried the following: Matrix matrixa = Matrices.rand(4, 4, new Random(1000)); List,Matrix>> list = new
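For reference, a minimal sketch of the same construction in Scala, which shows the shape the blocks RDD needs, namely ((rowBlockIndex, colBlockIndex), Matrix) pairs:

import java.util.Random
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val blocks = sc.parallelize(Seq(
  ((0, 0), Matrices.rand(4, 4, new Random(1000))),
  ((0, 1), Matrices.rand(4, 4, new Random(1001)))))

val matrix = new BlockMatrix(blocks, 4, 4)   // 4 rows and 4 columns per block
matrix.validate()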

selected field not getting pushed down into my DataSource?

2015-09-17 Thread Timothy Potter
I'm using Spark 1.4.1 and am doing the following with spark-shell: solr = sqlContext.read.format("solr").option("zkhost", "localhost:2181").option("collection","spark").load() solr.select("id").count() The Solr DataSource implements PrunedFilteredScan so I expected the buildScan method to get

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-17 Thread Cody Koeninger
Is there a particular reason you're calling checkpoint on the stream in addition to the streaming context? On Thu, Sep 17, 2015 at 2:36 PM, Petr Novak wrote: > Hi all, > it throws FileBasedWriteAheadLogReader: Error reading next item, EOF > reached > java.io.EOFException >

WAL on S3

2015-09-17 Thread Michal Čizmazia
How can I make Write Ahead Logs work with S3? Any pointers welcome! It seems to be a known issue: https://issues.apache.org/jira/browse/SPARK-9215 I am getting this exception when reading the write ahead log: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-17 Thread Alan Braithwaite
Hey All, To bump this thread once again, I'm having some trouble using the dispatcher as well. I'm using the Mesos Cluster Manager with Docker executors. I've deployed the dispatcher as a Marathon job. When I submit a job using spark-submit, the dispatcher writes back that the submission was

Create view on nested JSON doesn't recognize column names

2015-09-17 Thread Dan LaBar
I’m trying to create a view on a nested JSON file (converted to a dict) using PySpark 1.4.1. The SQL looks like this:
create view myView as
select myColA, myStruct.ColB, myStruct.nestedColC
from myTbl
where myColD = "some value";
The select statement by itself runs fine, but when I try to create

Re: WAL on S3

2015-09-17 Thread Michal Čizmazia
Could you please explain how to use the pluggable WAL? After I implement the WriteAheadLog abstract class, how can I use it? I want to use it with a custom reliable receiver. I am using Spark 1.4.1. Thanks! On 17 September 2015 at 16:40, Tathagata Das wrote: > Actually, the

Re: WAL on S3

2015-09-17 Thread Tathagata Das
You could override the spark conf called "spark.streaming.receiver.writeAheadLog.class" with the class name. https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogUtils.scala#L30 On Thu, Sep 17, 2015 at 2:04 PM, Michal Čizmazia
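A sketch of wiring that in, assuming com.example.S3WriteAheadLog is the hypothetical WriteAheadLog subclass:

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  .set("spark.streaming.receiver.writeAheadLog.class", "com.example.S3WriteAheadLog")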

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-17 Thread Alan Braithwaite
Small update: I found the --properties-file spark-submit parameter by reading the code, and that seems to work, but it appears to be undocumented on the submitting-applications doc page. - Alan On Thu, Sep 17, 2015 at 12:39 PM, Alan Braithwaite wrote: > One other piece of

Spark streaming to database exception handling

2015-09-17 Thread david w
I am using Spark Streaming to receive data from Kafka, and then write the result RDD to an external database inside foreachPartition(). Everything works fine; my question is how we can avoid data loss if there is a database connection failure, or another exception happens while writing data to the external

Spark w/YARN Scheduling Questions...

2015-09-17 Thread Robert Saccone
Hello, We're running some experiments with Spark (v1.4) and have some questions about its scheduling behavior. I am hoping someone can answer the following questions. What is a task set? It is mentioned in the Spark logs we get from our runs but we can't seem to find a definition and how it

Re: Null Value in DecimalType column of DataFrame

2015-09-17 Thread Yin Huai
As I mentioned before, the range of values of DecimalType(10, 10) is [0, 1). If you have a value 10.5 and you want to cast it to DecimalType(10, 10), I do not think there is any better return value than null. It looks like DecimalType(10, 10) is not the right type for your use case. You need a
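A small illustration of the point, assuming a SQLContext in scope: DecimalType(10, 10) has 10 fractional digits and no integer digits, so any value of 1 or more nulls out on cast.

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DecimalType

sqlContext.range(0, 1).select(lit(10.5).cast(DecimalType(10, 10))).show()   // null
sqlContext.range(0, 1).select(lit(0.5).cast(DecimalType(10, 10))).show()    // 0.5000000000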

Re: Can we do dataframe.query like Pandas dataframe in spark?

2015-09-17 Thread Rex X
Very cool! Thank you, Michael. On Thu, Sep 17, 2015 at 11:00 AM, Michael Armbrust wrote: > from pyspark.sql.functions import * > > df = sqlContext.range(10).select(rand().alias("a"), rand().alias("b")) > > df.where("a > b").show() >

Cache after filter Vs Writing back to HDFS

2015-09-17 Thread Gavin Yue
For a large dataset, I want to filter out some records and then do the compute-intensive work. What I am doing now: Data.filter(somerules).cache() Data.count() Data.map(timeintensivecompute) But this sometimes takes an unusually long time due to cache misses and recalculation. So I changed to
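A rough sketch of the write-out-and-reload variant being described; the HDFS path is a placeholder, the records are assumed to be text lines, and somerules / timeintensivecompute are from the original post:

val filtered = Data.filter(somerules)
filtered.saveAsTextFile("hdfs:///tmp/filtered")      // materialize the filtered set once
val reloaded = sc.textFile("hdfs:///tmp/filtered")   // later jobs start from the saved copy
reloaded.map(timeintensivecompute).count()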

Distribute JMS receiver jobs on YARN

2015-09-17 Thread nibiau
Hello, I have a Spark application with a JMS receiver. Basically my application does: JavaDStream incoming_msg = customReceiverStream.map( new Function() { public String

Re: Spark streaming to database exception handling

2015-09-17 Thread Cody Koeninger
If you fail the task (throw an exception) it will be retried On Thu, Sep 17, 2015 at 4:56 PM, david w wrote: > I am using spark stream to receive data from kafka, and then write result > rdd > to external database inside foreachPartition(). All thing works fine, my > question

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
If I understand correctly, I guess you are suggesting that I do this: val kafkaDStream = KafkaUtils.createDirectStream[String,Array[Byte],StringDecoder,DefaultDecoder](ssc, kafkaConf, Set(topics)) kafkaDStream.map{ case(devId,byteArray)
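For the archive, a sketch of the approach being discussed, assuming a ScalaPB-generated KafkaGenericEvent with a parseFrom(Array[Byte]) method and ssc / kafkaConf / topics as in the snippet above:

import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils

// Read raw bytes with the stock decoders...
val kafkaDStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaConf, Set(topics))

// ...and deserialize on the workers, with no custom Kafka decoder needed
val events = kafkaDStream.map { case (devId, byteArray) =>
  (devId, KafkaGenericEvent.parseFrom(byteArray))
}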

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-17 Thread Luciano Resende
You can find some more info about SparkR at https://spark.apache.org/docs/latest/sparkr.html Looking at your sample app, with the provided content, you should be able to run it on SparkR with something like: #load SparkR with support for csv sparkR --packages com.databricks:spark-csv_2.10:1.0.3

Performance changes quite large

2015-09-17 Thread Gavin Yue
I am trying to parse quite a lot of large JSON files. At the beginning, I did it like this: textFile(path).map(parseJson(line)).count() For each file (800–900 MB), it would take roughly 1 min to finish. I then changed the code to val rawData = textFile(path) rawData.cache() rawData.count()

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-17 Thread Timothy Chen
Hi Alan, If I understand correctly, you are setting the executor home when you launch the dispatcher and not in the configuration when you submit the job, and expect it to inherit that configuration? When I worked on the dispatcher I was assuming all configuration is passed to the dispatcher to

DecisionTree hangs, then crashes

2015-09-17 Thread jluan
See my Stack Overflow question for better-formatted info: http://stackoverflow.com/questions/32621267/spark-1-5-0-hangs-running-randomforest I am trying to run a basic decision tree from MLlib. My Spark

Re: Spark w/YARN Scheduling Questions...

2015-09-17 Thread Saisai Shao
A task set is a set of tasks within one stage. An executor will be killed when it is idle for a period of time (the default is 60s). The problem you mentioned is a bug; the scheduler should not allocate tasks to these to-be-killed executors. I think it is fixed in 1.5. Thanks Saisai On Thu, Sep 17, 2015 at

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-17 Thread Hemant Bhanawat
Having the driver time out laggards seems like a reasonable way of handling them. Are there any challenges because of which the driver does not do it today? Is there a JIRA for this? I couldn't find one. On Tue, Sep 15, 2015 at 12:07 PM, Akhil Das wrote: > As of now i

Re: Best way to merge final output part files created by Spark job

2015-09-17 Thread MEETHU MATHEW
Try coalesce(1) before writing. Thanks & Regards, Meethu M On Tuesday, 15 September 2015 6:49 AM, java8964 wrote:
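A minimal sketch of that suggestion; rdd and the output path are placeholders:

// Collapse to a single partition so the job writes one part file
rdd.coalesce(1).saveAsTextFile("hdfs:///output/merged")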

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-17 Thread Alan Braithwaite
Hi Tim, Thanks for the follow-up. It's not so much that I expect the executor to inherit the configuration of the dispatcher as I *don't* expect the dispatcher to make assumptions about the system environment of the executor (since it lives in a docker). I could potentially see a case where you