Re: [SparkSQL] too many open files although ulimit set to 1048576

2017-03-13 Thread darin
I think your settings are not taking effect. Try adding `ulimit -n 10240` in spark-env.sh.

Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Hyukjin Kwon
Hi, all the options are documented at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter It seems we don't have both options for writing. If the goal is trimming the whitespace, I think we could do this within DataFrame operations (as we talked in the
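A minimal sketch of the "do it within DataFrame operations" suggestion, in Scala; `df`, the column names, and the output path are placeholders, not from the thread:

    import org.apache.spark.sql.functions.trim

    // Trim the string columns before writing, instead of relying on a writer option.
    val cleaned = df
      .withColumn("name", trim(df("name")))
      .withColumn("city", trim(df("city")))
    cleaned.write.option("header", "true").csv("/tmp/output")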

Re: Monitoring ongoing Spark Job when run in Yarn Cluster mode

2017-03-13 Thread Marcelo Vanzin
It's linked from the YARN RM's Web UI (see the "Application Master" link for the running application). On Mon, Mar 13, 2017 at 6:53 AM, Sourav Mazumder wrote: > Hi, > > Is there a way to monitor an ongoing Spark Job when running in Yarn Cluster > mode ? > > In my

Re: Monitoring ongoing Spark Job when run in Yarn Cluster mode

2017-03-13 Thread Nirav Patel
I think it would be on port 4040 by default on the node where the driver is running. You should be able to navigate to that via the Resource Manager's application master link, since in cluster mode both the AM and the driver run on the same node. On Mon, Mar 13, 2017 at 6:53 AM, Sourav Mazumder <

DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Nirav Patel
Hi, Is there a document for each datasource (csv, tsv, parquet, json, avro) with the available options? I need to find the csv options "ignoreLeadingWhiteSpace" and "ignoreTrailingWhiteSpace". Thanks
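A hedged sketch of the read-side options with these names (per the reply earlier in this digest, there were no equivalent writer options at the time); the path and header setting are placeholders:

    // Spark 2.x CSV reader options for stripping whitespace around field values.
    val df = spark.read
      .option("header", "true")
      .option("ignoreLeadingWhiteSpace", "true")
      .option("ignoreTrailingWhiteSpace", "true")
      .csv("/path/to/input.csv")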

Online learning of LDA model in Spark (update an existing model)

2017-03-13 Thread matd
Hi folks, I would like to train an LDA model in an online fashion, i.e. be able to update the resulting model with new documents as they become available. I understand that, under the hood, an online algo is implemented in OnlineLDAOptimizer, but I don't understand from the API how I can update an
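A minimal sketch of training with the online optimizer in spark.mllib (k, the mini-batch fraction, and the corpus are placeholders); whether the resulting model can then be updated incrementally through the public API is exactly the open question in this thread:

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // corpus: (document id, term-count vector) pairs, prepared elsewhere
    def trainOnline(corpus: RDD[(Long, Vector)]) = {
      val lda = new LDA()
        .setK(20)
        .setMaxIterations(50)
        .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
      lda.run(corpus)   // a LocalLDAModel when the online optimizer is used
    }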

Java Examples @ Spark github

2017-03-13 Thread Mina Aslani
Hi, When I go to github and check the Java examples at https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples, they do not look like they have been updated for the latest spark (e.g. spark 2.11). Do you know by any chance where I can find the Java examples for spark

Java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem

2017-03-13 Thread Mina Aslani
Hi, I get IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem at the line specified below: String master = "spark://:7077"; SparkConf sparkConf = new SparkConf()

Re: Structured Streaming - Can I start using it?

2017-03-13 Thread Michael Armbrust
I think it's very, very unlikely that it will get withdrawn. The primary reason that the APIs are still marked experimental is that we like to have several releases before committing to interface stability (in particular, the interfaces to write custom sources and sinks are likely to evolve). Also,

Structured Streaming - Can I start using it?

2017-03-13 Thread Gaurav1809
I read in the Spark documentation that Structured Streaming is still ALPHA in Spark 2.1 and the APIs are still experimental. Shall I use it to rewrite my existing Spark Streaming code? It looks like it is not yet production ready. What happens if the Structured Streaming project gets withdrawn?

Monitoring ongoing Spark Job when run in Yarn Cluster mode

2017-03-13 Thread Sourav Mazumder
Hi, Is there a way to monitor an ongoing Spark job when running in Yarn Cluster mode? In my understanding, in Yarn Cluster mode the Spark monitoring UI for the ongoing job would not be available on port 4040. So is there an alternative? Regards, Sourav

Re: Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-13 Thread Dhanesh Padmanabhan
Also, it looks like you need to scale down the regularization for Linear Regression by 1/2n, since the loss function is scaled by 1/2n (refer to the API documentation for Linear Regression). I was able to get close enough results after this modification. --spark-ml code-- val linearModel = new
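A sketch of the adjustment described above (my reading of it, not code from the thread): scale the penalty by 1/(2n) because Spark's squared-error loss is averaged over n and halved. `alpha` (the unscaled penalty) and `trainingDf` (a DataFrame with "label"/"features" columns) are assumed from the comparison script:

    import org.apache.spark.ml.regression.LinearRegression

    val n = trainingDf.count().toDouble
    val linearModel = new LinearRegression()
      .setElasticNetParam(0.0)          // pure L2 penalty
      .setRegParam(alpha / (2.0 * n))   // scale down by 1/(2n) to line up with an unscaled loss
      .fit(trainingDf)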

Re: Sorted partition ranges without overlap

2017-03-13 Thread Yong Zhang
You can implement your own partitioner based on your own logic. Yong From: Kristoffer Sjögren Sent: Monday, March 13, 2017 9:34 AM To: user Subject: Sorted partition ranges without overlap Hi I have a RDD that needs to be sorted
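A hypothetical sketch of "implement your own partitioner": fixed, pre-computed key boundaries give sequential, non-overlapping ranges (boundary choice and key types are illustrative):

    import org.apache.spark.Partitioner

    class FixedRangePartitioner(boundaries: Seq[String]) extends Partitioner {
      override def numPartitions: Int = boundaries.size + 1
      override def getPartition(key: Any): Int = {
        val k = key.toString
        val idx = boundaries.indexWhere(b => k <= b)   // first boundary the key falls under
        if (idx >= 0) idx else boundaries.size         // keys above all boundaries go last
      }
    }

    // With keys "1".."6" and boundaries Seq("2", "4"), partitions hold
    // {"1","2"}, {"3","4"}, {"5","6"} once the RDD is partitioned and sorted.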

Re: keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread Alex
Hi, I am using Spark 1.6. How do I ignore this warning? Because of this IllegalStateException, my production jobs, which are scheduled, are showing as completed abnormally... I can't even handle the exception, since after sc.stop(), if I try to execute any code again, this exception comes from the catch block.. so I
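A minimal sketch of one way to structure this (illustrative, not from the thread): keep sc.stop() as the very last thing the application does, inside finally, so no Spark call can run after the context is shut down:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("my-job"))
        try {
          // ... job logic: transformations, actions, saves ...
        } finally {
          sc.stop()   // last statement; nothing touches sc after this
        }
      }
    }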

Re: Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-13 Thread Dhanesh Padmanabhan
[Edit] I got a few details wrong in my eagerness to reply: 1. Spark uses the corrected standard deviation with sqrt(n-1), and scikit uses the one with sqrt(n). 2. You should scale down the regularization by the sum of weights, in case you have a column of weights. When there are no weights, it is
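A tiny illustration of the two conventions in point 1 (my sketch): the corrected sample standard deviation divides by n-1, the population one by n:

    def stddev(xs: Seq[Double], corrected: Boolean): Double = {
      val n = xs.size.toDouble
      val mean = xs.sum / n
      val sumSq = xs.map(x => math.pow(x - mean, 2)).sum
      math.sqrt(sumSq / (if (corrected) n - 1 else n))
    }

    // stddev(Seq(1.0, 2.0, 3.0), corrected = true)   ≈ 1.0    (n-1 denominator)
    // stddev(Seq(1.0, 2.0, 3.0), corrected = false)  ≈ 0.816  (n denominator)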

Sorted partition ranges without overlap

2017-03-13 Thread Kristoffer Sjögren
Hi, I have an RDD that needs to be sorted lexicographically and then processed by partition. The partitions should be split into ranged blocks where sorted order is maintained and each partition contains sequential, non-overlapping keys. Given keys (1,2,3,4,5,6) 1. Correct - 2
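One built-in route that matches this description (a sketch with toy keys; `sc` as in the spark-shell): RangePartitioner samples the keys into non-overlapping, ordered ranges, and repartitionAndSortWithinPartitions then sorts each partition internally:

    import org.apache.spark.RangePartitioner

    val pairs = sc.parallelize(Seq("3", "1", "5", "2", "6", "4")).map(k => (k, k))
    val partitioner = new RangePartitioner(2, pairs)
    val sortedByRange = pairs.repartitionAndSortWithinPartitions(partitioner)
    // each partition now holds a sequential, non-overlapping key range, sorted internally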

Re: keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread Yong Zhang
What version of Spark are you using? Based on SPARK-12967, it is fixed in Spark 2.0 and later. If you are using Spark 1.x, you can ignore this warning; it shouldn't affect any functionality. Yong From: nancy henry Sent: Monday, March

Re: org.apache.spark.SparkException: Task not serializable

2017-03-13 Thread Yong Zhang
In fact, I would suggest a different way to handle the original problem. The example listed originally comes with a Java Function that doesn't use any instance fields/methods, so serializing the whole class is an overkill solution. Instead, you can/should make the Function static, which will work in
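The "make it static" advice is Java-specific; a rough Scala analogue (my sketch, not from the thread) is to define the function on a standalone object, so the closure references only that object rather than a possibly non-serializable enclosing class:

    import org.apache.spark.rdd.RDD

    object Transforms {
      def addOne(x: Int): Int = x + 1
    }

    def run(rdd: RDD[Int]): RDD[Int] = rdd.map(Transforms.addOne)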

Re: Spark and continuous integration

2017-03-13 Thread Sam Elamin
Hi Jorn, Thanks for the prompt reply. Really we have 2 main concerns with CD: ensuring tests pass and linting the code. I think all platforms should handle this with ease; I was just wondering what people are using. Jenkins seems to have the best Spark plugins, so we are investigating that as

Re: Spark and continuous integration

2017-03-13 Thread Jörn Franke
Hi, Jenkins now also supports pipeline-as-code and multibranch pipelines, so you are not so dependent on the UI and you no longer need a long list of jobs for different branches. Additionally, it has a new UI (beta) called Blue Ocean, which is a little bit nicer. You may also check GoCD.

Adding metrics to spark datasource

2017-03-13 Thread AssafMendelson
Hi, I am building a data source so I can convert a custom source to a DataFrame. I have been going over examples such as JDBC and noticed that JDBC does the following: val inputMetrics = context.taskMetrics().inputMetrics and, whenever a new record is added: inputMetrics.incRecordsRead(1)

Re: Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-13 Thread Dhanesh Padmanabhan
Hi Frank Thanks for this question. I have been comparing logistic regression in sklearn with spark mllib as well. Your example code gave me a perfect way to compare what is going on in both the packages. I looked at both the source codes. There are quite a few differences in how the model

Re: how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
Can anyone help? > On 13 Mar 2017, at 19:38, jinhong lu wrote: > > After training the model, I got a result that looks like this: > > scala> predictionResult.show() > > |label|

Re: how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
After training the model, I got a result that looks like this: scala> predictionResult.show() +-----+--------+--------------+------------+----------+ |label|features| rawPrediction| probability|prediction|

keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread nancy henry
Hi Team, we are getting this error if we put sc.stop() in the application.. Can we remove it from the application? But I read that if you don't explicitly stop using sc.stop(), the YARN application will not get registered in the history service.. So what to do? WARN Dispatcher: Message RemoteProcessDisconnected

Spark and continuous integration

2017-03-13 Thread Sam Elamin
Hi Folks, This is more of a general question. What's everyone using for their CI/CD when it comes to Spark? We are using PySpark but potentially looking to move to Spark with Scala and sbt in the future. One of the suggestions was Jenkins, but I know the UI isn't great for new starters, so I'd rather

how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
Hi, all: I have this training data: 0 31607:17 0 111905:36 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1 0 407:14 2905:2 5209:2 6509:2 6909:2
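A sketch, assuming the file above is in standard libsvm format ("label index:value ...") and `model` is the already-trained spark.ml model from the later messages in this thread (its exact type is not shown); the path is a placeholder:

    val testData = spark.read.format("libsvm").load("/path/to/test.libsvm")
    // testData comes back with the two columns spark.ml expects: "label" and "features"
    val predictionResult = model.transform(testData)
    predictionResult.select("label", "probability", "prediction").show()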