Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-06 Thread Priya Ch
Running 'lsof' will let us know the open files, but how do we find the root cause behind opening too many files? Thanks, Padma CH On Wed, Jan 6, 2016 at 8:39 AM, Hamel Kothari wrote: > The "Too Many Files" part of the exception is just indicative of the fact >

RE: Out of memory issue

2016-01-06 Thread Ewan Leith
Hi Muthu, this could be related to a known issue in the release notes: http://spark.apache.org/releases/spark-release-1-6-0.html (Known issues: SPARK-12546 - Save DataFrame/table as Parquet with dynamic partitions may cause OOM); this can be worked around by decreasing the memory used by both

Spark DataFrame limit question

2016-01-06 Thread Arkadiusz Bicz
Hi, Does limit work for DataFrames, Spark SQL and HiveContext without a full scan of Parquet in Spark 1.6? I just used it to create a small parquet file from a large number of parquet files and found that it does a full scan of all the data instead of just reading the limited number: All of the below

RE: How to accelerate reading json file?

2016-01-06 Thread Ewan Leith
If you already know the schema, then you can run the read with the schema parameter like this: val path = "examples/src/main/resources/jsonfile" val jsonSchema = StructType( StructField("id",StringType,true) :: StructField("reference",LongType,true) ::
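For reference, a minimal sketch that completes the idea in the truncated snippet above; the path, field names and the sqlContext in scope are illustrative assumptions, not taken from the original message:

    import org.apache.spark.sql.types._

    // Hypothetical path and fields, in the style of the snippet above
    val path = "examples/src/main/resources/jsonfile"
    val jsonSchema = StructType(
      StructField("id", StringType, true) ::
      StructField("reference", LongType, true) :: Nil)

    // Supplying an explicit schema skips the schema-inference pass over the data,
    // which is what makes the read faster
    val df = sqlContext.read.schema(jsonSchema).json(path)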

How to insert df in HBASE

2016-01-06 Thread Sadaf
Hi, I need to insert a DataFrame into HBase using Scala code. Can anyone guide me how to achieve this? Any help would be much appreciated. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-df-in-HBASE-tp25891.html Sent from the

Re: How to concat few rows into a new column in dataframe

2016-01-06 Thread Sabarish Sasidharan
You can just repartition by the id, if the final objective is to have all data for the same key in the same partition. Regards Sab On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue wrote: > I found that in 1.6 dataframe could do repartition. > > Should I still need to do
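A one-line sketch of the suggestion above; the column name "id" is an assumption:

    // Spark 1.6: repartitioning by a column hash-partitions the DataFrame on it,
    // so all rows with the same id land in the same partition
    val repartitioned = df.repartition(df("id"))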

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2016-01-06 Thread Deenar Toraskar
Hi guys 1. >> Add this jar to the classpath of all NodeManagers in your cluster. A related question on configuration of the auxiliary shuffle service. *How do I find the classpath for NodeManager?* I tried finding all places where the existing mapreduce shuffle jars are present and place

spark 1.6 Issue

2016-01-06 Thread kali.tumm...@gmail.com
Hi All, I am running my app in IntelliJ IDEA (locally) with config local[*]. The code worked OK with Spark 1.5, but when I upgraded to 1.6 I am having the below issue. Is this a bug in 1.6? When I change back to 1.5 it works OK without any error. Do I need to pass executor memory while running in local

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Kostiantyn Kudriavtsev
Hi guys, the only big issue with this approach: > spark.hadoop.s3a.access.key is now visible everywhere, in logs, in the spark > webui, and is not secured at all... On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev wrote: > thanks Jerry, it works! > really

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
Hi Kostiantyn, Yes. If security is a concern then this approach cannot satisfy it. The keys are visible in the properties files. If the goal is to hide them, you might be able to go a bit further with this approach. Have you looked at the Spark security page? Best Regards, Jerry Sent from my iPhone

Re: sparkR ORC support.

2016-01-06 Thread Sandeep Khurana
Felix I tried the option suggested by you. It gave below error. I am going to try the option suggested by Prem . Error in writeJobj(con, object) : invalid jobj 1 8 stop("invalid jobj ", value$id) 7 writeJobj(con, object) 6 writeObject(con, a) 5 writeArgs(rc, args) 4 invokeJava(isStatic = TRUE,

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is HiveContext. sc <- sparkR.init() sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave below error. I am going to > try the option suggested by Prem .

Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Context: Process data coming from Kafka and send the results back to Kafka. Issue: Each event could take several seconds to process (work in progress to improve that). During that time, events (and RDDs) accumulate. Intermediate events (by key) do not have to be processed, only the last ones. So

Re: How to accelerate reading json file?

2016-01-06 Thread Vijay Gharge
Hi all, I want to ask how exactly reading a >1 TB file differs on a standalone cluster vs. a YARN or Mesos cluster? On Wednesday 6 January 2016, Gavin Yue wrote: > I am trying to read json files following the example: > > val path =

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
Have you read http://kafka.apache.org/documentation.html#compaction On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour wrote: > Context: Process data coming from Kafka and send back results to Kafka. > > Issue: Each events could take several seconds to process (Work in progress

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks for your answer. As I understand it, a consumer that stays caught up will read every message even with compaction, so for pure Kafka Spark Streaming it will not be a solution. Perhaps I could reconnect to the Kafka topic after each process to get the last state of events and then

Re: Spark on Apache Ignite?

2016-01-06 Thread Ravi Kora
We have been using Ignite on Spark for one of our use cases. We are using Ignite’s SharedRDD feature. The following links should get you started in that direction. We have been using it for the basic use case and it works fine so far. There is not a whole lot of documentation on spark-ignite integration

fp growth - clean up repetitions in input

2016-01-06 Thread matd
Hi folks, I'm interested in using FP-growth to identify sequence patterns. Unfortunately, my input sequences have cycles: ...1,2,4,1,2,5... and this is not supported by FP-growth (I get a SparkException: Items in a transaction must be unique but got WrappedArray). Do you know a way to identify
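A small sketch of the usual workaround, deduplicating each transaction before handing it to FPGrowth; note this throws away the repetition/cycle information, which is exactly the limitation being asked about. The minSupport and numPartitions values are illustrative:

    import org.apache.spark.mllib.fpm.FPGrowth

    // transactions: RDD[Array[String]] with possible repeats such as 1,2,4,1,2,5
    val deduped = transactions.map(_.distinct)

    val model = new FPGrowth()
      .setMinSupport(0.2)
      .setNumPartitions(10)
      .run(deduped)

For true sequential patterns, where repeated items are allowed, mllib's PrefixSpan (available since Spark 1.5) may be a better fit than FPGrowth.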

What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread unk1102
Hi, as part of the Spark 1.6 release, what should be the ideal value or unit for spark.memory.offheap.size? I have set it to 5000 and I assume that means 5 GB; is that correct? Please guide. -- View this message in context:

Re: [Spark-SQL] Custom aggregate function for GrouppedData

2016-01-06 Thread Michael Armbrust
In Spark 1.6 GroupedDataset has mapGroups, which sounds like what you are looking for. You can also write a custom Aggregator

Why is this job running since one hour?

2016-01-06 Thread unk1102
Hi, I have one main Spark job which spawns multiple child Spark jobs. One of the child Spark jobs has been running for an hour and keeps hanging there. I have taken a snapshot, please see -- View

Re: problem building spark on centos

2016-01-06 Thread Ted Yu
w.r.t. the second error, have you read this ? http://www.captaindebug.com/2013/03/mavens-non-resolvable-parent-pom-problem.html#.Vo1fFGSrSuo On Wed, Jan 6, 2016 at 9:49 AM, Jade Liu wrote: > I’m using 3.3.9. Thanks! > > Jade > > From: Ted Yu > Date:

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Andy Davidson
Hi Michael, I really appreciate your help. The following code works. Is there a way this example can be added to the distribution to make it easier for future Java programmers? It took me a long time to get to this simple solution. I'll need to tweak this example a little to work with the new

Re: How to insert df in HBASE

2016-01-06 Thread Ted Yu
Cycling prior discussion: http://search-hadoop.com/m/q3RTtX7POh17hqdj1 On Wed, Jan 6, 2016 at 3:07 AM, Sadaf wrote: > HI, > > I need to insert a Dataframe in to hbase using scala code. > Can anyone guide me how to achieve this? > > Any help would be much appreciated. >
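For completeness, a minimal sketch of writing a DataFrame to HBase through the HBase 1.0 client API; this is not taken from the linked discussion, and the table, column family, column and row-key names are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // df is the DataFrame to persist; "my_table" / "cf" / "col" are placeholder names
    df.foreachPartition { rows =>
      // Create the connection inside the partition so it is never serialized
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("my_table"))
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getAs[String]("rowkey")))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
          Bytes.toBytes(row.getAs[String]("value")))
        table.put(put)
      }
      table.close()
      conn.close()
    }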

Predictive Modelling in sparkR

2016-01-06 Thread Chandan Verma
Has anyone tried building a logistic regression model in SparkR? Is it recommended? Does it take longer to process than what can be done in plain R?

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
> > I really appreciate your help. The following code works. > Glad you got it to work! Is there a way this example can be added to the distribution to make it > easier for future Java programmers? It took me a long time to get to this > simple solution. > I'd welcome a pull request that added

Re: problem building spark on centos

2016-01-06 Thread Jade Liu
I’ve changed the Scala version to 2.10. With this command: build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package the build was successful. But when I make a runnable distribution: ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Wei Chen
Thank you. I have tried the window function as follows: import pyspark.sql.functions as f sqc = sqlContext from pyspark.sql import Window import pandas as pd DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3], 'b': [1,2,3,1,2,3,1,2,3], 'c': [1,2,3,4,5,6,7,8,9]

Re: spark 1.6 Issue

2016-01-06 Thread Mark Hamstra
It's not a bug, but a larger heap is required with the new UnifiedMemoryManager: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172 On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com < kali.tumm...@gmail.com> wrote: > Hi

Re: spark 1.6 Issue

2016-01-06 Thread Sri
Hi Mark, I made changes to the VM options in the edit-configuration section for the main method and the Scala test case class in IntelliJ, which worked OK when I executed them individually, but while running maven install to create the jar file the test case is failing. Can I add VM options in spark conf set in
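One workaround sometimes used for local-mode builds and tests, offered here only as an assumption based on how UnifiedMemoryManager reads its system-memory value (spark.testing.memory is an internal, undocumented setting): either pass a larger heap to the test JVM (e.g. -Xmx1g via the surefire/scalatest argLine) or override the check in the SparkConf before the context is created.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("my-test")                       // placeholder app name
      // internal/testing override for the system-memory check in 1.6;
      // otherwise raise -Xmx on the JVM that runs the tests
      .set("spark.testing.memory", (1024L * 1024 * 1024).toString)  // 1 GB

    val sc = new SparkContext(conf)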

When to use streaming state and when an external storage?

2016-01-06 Thread Rado Buranský
What are the pros/cons and the general idea behind state in Spark Streaming? By state I mean state created by "mapWithState" (or updateStateByKey). When to use it and when not? Is it a good idea to accumulate state in jobs running continuously for years? Example: Remember IP addresses of returning
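For context, a minimal mapWithState sketch (Spark 1.6 API) for the returning-visitor example; the key/value types, the 30-minute timeout and the ipEvents stream are illustrative assumptions:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // Keys are IP addresses, values are raw events, state is a visit count
    val trackVisitor = (ip: String, event: Option[String], state: State[Long]) => {
      val returning = state.exists()
      if (!state.isTimingOut()) {          // update() is illegal while a key is timing out
        state.update(state.getOption().getOrElse(0L) + 1)
      }
      (ip, returning)
    }

    val spec = StateSpec.function(trackVisitor).timeout(Minutes(30))
    // ipEvents is assumed to be a DStream[(String, String)] of (ip, event)
    val returningVisitors = ipEvents.mapWithState(spec)

The timeout bounds how long keys are kept, which is one answer to the "running for years" concern: without a timeout (or explicit state.remove()) the state only ever grows.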

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
If you don't have hot users, you can use the user id as the hash key for publishing into Kafka. That will put all events for a given user in the same partition per batch. Then you can do foreachPartition with a local map to store just a single event per user, e.g. foreachPartition { p => val m
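A rough sketch of what is being described; the stream's element types and the processing step are placeholders:

    // stream is assumed to be a DStream[(String, String)] of (userId, event),
    // with userId used as the Kafka message key so a user's events share a partition
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        val latest = scala.collection.mutable.Map[String, String]()
        // later events overwrite earlier ones, leaving only the last event per user
        events.foreach { case (userId, event) => latest(userId) = event }
        latest.foreach { case (userId, event) =>
          // run the expensive per-user processing here (placeholder)
          println(s"processing last event for $userId: $event")
        }
      }
    }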

Re: sparkR ORC support.

2016-01-06 Thread Felix Cheung
Yes, as Yanbo suggested, it looks like there is something wrong with the sqlContext. Could you forward us your code please? On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" wrote: You should ensure your sqlContext is HiveContext. sc <- sparkR.init() sqlContext

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks Cody again for your answer. The idea here is to process all events but only launch the big job (that is longer than the batch size) if they are the last events for an id considering the current state of data. Knowing if they are the last is my issue in fact. So I think I need two jobs.

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
The following lines are my understanding of Spark Streaming, AFAIK; I could be wrong: Spark Streaming processes data from a stream in micro-batches, one at a time. When a process takes time, the DStream's RDDs accumulate. So in my case (my process takes time) the DStream's RDDs accumulate. What I

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
If your job consistently takes longer than the batch time to process, you will keep lagging longer and longer behind. That's not sustainable, you need to increase batch sizes or decrease processing time. In your case, probably increase batch size, since you're pre-filtering it down to only 1

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
oh, and I think I installed jekyll using "gem install jekyll" On Wed, Jan 6, 2016 at 4:17 PM, Michael Armbrust wrote: > from docs/ run: > > SKIP_API=1 jekyll serve --watch > > On Wed, Jan 6, 2016 at 4:12 PM, Andy Davidson < > a...@santacruzintegration.com> wrote: > >> Hi

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
from docs/ run: SKIP_API=1 jekyll serve --watch On Wed, Jan 6, 2016 at 4:12 PM, Andy Davidson wrote: > Hi Michael > > I am happy to add some documentation. > > I forked the repo but am having trouble with the markdown. The code > examples are not rendering

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Ted Yu
Turns out that I should have specified -i to my former grep command :-) Thanks Marcelo But does this mean that specifying custom value for parameter spark.memory.offheap.size would not take effect ? Cheers On Wed, Jan 6, 2016 at 2:47 PM, Marcelo Vanzin wrote: > Try "git

Re: Timeout connecting between workers after upgrade to 1.6

2016-01-06 Thread Michael Armbrust
Logs from the workers? On Wed, Jan 6, 2016 at 1:57 PM, Jeff Jones wrote: > I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We > are now seeing regular timeouts between two of the workers when making > connections. These workers and the same

Timeout connecting between workers after upgrade to 1.6

2016-01-06 Thread Jeff Jones
I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We are now seeing regular timeouts between two of the workers when making connections. These workers and the same driver code worked fine running on 1.4.1 and finished in under a second. Any thoughts on what might have

Re: Spark Token Expired Exception

2016-01-06 Thread Nikhil Gs
These are my versions: cdh version = 5.4.1, spark version = 1.3.0, kafka = KAFKA-0.8.2.0-1.kafka1.3.1.p0.9, hbase version = 1.0.0. Regards, Nik. On Wed, Jan 6, 2016 at 3:50 PM, Ted Yu wrote: > Which Spark / hadoop release are you using ? > > Thanks > > On Wed, Jan 6, 2016 at

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Marcelo Vanzin
Try "git grep -i spark.memory.offheap.size"... On Wed, Jan 6, 2016 at 2:45 PM, Ted Yu wrote: > Maybe I looked in the wrong files - I searched *.scala and *.java files (in > latest Spark 1.6.0 RC) for '.offheap.' but didn't find the config. > > Can someone enlighten me ? > >

Date and Time as a Feature

2016-01-06 Thread Jorge Machado
Hello all, I'm new to machine learning. I'm trying to predict some electricity usage. The data is: 2015-12-10-10:00, 1200 2015-12-11-10:00, 1150 My question is: what is the best way to turn date and time into a feature in my Vector? Something like this: Vector (1200, [2015,12,10,10,10])?
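One common answer, offered as a sketch rather than something from a reply in this thread: expand the timestamp into components, and encode cyclic fields such as hour-of-day or day-of-week with sine/cosine so that 23:00 and 00:00 end up close together. The timestamp format matches the sample rows above; everything else is an assumption.

    import java.time.LocalDateTime
    import java.time.format.DateTimeFormatter
    import org.apache.spark.mllib.linalg.Vectors

    // Parses strings like "2015-12-10-10:00"
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH:mm")

    def toFeatures(ts: String): org.apache.spark.mllib.linalg.Vector = {
      val t = LocalDateTime.parse(ts, fmt)
      val hourAngle = 2 * math.Pi * t.getHour / 24.0
      val dowAngle  = 2 * math.Pi * t.getDayOfWeek.getValue / 7.0
      Vectors.dense(
        t.getMonthValue,                           // seasonality
        math.sin(hourAngle), math.cos(hourAngle),  // cyclic hour of day
        math.sin(dowAngle),  math.cos(dowAngle))   // cyclic day of week
    }

    // toFeatures("2015-12-10-10:00") would then be paired with the label 1200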

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Andy Davidson
Hi Michael, I am happy to add some documentation. I forked the repo but am having trouble with the markdown. The code examples are not rendering correctly. I am on a Mac and using https://itunes.apple.com/us/app/marked-2/id890031187?mt=12 I use emacs or some other text editor to change the md.

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Ted Yu
Maybe I looked in the wrong files - I searched *.scala and *.java files (in latest Spark 1.6.0 RC) for '.offheap.' but didn't find the config. Can someone enlighten me ? Thanks On Wed, Jan 6, 2016 at 2:35 PM, Jakob Odersky wrote: > Check the configuration guide for a

Re: Problems with too many checkpoint files with Spark Streaming

2016-01-06 Thread Tathagata Das
Could you show a sample of the file names? There are multiple things that use UUIDs, so it would be good to see what the 100s of directories being generated every second are. If you are checkpointing every 400s then there shouldn't be checkpoint directories written every second. They should be

connecting beeline to spark sql thrift server

2016-01-06 Thread Sunil Kumar
Hi, I have an AWS Spark EMR cluster running with Spark 1.5.2, Hadoop 2.6 and Hive 1.0.0. I brought up the Spark SQL thriftserver on this cluster with spark.sql.hive.metastore version set to 1.0. When I try to connect to this thriftserver remotely using beeline packaged with

Re: Why is this job running since one hour?

2016-01-06 Thread Jakob Odersky
What is the job doing? How much data are you processing? On 6 January 2016 at 10:33, unk1102 wrote: > Hi I have one main Spark job which spawns multiple child spark jobs. One of > the child spark job is running for an hour and it keeps on hanging there I > have taken snap

Problems with too many checkpoint files with Spark Streaming

2016-01-06 Thread Jan Algermissen
Hi, we are running a streaming job that processes about 500 events per 20s batch and uses updateStateByKey to accumulate web sessions (with a 30-minute live time). The checkpoint interval is set to 20x the batch interval, that is 400s. Cluster size is 8 nodes. We are having trouble with the

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Jakob Odersky
Check the configuration guide for a description on units ( http://spark.apache.org/docs/latest/configuration.html#spark-properties). In your case, 5GB would be specified as 5g. On 6 January 2016 at 10:29, unk1102 wrote: > Hi As part of Spark 1.6 release what should be
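For reference, a sketch of how the setting might be passed. Note the camel-case property spelling, which is why the case-insensitive grep in the follow-up messages was needed; whether the value must be plain bytes or may carry a suffix like 5g depends on your Spark version, so check the configuration guide for the version you run.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      // size of the off-heap region; e.g. "5g" if your version accepts size suffixes,
      // otherwise the equivalent number of bytes
      .set("spark.memory.offHeap.size", "5g")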

Need Help in Spark Hive Data Processing

2016-01-06 Thread Balaraju.Kagidala Kagidala
Hi, I am a new user of Spark. I am trying to use Spark to process huge Hive data using Spark DataFrames. I have a 5-node Spark cluster, each node with 30 GB memory. I want to process a Hive table with 450 GB of data using DataFrames. Fetching a single row from the Hive table takes 36 minutes. Please suggest

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-06 Thread Priya Ch
The line of code which I highlighted in the screenshot is within the spark source code. Spark implements sort-based shuffle implementation and the spilled files are merged using the merge sort. Here is the link https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf

Re: Out of memory issue

2016-01-06 Thread Muthu Jayakumar
Thanks Ewan Leith. This seems like a good start, as it seems to match the symptoms I am seeing :). But how do I specify "parquet.memory.pool.ratio"? The Parquet code seems to take this parameter from ParquetOutputFormat.getRecordWriter() (ref code: float
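It is not confirmed in this thread how the SPARK-12546 workaround is meant to be wired up, but since ParquetOutputFormat reads its settings from the Hadoop job configuration, one way to try passing the value (an assumption, not a verified fix) is:

    // Assumption: the Parquet output format picks this up from the Hadoop
    // configuration that Spark supplies when writing Parquet files
    sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")

    // then write as usual, e.g.
    // df.write.partitionBy("date").parquet("/some/output/path")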

Re: Need Help in Spark Hive Data Processing

2016-01-06 Thread Jeff Zhang
It depends on how you fetch the single row. Is your query complex? On Thu, Jan 7, 2016 at 12:47 PM, Balaraju.Kagidala Kagidala < balaraju.kagid...@gmail.com> wrote: > Hi , > > I am new user to spark. I am trying to use Spark to process huge Hive > data using Spark DataFrames. > > > I have 5

Re: org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtAiQta22XrCI On Wed, Jan 6, 2016 at 8:41 PM, Jia Zou wrote: > Dear all, > > I am using Spark1.5.2 and Tachyon0.7.1 to run KMeans with > inputRDD.persist(StorageLevel.OFF_HEAP()). > > I've set tired storage for

spark dataframe read large mysql table running super slow

2016-01-06 Thread fightf...@163.com
Hi, recently I have been planning to use Spark SQL to run some tests over a large MySQL data table, and to compare the performance between Spark and MyCat. However, the load is super slow and I hope someone can help tune this. Environment: Spark 1.4.1 Code snippet: val prop = new
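The code snippet is cut off above, but for a large table the usual first tuning step is to give the JDBC reader a partition column so the load is split across tasks rather than done in a single query. A sketch, where the URL, credentials, table, column and bounds are all assumptions:

    import java.util.Properties

    val prop = new Properties()
    prop.setProperty("user", "dbuser")        // placeholder credentials
    prop.setProperty("password", "secret")

    // Splits the read into 32 parallel range queries on a numeric column
    val df = sqlContext.read.jdbc(
      "jdbc:mysql://host:3306/db",   // JDBC url (placeholder)
      "big_table",                   // table (placeholder)
      "id",                          // partition column: indexed, roughly uniform numeric
      0L,                            // lowerBound
      100000000L,                    // upperBound
      32,                            // numPartitions
      prop)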

Re: LogisticsRegression in ML pipeline help page

2016-01-06 Thread Wen Pei Yu
You can find older documentation under http://spark.apache.org/documentation.html and the linear methods docs for 1.5.2 here: http://spark.apache.org/docs/1.5.2/mllib-linear-methods.html#logistic-regression http://spark.apache.org/docs/1.5.2/ml-linear-methods.html Regards. Yu Wenpei. From: Arunkumar Pillai

org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Jia Zou
Dear all, I am using Spark 1.5.2 and Tachyon 0.7.1 to run KMeans with inputRDD.persist(StorageLevel.OFF_HEAP()). I've set tiered storage for Tachyon. It is all right when the working set is smaller than available memory. However, when the working set exceeds available memory, I keep getting errors like

LogisticsRegression in ML pipeline help page

2016-01-06 Thread Arunkumar Pillai
Hi, I need the help page for logistic regression in the ML pipeline. When I browse, I only get the 1.6 help. Please help me. -- Thanks and Regards Arun

Re: Update Hive tables from Spark without loading entire table in to a dataframe

2016-01-06 Thread Jörn Franke
You can mark the table as transactional and then you can do single updates. > On 07 Jan 2016, at 08:10, sudhir wrote: > > Hi, > > I have a hive table of 20 lakh records, and to update a row I have to load the > entire table into a dataframe, process that, and then save it

Re: Need Help in Spark Hive Data Processing

2016-01-06 Thread Jörn Franke
You need the table in an efficient format, such as ORC or Parquet. Have the table sorted appropriately (hint: most discriminating column in the where clause). Do not use SAN or virtualization for the slave nodes. Can you please post your query? I always recommend avoiding single updates where

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
That should read "I think you're missing the --name option". Sorry about that. On Wed, Jan 6, 2016 at 3:03 PM, Todd Nist wrote: > Hi Jade, > > I think you "--name" option. The makedistribution should look like this: > > ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Hi Jade, I think you "--name" option. The makedistribution should look like this: ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests. As for why it failed to build with scala 2.11, did you run the

Spark Token Expired Exception

2016-01-06 Thread Nikhil Gs
Hello Team, Thank you for your time in advance. Below are the log lines of my Spark job, which is used for consuming messages from a Kafka instance and loading them to HBase. I have noticed the below WARN lines and later they resulted in errors. But I noticed that exactly after 7 days the token

Re: error writing to stdout

2016-01-06 Thread Bryan Cutler
This is a known issue https://issues.apache.org/jira/browse/SPARK-9844. As Noorul said, it is probably safe to ignore as the executor process is already destroyed at this point. On Mon, Dec 21, 2015 at 8:54 PM, Noorul Islam K M wrote: > carlilek

Re: problem building spark on centos

2016-01-06 Thread Jade Liu
Hi, Todd: Thanks for your suggestion. Yes I did run the ./dev/change-scala-version.sh 2.11 script when using Scala version 2.11. I just tried this as you suggested: ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Not sure, I just built it with java 8, but 7 is supported so that should be fine. Are you using maven 3.3.3 + ? RADTech:spark-1.5.2 tnist$ mvn -version Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0 Apache Maven 3.3.3

Re: Spark Token Expired Exception

2016-01-06 Thread Ted Yu
Which Spark / hadoop release are you using ? Thanks On Wed, Jan 6, 2016 at 12:16 PM, Nikhil Gs wrote: > Hello Team, > > > Thank you for your time in advance. > > > Below are the log lines of my spark job which is used for consuming the > messages from Kafka Instance

Re: Unable to run spark SQL Join query.

2016-01-06 Thread ๏̯͡๏
Any suggestions on how to do joins in Spark SQL? The above Spark SQL format/syntax is not working. On Mon, Jan 4, 2016 at 2:33 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > There are three tables in action here. > > Table A (success_events.sojsuccessevents1) JOIN TABLE B (dw_bid) to > create
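Without the full query it is hard to say what is failing, but for reference the same join can be expressed with the DataFrame API; the table names come from the quoted message, while the join columns below are placeholders, not the real sojsuccessevents1/dw_bid columns:

    val a = sqlContext.table("success_events.sojsuccessevents1")
    val b = sqlContext.table("dw_bid")

    // Inner join on an assumed shared key; a third table can be joined onto
    // the result in exactly the same way
    val joined = a.join(b, a("item_id") === b("item_id"), "inner")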

Re: problem building spark on centos

2016-01-06 Thread Marcelo Vanzin
If you're trying to compile against Scala 2.11, you're missing "-Dscala-2.11" in that command. On Wed, Jan 6, 2016 at 12:27 PM, Jade Liu wrote: > Hi, Todd: > > Thanks for your suggestion. Yes I did run the ./dev/change-scala-version.sh > 2.11 script when using scala version

Re: problem building spark on centos

2016-01-06 Thread Jade Liu
Yes I’m using maven 3.3.9. From: Todd Nist > Date: Wednesday, January 6, 2016 at 12:33 PM To: Jade Liu > Cc: "user@spark.apache.org"

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Kristina Rogale Plazonic
Try redefining your window without the sortBy part. In other words, rerun your code with window = Window.partitionBy("a"). The thing is that the window is defined differently in these two cases. In your example, in the group where "a" is 1: - If you include the "sortBy" option, it is a rolling