Currently no - GBT implements the Predictor interface, not the Classifier interface.
It might be possible to write a wrapper around it that extends the Classifier
trait.
Hopefully GBT will support multi-class at some point. In the meantime you can use
RandomForest, which does support multi-class.
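For example, a rough Scala sketch of multi-class training with RandomForestClassifier (column names and the `training`/`test` DataFrames are assumed):

import org.apache.spark.ml.classification.RandomForestClassifier

// assumes a DataFrame `training` with a numeric multi-class "label" column
// and a "features" vector column
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)

val model = rf.fit(training)
val predictions = model.transform(test)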
On Fri, 21 Oct 2016 at 02:
Hi,
I see lots of parquet logs in container logs (YARN mode), like below:
stdout:
Oct 21, 2016 2:27:30 PM INFO:
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 8,448B for
[ss_promo_sk] INT32: 5,996 values, 8,513B raw, 8,409B comp, 1 pages, encodings:
[PLAIN_DICTIONARY, BIT_PACKED,
The blocks param will set both the user and item blocks.
Spark 2.0 supports separate user and item blocks for PySpark:
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation
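For reference, a rough Scala sketch of the Spark 2.0 spark.ml ALS estimator with the block counts set separately (the PySpark ALS linked above exposes the same numUserBlocks/numItemBlocks parameters; the `ratings` DataFrame is assumed):

import org.apache.spark.ml.recommendation.ALS

// `ratings` is assumed to be a DataFrame with integer user/item columns and a rating column
val als = new ALS()
  .setImplicitPrefs(true)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .setNumUserBlocks(20)  // partitioning of the user axis
  .setNumItemBlocks(20)  // partitioning of the item axis

val model = als.fit(ratings)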
On Fri, 21 Oct 2016 at 08:12 Nikhil Mishra
wrote:
> Hi,
>
> I have a question about the bloc
I see, I had this issue before. I think you are using Java 8, right?
The Java 8 JVM requires more bootstrap heap memory.
Turning off the memory check is an unsafe way to avoid this issue. I think
it is better to increase the memory ratio, e.g. by raising
yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml.
Hi,
I have a question about the block size to be specified in
ALS.trainImplicit() in pyspark (Spark 1.6.1). There is only one block size
parameter to be specified. I want to know if that would result in
partitioning both the user and the item axes.
For example, I am using the following c
Thanks for the response. What do you mean by "semantically" the same?
They're both Datasets of the same type, which is a case class, so I would
expect compile-time integrity of the data. Is there a situation where this
wouldn't be the case?
Interestingly enough, if I instead create an empty rdd wi
I believe this normally comes when Spark is unable to perform a union due to
a "difference" in the schema of the operands. Can you check whether the schemas
of both datasets are semantically the same?
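A quick way to compare, sketched in Scala (`ds1`/`ds2` stand in for the two datasets being unioned):

// print both schemas and check them for structural equality
ds1.printSchema()
ds2.printSchema()
val sameSchema = ds1.schema == ds2.schema  // names, types, nullability and field order all count
println(s"schemas identical: $sameSchema")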
On Tue, Oct 18, 2016 at 9:06 AM, Efe Selcuk wrote:
> Bump!
>
> On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk w
I set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml
and it now works for yarn-client and spark-shell.
On Fri, Oct 21, 2016 at 10:59 AM, Li Li wrote:
> I found a warning in the nodemanager log. Has the virtual memory limit been
> exceeded? How should I configure yarn to solve this problem?
>
> 2016-10-21 10:
I found a warning in the nodemanager log. Has the virtual memory limit been
exceeded? How should I configure yarn to solve this problem?
2016-10-21 10:41:12,588 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 20299 for container-id
container_14770
It is not that Spark has difficulty communicating with YARN; it simply means the AM
exited with the FINISHED state.
I'm guessing it might be related to memory constraints for the container;
please check the YARN RM and NM logs to find out more details.
Thanks
Saisai
On Fri, Oct 21, 2016 at 8:14 AM, Xi Shen
16/10/20 18:12:14 ERROR cluster.YarnClientSchedulerBackend: Yarn
application has already exited with state FINISHED!
From this, I think Spark is having difficulty communicating with YARN. You
should check your Spark log.
On Fri, Oct 21, 2016 at 8:06 AM Li Li wrote:
which log file should I
On
It appears as if the inheritance hierarchy doesn't allow GBTClassifiers to be
used as the binary classifier in a OneVsRest trainer. Is there a simple way
to use gradient-boosted trees for multiclass (not binary) problems?
Specifically, it complains that GBTClassifier doesn't inherit from
Classifie
which log file should I
On Thu, Oct 20, 2016 at 10:02 PM, Saisai Shao wrote:
> Looks like ApplicationMaster is killed by SIGTERM.
>
> 16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
> 16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status:
>
> This container may be k
which log file should I check?
On Thu, Oct 20, 2016 at 11:32 PM, Amit Tank
wrote:
> I recently started learning Spark so I may be completely wrong here, but I
> was facing a similar problem with SparkPi on YARN. After changing to
> yarn-cluster mode it worked perfectly fine.
>
> Thank you,
> Amit
>
Yes, when I use yarn-cluster mode it works correctly. What's wrong with
yarn-client? The spark shell also does not work, because it runs in client
mode. Any solution for this?
On Thu, Oct 20, 2016 at 11:32 PM, Amit Tank
wrote:
> I recently started learning spark so I may be completely wrong here but I
> was fa
I believe what I am looking for is DataFrameWriter.bucketBy, which
would allow bucketing into physical parquet files by the desired
columns. Then my question would be: can DataFrames/Datasets take advantage
of this physical bucketing upon reading the parquet file for something
like a self-join on the
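Roughly what I have in mind, sketched in Scala (table and column names are made up, and as far as I know bucketBy in Spark 2.0 only works together with saveAsTable, not with a plain path-based save):

// write the data bucketed and sorted by the join key
df.write
  .bucketBy(16, "customer_id")
  .sortBy("customer_id")
  .format("parquet")
  .saveAsTable("events_bucketed")

// read it back through the table so the bucketing metadata is picked up
val events = spark.table("events_bucketed")
val selfJoined = events.as("a").join(events.as("b"), "customer_id")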
Hi,
I have 40+ structured data sets stored in an S3 bucket as parquet files.
I am going to use 20 tables in the use case.
There is a main table which drives the whole flow. The main table contains 1k
records.
My use case is, for every record in the main table, to process the rest of the
tables (join, group by, depend
In my application, I group training samples by their model_id
(the input table contains training samples for 100k different models),
so each group ends up having about 1 million training samples.
I then feed each group of samples to a small Logistic Regression solver
(SGD), but SGD r
Right on, I put in a PR to make a note of that in the docs.
On Thu, Oct 20, 2016 at 12:13 PM, Srikanth wrote:
> Yeah, setting those params helped.
>
> On Wed, Oct 19, 2016 at 1:32 PM, Cody Koeninger wrote:
>>
>> 60 seconds for a batch is above the default settings in kafka related
>> to heartbea
Yeah, setting those params helped.
On Wed, Oct 19, 2016 at 1:32 PM, Cody Koeninger wrote:
> 60 seconds for a batch is above the default settings in kafka related
> to heartbeat timeouts, so that might be related. Have you tried
> tweaking session.timeout.ms, heartbeat.interval.ms, or related
>
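For anyone else hitting this, a rough sketch of where those consumer settings go with the Kafka 0.10 direct stream (broker, group and topic names are made up, an existing StreamingContext `ssc` is assumed, and the right values depend on your batch interval and broker config):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-group",
  // raise these if processing a batch takes longer than the consumer defaults allow
  "session.timeout.ms" -> "120000",
  "heartbeat.interval.ms" -> "30000"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))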
Is there a way to predict a single vector with the new spark.ml API? In my
case it's because I want to do this within a map() to avoid calling
groupByKey() after a flatMap():
*Current code (pyspark):*
% Given 'model', 'rdd', and a function 'split_element' that splits an
element of the RD
I would also like to know if there is a way to predict a single vector with
the new spark.ml API, although in my case it's because I want to do this
within a map() to avoid calling groupByKey() after a flatMap():
*Current code (pyspark):*
% Given 'model', 'rdd', and a function 'split_element' tha
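One workaround I'm considering, sketched in Scala with an mllib-style model rather than spark.ml, since those expose a per-row predict ('model' and 'rdd' play the same roles as above; everything else is made up):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// broadcast the trained model once, then score each row inside the map,
// so no groupByKey() is needed after the flatMap()
def predictPerRow(model: LogisticRegressionModel, rdd: RDD[(String, Vector)]): RDD[(String, Double)] = {
  val bcModel = rdd.sparkContext.broadcast(model)
  rdd.map { case (key, features) => (key, bcModel.value.predict(features)) }
}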
Hello everyone,
I'm having a usage issue with the HashingTF class from Spark MLlib.
I'm computing TF-IDF on a set of terms/documents, which I later use to
identify the most important terms in each input document.
Below is a short code snippet which outlines the example (2 documents with
2 words
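For context, the flow I'm following is roughly the standard MLlib TF-IDF recipe (toy documents and my own variable names):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// two toy documents, each a sequence of terms
val documents: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("spark", "streaming"),
  Seq("spark", "mllib")))

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)  // hashed term frequencies
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)             // TF-IDF weight per hashed term index

One thing that seems relevant: HashingTF only keeps hashed indices, so mapping a weight back to a term appears to require looking up hashingTF.indexOf(term) for the terms of interest.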
I recently started learning Spark so I may be completely wrong here, but I
was facing a similar problem with SparkPi on YARN. After changing to
yarn-cluster mode it worked perfectly fine.
Thank you,
Amit
On Thursday, October 20, 2016, Saisai Shao wrote:
> Looks like ApplicationMaster is killed by
Thanks a lot.
I'll check it.
Regards.
On Thu, 20 Oct 2016 at 10:50, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:
You can still implement your own logic with Akka actors, for instance. Based
on some threshold, the actor can launch a Spark batch job using the same
Spark context
Try to set the memory size limits. For example:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn --deploy-mode cluster --driver-memory 4g
--executor-memory 2g --executor-cores 1
./examples/jars/spark-examples_2.11-2.0.0.2.5.2.0-47.jar
By default yarn pre
Looks like ApplicationMaster is killed by SIGTERM.
16/10/20 18:12:04 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
16/10/20 18:12:04 INFO yarn.ApplicationMaster: Final app status:
This container may have been killed by the YARN NodeManager or other processes; you'd
better check the YARN logs to dig out more
The situation changes a bit, and the workaround is to add a restriction on `K` (K
should be a subtype of Product).
Though I now have another error:
org.apache.spark.sql.AnalysisException: cannot resolve '(`key` = `key`)' due
to data type mismatch: differing types in '(`key` = `key`)'
(struct an
You can still implement your own logic with Akka actors, for instance. Based
on some threshold, the actor can launch a Spark batch job using the same
Spark context... It's only an idea, no real experience.
On 20 Oct 2016 at 1:31 PM, "Paulo Candido" wrote:
> In this case I haven't any alternatives
If you are running locally, I do not see the point of starting
with 32 executors with 2 cores each.
Also, you can check the Spark web console to find out where the time is spent.
You may also want to read
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
What is the use case for this? You will reduce performance significantly.
Nevertheless, the way you propose is the way to go, though I do not recommend it.
> On 20 Oct 2016, at 14:00, Ashan Taha wrote:
>
> Hi
>
> What’s the best way to make sure an Avro file is NOT Splitable when read in
> Spark?
Hi
What's the best way to make sure an Avro file is NOT Splitable when read in
Spark?
Would you override AvroKeyInputFormat.isSplitable (to return false) and
then call this using newAPIHadoopRDD? Or is there a better way using the
sqlContext.read?
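The override approach I have in mind would be roughly this (class name is mine, path is a placeholder, untested):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.JobContext

// identical to AvroKeyInputFormat except that it refuses to split files
class NonSplittableAvroKeyInputFormat[T] extends AvroKeyInputFormat[T] {
  override protected def isSplitable(context: JobContext, filename: Path): Boolean = false
}

val rdd = sc.newAPIHadoopFile(
  "/path/to/data.avro",
  classOf[NonSplittableAvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])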
Thanks in advance
In this case, don't I have any alternative to get micro-batches of the same
size? Using another class or some configuration? I'm using a socket.
Thank you for your attention.
On Thu, 20 Oct 2016 at 09:24, 王贺 (Gabriel)
wrote:
> The interval is for time, so you won't get micro-batches in same data si
The interval is specified in time, so you won't get micro-batches of the same
data size, only of the same time length.
Yours sincerely,
Gabriel (王贺)
Mobile: +86 18621263813
On Thu, Oct 20, 2016 at 6:38 PM, pcandido wrote:
> Hello folks,
>
> I'm using Spark Streaming. My question is simple:
> The documentation says
Hi all,
I am trying to use my custom Aggregator on a GroupedDataset of case classes
to create a hash map using Spark SQL 1.6.2.
My Encoder[Map[Int, String]] is not able to reconstruct the reduced
values if I define it via ExpressionEncoder().
However, everything works fine if I define it as Enc
I am setting up a small yarn/spark cluster. The hadoop/yarn version is
2.7.3 and I can run the wordcount map-reduce job correctly on yarn.
I am using spark-2.0.1-bin-hadoop2.7 with the command:
~/spark-2.0.1-bin-hadoop2.7$ ./bin/spark-submit --class
org.apache.spark.examples.SparkPi --master yarn-client
exam
Hello folks,
I'm using Spark Streaming. My question is simple:
The documentation says that micro-batches arrive at intervals. The intervals
are in real time (minutes, seconds). I want to get micro-batches of the same
size, so: can I configure SS to return a micro-batch when it reaches a
determined leng
> On 19 Oct 2016, at 21:46, Jakob Odersky wrote:
>
> Another reason I could imagine is that files are often read from HDFS,
> which by default uses line terminators to separate records.
>
> It is possible to implement your own hdfs delimiter finder, however
> for arbitrary json data, finding th
I'm training a random forest model using Spark 2.0 on YARN with a command like:
$SPARK_HOME/bin/spark-submit \
--class com.netease.risk.prediction.HelpMain --master yarn --deploy-mode
client --driver-cores 1 --num-executors 32 --executor-cores 2 --driver-memory
10g --executor-memory 6g \
--conf spark.rp
What exactly do you mean by YARN console? We use spark-submit and it
generates exactly the same log as you mentioned on the driver console.
On Thu, Oct 20, 2016 at 8:21 PM, Jone Zhang wrote:
> I submit spark with "spark-submit --master yarn-cluster --deploy-mode
> cluster"
> How can i display message on
Yes, there are similar functions available; depending on your Spark version,
look up the PySpark SQL functions module documentation. I also prefer to use SQL
directly within pyspark.
On Thu, Oct 20, 2016 at 8:18 PM, Mendelson, Assaf
wrote:
> Depending on your usecase, you may want to take a look at win
I submit spark with "spark-submit --master yarn-cluster --deploy-mode
cluster"
How can I display messages on the yarn console?
I expect it to be like this:
.
16/10/20 17:12:53 main INFO org.apache.spark.deploy.yarn.Client>SPK>
Application report for application_1453970859007_481440 (state: RUNNING)
Depending on your use case, you may want to take a look at window functions
From: muhammet pakyürek [mailto:mpa...@hotmail.com]
Sent: Thursday, October 20, 2016 11:36 AM
To: user@spark.apache.org
Subject: pyspark dataframe codes for lead lag to column
is there pyspark dataframe codes for lead
I have a Column in a DataFrame that contains Arrays and I want to filter for
equality. It works fine in Spark 1.6 but not in 2.0.
In Spark 1.6.2:
import org.apache.spark.sql.SQLContext
case class DataTest(lists: Seq[Int])
val sql = new SQLContext(sc)
val data = sql.createDataFrame(sc.parallelize(Seq(
What is the issue you see when unioning?
On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar wrote:
> Hello Michael,
>
> Thank you for looking into this query. In my case there seem to be an
> issue when I union a parquet file read from disk versus another dataframe
> that I construct in-memory. Th
Is there pyspark dataframe code for lead/lag on a column?
A lead/lag column is something like:
column  lag  lead
1       -1   2
2       1    3
3       2    4
4       3    5
5       4    -1
Hi, I want to extract the `weight` attribute of an array and combine the values
to construct a sparse vector.
### My data is like this:
scala> mblog_tags.printSchema
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: strin