Re: the compile of Spark stopped without any hints, would you help me please?

2017-06-25 Thread Ted Yu
Does adding -X to the mvn command give you more information? Cheers On Sun, Jun 25, 2017 at 5:29 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > Today I used a new PC to compile Spark. > At the beginning, it worked well. > But it stopped at some point. > The content in the console is: >

What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread jeff saremi
You can do a map() using a select and functions/UDFs. But how do you process a partition using SQL?

Re: issue about the windows slice of stream

2017-06-25 Thread 萝卜丝炒饭
Hi all, Let me add more info about this. The log showed:
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
17/06/25 17:31:26

Re: Can we access files on Cluster mode

2017-06-25 Thread sudhir k
Thank you. I guess I have to use a common mount or S3 to access those files. On Sun, Jun 25, 2017 at 4:42 AM Mich Talebzadeh wrote: > Thanks. In my experience certain distros like Cloudera only support yarn > client mode so AFAIK the driver stays on the Edge node.

Re: How does HashPartitioner distribute data in Spark?

2017-06-25 Thread Russell Spitzer
A clearer explanation: `parallelize` does not apply a partitioner. We can see this quickly with a short code example:
scala> val rdd1 = sc.parallelize(Seq(("aa", 1), ("aa", 2), ("aa", 3)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
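To extend that spark-shell sketch (a minimal illustration, assuming a running SparkContext sc): parallelize() leaves the partitioner unset, while partitionBy() assigns one, so equal keys land in the same partition.

import org.apache.spark.HashPartitioner

val rdd1 = sc.parallelize(Seq(("aa", 1), ("aa", 2), ("aa", 3)))
rdd1.partitioner                               // None: no partitioner applied

val rdd2 = rdd1.partitionBy(new HashPartitioner(2))
rdd2.partitioner                               // Some(HashPartitioner): keys are now hash-distributed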

Re: Could you please add a book info on Spark website?

2017-06-25 Thread Sean Owen
Please get Packt to fix their existing PR. It's been open for months https://github.com/apache/spark-website/pull/35 On Sun, Jun 25, 2017 at 12:33 PM Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi Sean, > > Last time, you helped me add a book info (in the books section) on this

Re: Could you please add a book info on Spark website?

2017-06-25 Thread Md. Rezaul Karim
Thanks, Sean. I will ask them to do so. Regards, *Md. Rezaul Karim*, BSc, MSc, PhD Researcher, INSIGHT Centre for Data Analytics, National University of Ireland, Galway, IDA Business Park, Dangan, Galway, Ireland. Web: http://www.reza-analytics.eu/index.html

RDD and DataFrame persistent memory usage

2017-06-25 Thread Ashok Kumar
Gurus, I understand that when we create an RDD in Spark it is immutable. So I have a few points please: - When an RDD is created, is that just a pointer? Most Spark operations are lazy, so the RDD is not consumed until a collect operation is done that affects it? - When a DF is created from an RDD, does that

Re: Can we access files on Cluster mode

2017-06-25 Thread Mich Talebzadeh
Hi Anastasios. Are you implying that in YARN cluster mode, even if you submit your Spark application on an Edge node, the driver can start on any node? I was under the impression that the driver starts on the Edge node and the executors can be on any node in the cluster (where Spark agents are

Re: Question on Spark code

2017-06-25 Thread Sean Owen
Maybe you are looking for declarations like this. "=> String" means the arg isn't evaluated until it's used, which is just what you want with log statements. The message isn't constructed unless it will be logged.
protected def logInfo(msg: => String) {
On Sun, Jun 25, 2017 at 10:28 AM kant
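A sketch of the by-name pattern Sean describes, simplified from Spark's Logging trait (the log field here stands in for an SLF4J logger):

trait Logging {
  def log: org.slf4j.Logger

  // msg: => String is passed by name: the string is not built at the
  // call site; it is only constructed if/when msg is referenced below.
  protected def logInfo(msg: => String): Unit = {
    if (log.isInfoEnabled) log.info(msg)  // guard + lazy construction
  }
}

// Usage: the concatenation never happens when INFO is disabled.
// logInfo("processed " + n + " records")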

Re: Can we access files on Cluster mode

2017-06-25 Thread Mich Talebzadeh
Thanks. In my experience certain distros like Cloudera only support yarn client mode so AFAIK the driver stays on the Edge node. Happy to be corrected :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Can we access files on Cluster mode

2017-06-25 Thread Anastasios Zouzias
Hi Mich, If the driver starts on the edge node with cluster mode, then I don't see the difference between client and cluster deploy mode. In cluster mode, it is the responsibility of the resource manager (yarn, etc) to decide where to run the driver (at least for spark 1.6 this is what I have

Re: Question on Spark code

2017-06-25 Thread Herman van Hövell tot Westerflier
I am not getting the question. The logging trait does exactly what it says on the box; I don't see what string concatenation has to do with it. On Sun, Jun 25, 2017 at 11:27 AM, kant kodali wrote: > Hi All, > > I came across this file

Problem in avg function Spark 1.6.3 using spark-shell

2017-06-25 Thread Eko Susilo
Hi, I have a DataFrame called “secondDf”. When I perform a groupBy and then a sum of each column, it works perfectly. However, when I try to calculate the average of a column, it says the column name is not found. The details are as follows: val total = secondDf.filter("ImageWidth

the compile of Spark stopped without any hints, would you help me please?

2017-06-25 Thread 萝卜丝炒饭
Hi all, Today I used a new PC to compile Spark. At the beginning, it worked well. But it stopped at some point. The content in the console is:
[INFO]
[INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @ spark-parent_2.11 ---
[INFO]
[INFO] ---

Re: Can we access files on Cluster mode

2017-06-25 Thread Anastasios Zouzias
Just to note that in cluster mode the Spark driver might run on any node of the cluster, hence you need to make sure that the file exists on *all* nodes. Push the file to all nodes or use client deploy-mode. Best, Anastasios On 24.06.2017 at 23:24, "Holden Karau" wrote: >

Could you please add a book info on Spark website?

2017-06-25 Thread Md. Rezaul Karim
Hi Sean, Last time, you helped me add a book's info (in the books section) on this page: https://spark.apache.org/documentation.html. Could you please add another book's info? Here's the necessary information about the book: *Title*: Scala and Spark for Big Data Analytics *Authors*: Md. Rezaul Karim,

Question on Spark code

2017-06-25 Thread kant kodali
Hi All, I came across this file https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala and I am wondering what the purpose of it is. Especially since it doesn't prevent any string concatenation, and the if checks are already done by the library

How to Fill Sparse Data With the Previous Non-Empty Value in a Spark Dataset

2017-06-25 Thread Carlo . Allocca
Dear All, I need to apply a dataset transformation to replace null values with the previous non-null value. As an example, I report the following:
from:
id | col1
---------
 1 | null
 1 | null
 2 | 4
 2 | null
 2 | null
 3 | 5
 3 | null
 3 | null
to:
id | col1
---------
 1 | null
 1 | null
 2 | 4
 2 |
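One common approach (a hedged sketch, not necessarily what the list settled on) is last() with ignoreNulls over a running window, in Spark 2.1+. It assumes some column defines row order within each id (here a hypothetical ts column), since "previous" is undefined without an ordering:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val w = Window.partitionBy("id").orderBy("ts")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// df is the input frame above; col1_filled carries the last non-null
// value of col1 seen so far within each id.
val filled = df.withColumn("col1_filled", last("col1", ignoreNulls = true).over(w))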

Re: Question on Spark code

2017-06-25 Thread Sean Owen
I think it's more precise to say args, like any expression, are evaluated when their value is required. It's just that this special syntax causes extra code to be generated that makes it effectively a function that is passed, not a value, and one that's lazily evaluated. Look at the bytecode if you're

Re: Question on Spark code

2017-06-25 Thread kant kodali
Impressive! I need to learn more about Scala. What I mean by stripping away the conditional check in Java is this:
static final boolean isLogInfoEnabled = false;

public void logMessage(String message) {
    if (isLogInfoEnabled) {
        log.info(message);
    }
}
If you look at the byte code, the dead

Re: Question on Spark code

2017-06-25 Thread kant kodali
@Sean Got it! I come from the Java world so I guess I was wrong in assuming that arguments are evaluated at method invocation time. How about the conditional checks to see if the log is InfoEnabled or DebugEnabled? For example: if (log.isInfoEnabled) log.info(msg) I hear we should use guard

Re: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-25 Thread Mich Talebzadeh
This typically works OK for standalone mode with moderate resources:
${SPARK_HOME}/bin/spark-submit \
  --driver-memory 6G \
  --executor-memory 2G \
  --num-executors 2 \
  --executor-cores 2 \
  --master

Meetup in Taiwan

2017-06-25 Thread Yang Bryan
Hi, I'm Bryan, the co-founder of the Taiwan Spark User Group. We discuss and share information on https://www.facebook.com/groups/spark.tw/. We have a physical meetup twice a month. Please help us get added to the official website. Also, we will hold a code competition about Spark; could we print the logo of

RE: HDP 2.5 - Python - Spark-On-Hbase

2017-06-25 Thread Mahesh Sawaiker
Ayan, The location of the logging class was moved from Spark 1.6 to Spark 2.0. It looks like you are trying to run 1.6 code on 2.0. I have ported some code like this before, and if you have access to the code you can recompile it by changing the reference to the Logging class and directly use the slf4

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need for us to explicitly ask for a forEachPartition in Spark SQL, because Tungsten is smart enough to figure out whether a SQL operation can be applied on each partition or whether there has to be a shuffle. On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before Tungsten, and often enough this did hurt performance. Even now, Tungsten will not do the best job every time, so the question from the OP is still germane. 2017-06-25 19:18 GMT-07:00 Ryan : > Why would you like to

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Ryan
Do you mean you'd like to partition the data by a specific key? If we issue a cluster by/repartition, the following operation needn't shuffle; it's effectively the same as forEachPartition, I think. Or we could always get the underlying RDD from the Dataset, translating the SQL operation to a function...

Re: Spark streaming persist to hdfs question

2017-06-25 Thread ayan guha
I would suggest using Flume, if possible, as it has built-in HDFS log-rolling capabilities. On Mon, Jun 26, 2017 at 1:09 PM, Naveen Madhire wrote: > Hi, > > I am using spark streaming with 1 minute duration to read data from kafka > topic, apply transformations and

Re: Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
We are also doing transformations; that's the reason we're using Spark Streaming. Does Spark Streaming support tumbling windows? I was thinking I could use a window operation for writing into HDFS. Thanks On Sun, Jun 25, 2017 at 10:23 PM, ayan guha wrote: > I would suggest to use
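For reference, a tumbling window falls out of the DStream API when the slide interval equals the window length. A sketch, assuming an existing stream: DStream[String] and a hypothetical output path:

import org.apache.spark.streaming.Minutes

// slide == window => tumbling: each record is covered exactly once
val tumbling = stream.window(Minutes(10), Minutes(10))
tumbling.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // one HDFS directory per 10-minute window instead of per minute
    rdd.saveAsTextFile(s"/data/out/window-${System.currentTimeMillis}")
  }
}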

Re: Problem in avg function Spark 1.6.3 using spark-shell

2017-06-25 Thread Riccardo Ferrari
Hi, Looks like you performed an aggregation on the ImageWidth column already. The error itself is quite self-explanatory: Cannot resolve column name "ImageWidth" among (MainDomainCode, *avg(length(ImageWidth))*). The columns available in that DF are MainDomainCode and avg(length(ImageWidth)), so
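A sketch of the usual fix: alias the aggregate so later code can refer to it by a stable name (avgImageWidth is our choice here, not from the original thread):

import org.apache.spark.sql.functions.{avg, col, length}

val result = secondDf
  .groupBy("MainDomainCode")
  .agg(avg(length(col("ImageWidth"))).alias("avgImageWidth"))

// The aggregate column now resolves by its explicit alias.
result.select("MainDomainCode", "avgImageWidth").show()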

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-25 Thread ayan guha
Hi I am using following: --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ Is it compatible with Spark 2.X? I would like to use it Best Ayan On Sat, Jun 24, 2017 at 2:09 AM, Weiqing Yang wrote: >

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-25 Thread Ryan
I have to say sorry. I checked the code again: Broadcast is serializable and should be usable within lambdas/inner classes. Actually, according to the javadoc it should be used in this way to avoid serializing the large contained value object. So what's wrong with the first approach? On Sat,
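A sketch of the working pattern Ryan describes: broadcast once on the driver, then dereference .value inside the partition function on the executor, so the closure captures only the broadcast handle rather than the full map (assuming a running SparkContext sc):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.foreachPartition { iter =>
  val m = lookup.value  // resolved locally on the executor
  iter.foreach(x => println(s"$x -> ${m.getOrElse(x, -1)}"))
}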

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread jeff saremi
My specific and immediate need is this: we have a native function wrapped in JNI. To increase performance we'd like to avoid calling it record by record. mapPartitions() gives us the ability to invoke this in bulk. We're looking for a similar approach in SQL.
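For what it's worth, the Spark 2.x Dataset API does expose a mapPartitions with the same bulk shape. A hedged sketch, assuming a SparkSession named spark; callNativeBulk is a hypothetical stand-in for the JNI wrapper:

import spark.implicits._

// placeholder for the real JNI call: takes a whole batch at once
def callNativeBulk(batch: Seq[String]): Seq[String] = batch

val ds = Seq("r1", "r2", "r3").toDS()
val out = ds.mapPartitions { rows =>
  callNativeBulk(rows.toSeq).iterator  // one native call per partition
}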

Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
Hi, I am using Spark Streaming with a 1-minute duration to read data from a Kafka topic, apply transformations, and persist into HDFS. The application is creating a new directory every minute with many partition files (= number of partitions). What parameter do I need to change/configure to persist

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Ryan
OK... for plain SQL, I've no idea other than defining a UDAF. On Mon, Jun 26, 2017 at 10:59 AM, jeff saremi wrote: > My specific and immediate need is this: We have a native function wrapped > in JNI. To increase performance we'd like to avoid calling it record by >