RE: Explanation regarding Spark Streaming

2016-08-05 Thread Mohammed Guller
Assume the batch interval is 10 seconds and the batch processing time is 30 seconds. While Spark Streaming is processing the first batch, the receiver will build up a backlog of 20 seconds' worth of data. By the time Spark Streaming finishes batch #2, the receiver will have 40 seconds' worth of data
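For reference, the batch interval is fixed when the StreamingContext is created, so the only way to keep the backlog from growing is to make each batch finish within that interval (or raise the interval). A minimal spark-shell sketch, with a made-up application name:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 10-second batches: every batch must be processed in under 10 seconds,
    // otherwise the receiver falls roughly 20 more seconds behind per batch
    // in the scenario described above.
    val conf = new SparkConf().setAppName("backlog-demo")
    val ssc = new StreamingContext(conf, Seconds(10))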

Re: 2.0.0: AnalysisException when reading csv/json files with dots in field names

2016-08-05 Thread Kiran Chitturi
Nevermind, there is already a Jira open for this https://issues.apache.org/jira/browse/SPARK-16698 On Fri, Aug 5, 2016 at 5:33 PM, Kiran Chitturi < kiran.chitt...@lucidworks.com> wrote: > Hi, > > During our upgrade to 2.0.0, we found this issue with one of our failing > tests. > > Any csv/json

2.0.0: Hive metastore uses a different version of derby than the Spark package

2016-08-05 Thread Kiran Chitturi
Hi, In 2.0.0, I encountered this error while using the spark-shell. Caused by: java.lang.SecurityException: sealing violation: can't seal package org.apache.derby.impl.services.timer: already loaded Full stacktrace: https://gist.github.com/kiranchitturi/9ae38f07d9836a75f233019eb2b65236 While

2.0.0: AnalysisException when reading csv/json files with dots in field names

2016-08-05 Thread Kiran Chitturi
Hi, During our upgrade to 2.0.0, we found this issue with one of our failing tests. Any csv/json files that contain field names with dots are unreadable using DataFrames. My sample csv file: flag_s,params.url_s > test,http://www.google.com In spark-shell, I ran the following code: scala>
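One hedged workaround sketch while SPARK-16698 (linked in the reply above) is open, assuming an explicit schema is acceptable: supply dot-free column names yourself so the header names containing dots are never used to resolve columns. The file and field names below come from the sample above; "params.url_s" is renamed to "params_url_s":

    import org.apache.spark.sql.types._

    // With a user-supplied schema, the header row is only skipped, not used for naming.
    val schema = StructType(Seq(
      StructField("flag_s", StringType, nullable = true),
      StructField("params_url_s", StringType, nullable = true)))

    val df = spark.read.schema(schema).option("header", "true").csv("sample.csv")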

Re: spark historyserver backwards compatible

2016-08-05 Thread Koert Kuipers
thanks On Fri, Aug 5, 2016 at 5:21 PM, Marcelo Vanzin wrote: > Yes, the 2.0 history server should be backwards compatible. > > On Fri, Aug 5, 2016 at 2:14 PM, Koert Kuipers wrote: > > we have spark 1.5.x, 1.6.x and 2.0.0 job running on yarn > > > > but

Re: spark historyserver backwards compatible

2016-08-05 Thread Marcelo Vanzin
Yes, the 2.0 history server should be backwards compatible. On Fri, Aug 5, 2016 at 2:14 PM, Koert Kuipers wrote: > we have spark 1.5.x, 1.6.x and 2.0.0 job running on yarn > > but yarn can have only one spark history server. > > what to do? is it safe to use the spark 2

spark historyserver backwards compatible

2016-08-05 Thread Koert Kuipers
we have spark 1.5.x, 1.6.x and 2.0.0 jobs running on yarn but yarn can have only one spark history server. what to do? is it safe to use the spark 2 history server to report on spark 1 jobs?
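As Marcelo confirms above, a single Spark 2 history server can serve all of them as long as every application writes its event logs to the same location. A hedged configuration sketch (the HDFS path is a placeholder):

    # spark-defaults.conf for the 1.5.x, 1.6.x and 2.0.0 applications alike
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs:///shared/spark-events

    # on the host running the single (Spark 2.x) history server
    spark.history.fs.logDirectory hdfs:///shared/spark-events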

Re: [Spark 2.0] Error during codegen for Java POJO

2016-08-05 Thread Andy Grove
I tracked this down in the end. It turns out the POJO was not actually defined as 'public' for some reason. It seems like this should be detected as an error prior to generating code though? Thanks, Andy. -- Andy Grove Chief Architect www.agildata.com On Fri, Aug 5, 2016 at 8:28 AM, Andy

RE: Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Saif.A.Ellafi
Hi thanks for the assistance, 1. Standalone 2. df.orderBy(field).limit(5000).write.parquet(...) From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, August 05, 2016 4:33 PM To: Ellafi, Saif A. Cc: user @spark Subject: Re: Applying a limit after orderBy of big

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Jacek Laskowski
Hi Aseem, Ah, so I can't help you in this area. I've never worked with Spark using Java (and honestly don't want to if I don't have to). Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
If I understand your question correctly, the current implementation doesn't allow a starting value, but it's easy enough to pull off with something like: val startval = 1 df.withColumn("id", monotonically_increasing_id() + startval) Two points - your test shows what happens with a single partition.

Re: submitting spark job with kerberized Hadoop issue

2016-08-05 Thread Jacek Laskowski
Just to make things clear...are you using Spark Standalone and Spark on YARN-specific settings? I don't think it's gonna work. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Kmeans dataset initialization

2016-08-05 Thread Tony Lane
I have all the data required for KMeans in a dataset in memory. The standard approach to load this data from a file is spark.read().format("libsvm").load(filename) where the file has data in the format 0 1:0.0 2:0.0 3:0.0 How do I do this from an in-memory dataset that is already present? Any suggestions?
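A hedged spark-shell sketch of one way to do this with the Spark 2.0 ML API: skip the libsvm file entirely and build a DataFrame with a "features" vector column (the default input column for KMeans) straight from an in-memory collection. The sample values and k are made up:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    // Wrap each vector in a Tuple1 so createDataFrame builds a one-column frame.
    val data = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0, 9.0))
    )).toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)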

Re: Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Mich Talebzadeh
Hi, 1. What scheduler are you using: standalone, yarn, etc.? 2. How are you limiting the df output? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Java and SparkSession

2016-08-05 Thread Andy Grove
Ah, you still have to use the JavaSparkContext rather than using the sparkSession.sparkContext ... that makes sense. Thanks for your help. Thanks, Andy. -- Andy Grove Chief Architect www.agildata.com On Fri, Aug 5, 2016 at 12:03 PM, Everett Anderson wrote: > Hi, > > Can

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Should be pretty much the same code for Scala - import java.util.UUID UUID.randomUUID If you need it as a UDF, just wrap it accordingly. Mike On Fri, Aug 5, 2016 at 11:38 AM, Mich Talebzadeh wrote: > On the same token can one generate a UUID like below in Hive > >
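A hedged sketch of the "wrap it as a UDF" step in Scala, assuming df is the DataFrame in question. Note that UUIDs are unique but random, and Spark treats UDFs as deterministic, so persist or write out the result if the same values must be seen again later:

    import java.util.UUID
    import org.apache.spark.sql.functions.udf

    // Zero-argument UDF returning a random UUID string per row.
    val uuidUdf = udf(() => UUID.randomUUID().toString)
    val withUuid = df.withColumn("uuid", uuidUdf())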

flume.thrift.queuesize

2016-08-05 Thread Bhupendra Mishra
Please suggest me where/in which file I should set/configure " flume.thrift.queuesize" Many Thanks!

Applying a limit after orderBy of big dataframe hangs spark

2016-08-05 Thread Saif.A.Ellafi
Hi all, I am working with a 1.5 billion row dataframe in a small cluster and trying to apply an orderBy operation by one of the Long-typed columns. If I limit such output to some number, say 5 million, then trying to count, persist or store the dataframe makes spark crash, losing executors
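A hedged workaround sketch for this kind of top-N over a very large frame: estimate a cutoff for the N-th value with approxQuantile (available in 2.0), filter first, and only sort the much smaller remainder. The column name "ts", the 5 million target and the output path are placeholders:

    import org.apache.spark.sql.functions.col

    val n = 5000000L
    val frac = n.toDouble / df.count()
    // Approximate value at the n-th position; with relative error 0.001 the
    // filtered set may be slightly larger or smaller than n.
    val Array(cutoff) = df.stat.approxQuantile("ts", Array(frac), 0.001)

    val topN = df.filter(col("ts") <= cutoff).orderBy("ts").limit(n.toInt)
    topN.write.parquet("/tmp/top_n.parquet")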

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Carlo . Allocca
Hi Marcelo, Thank you for your help. Problem solved as you suggested. Best Regards, Carlo > On 5 Aug 2016, at 18:34, Marcelo Vanzin wrote: > > On Fri, Aug 5, 2016 at 9:53 AM, Carlo.Allocca > wrote:

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
Yes. This is what I am after. But I have to use the Java API. And using the Java API I was not able to get the .as() function working On Fri, Aug 5, 2016 at 7:09 PM, Jacek Laskowski wrote: > Hi, > > I don't understand where the issue is... > > ➜ spark git:(master) ✗ cat

Re: Java and SparkSession

2016-08-05 Thread Everett Anderson
Hi, Can you say more about what goes wrong? I was migrating my code and began using this for initialization: SparkConf sparkConf = new SparkConf().setAppName(...) SparkSession sparkSession = new SparkSession.Builder().config(sparkConf).getOrCreate(); JavaSparkContext jsc = new

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Carlo . Allocca
I have also executed: mvn dependency:tree |grep log [INFO] | | +- com.esotericsoftware:minlog:jar:1.3.0:compile [INFO] +- log4j:log4j:jar:1.2.17:compile [INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile [INFO] | | +- commons-logging:commons-logging:jar:1.1.3:compile and the POM

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Marcelo Vanzin
On Fri, Aug 5, 2016 at 9:53 AM, Carlo.Allocca wrote: > org.apache.spark:spark-core_2.10:2.0.0 (jar) > org.apache.spark:spark-sql_2.10:2.0.0 >

Avoid Cartesian product in calculating a distance matrix?

2016-08-05 Thread Paschalis Veskos
Hello everyone, I am interested in running an application on Spark that at some point needs to compare all elements of an RDD against all others to create a distance matrix. The RDD is of type and the Pearson correlation is applied to each element against all others, generating a
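If one full copy of the data fits in executor memory, a common alternative to rdd.cartesian is to broadcast the collected elements and compute each row of the matrix locally, so only the small result is shuffled. A hedged sketch with made-up (id, Array[Double]) data and a plain local Pearson helper:

    // Toy data standing in for the real RDD.
    val rdd = sc.parallelize(Seq(
      (1L, Array(1.0, 2.0, 3.0)),
      (2L, Array(2.0, 4.0, 6.1)),
      (3L, Array(9.0, 1.0, 4.0))))

    def pearson(x: Array[Double], y: Array[Double]): Double = {
      val n = x.length
      val mx = x.sum / n
      val my = y.sum / n
      val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
      val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
      val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
      cov / (sx * sy)
    }

    // One copy per executor; each element is compared to all others
    // without a shuffle-heavy cartesian join.
    val all = sc.broadcast(rdd.collect())
    val matrix = rdd.map { case (i, v) =>
      (i, all.value.map { case (j, w) => (j, pearson(v, w)) })
    }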

Re: Add column sum as new column in PySpark dataframe

2016-08-05 Thread Nicholas Chammas
I think this is what you need: df.withColumn('total', sum(df[c] for c in df.columns)) Nic On Thu, Aug 4, 2016 at 9:41 AM Javier Rey jre...@gmail.com wrote: Hi everybody, > > Sorry, I sent the last message incomplete; this is
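The same reduce-over-columns idea for the spark-shell, hedged as a Scala sketch that assumes every column of df is numeric:

    import org.apache.spark.sql.functions.col

    // Row-wise total: add all existing columns together.
    val withTotal = df.withColumn("total", df.columns.map(col).reduce(_ + _))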

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Carlo . Allocca
Please Sean, could you detail the version mismatch? Many thanks, Carlo On 5 Aug 2016, at 18:11, Sean Owen > wrote: You also seem to have a version mismatch here.

Question: collect action returning to driver

2016-08-05 Thread RK Aduri
Rather this is a fundamental question: Was it an architectural constraint that collect action always returns the results to the driver? It is gobbling up all the driver’s memory ( in case of cache ) and why can’t we have an exclusive executor that shares the load and “somehow” merge the

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Ted Yu
One option is to clone the class in your own project. Experts may have better solution. Cheers On Fri, Aug 5, 2016 at 10:10 AM, Carlo.Allocca wrote: > Hi Ted, > > Thanks for the promptly answer. > It is not yet clear to me what I should do. > > How to fix it? > >

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Carlo . Allocca
Hi Ted, Thanks for the prompt answer. It is not yet clear to me what I should do. How do I fix it? Many thanks, Carlo On 5 Aug 2016, at 17:58, Ted Yu > wrote: private[spark] trait Logging {

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mich Talebzadeh
This is a UDF written for Hive to monotonically increment a column by 1 http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFRowSequence.java package org.apache.hadoop.hive.contrib.udf; import org.apache.hadoop.hive.ql.exec.Description; import

Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Ted Yu
In 2.0, Logging became private: private[spark] trait Logging { FYI On Fri, Aug 5, 2016 at 9:53 AM, Carlo.Allocca wrote: > Dear All, > > I would like to ask for your help about the following issue: > java.lang.ClassNotFoundException: > org.apache.spark.Logging > > I

ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Carlo . Allocca
Dear All, I would like to ask for your help about the following issue: java.lang.ClassNotFoundException: org.apache.spark.Logging I checked and the class Logging is not present. Moreover, the line of code where the exception is thrown final

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mich Talebzadeh
On the same token can one generate a UUID like below in Hive hive> select reflect("java.util.UUID", "randomUUID"); OK 587b1665-b578-4124-8bf9-8b17ccb01fe7 thx Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Not that I've seen, at least not in any worker independent way. To guarantee consecutive values you'd have to create a udf or some such that provided a new row id. This probably isn't an issue on small data sets but would cause a lot of added communication on larger clusters / datasets. Mike

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
Tony - From my testing this is built with performance in mind. It's a 64-bit value split between the partition id (upper 31 bits, ~2 billion partitions) and the id counter within a partition (lower 33 bits, ~8 billion ids per partition). There shouldn't be any added communication between the executors and the driver for
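A hedged spark-shell sketch of that layout, assuming df is any DataFrame: the partition id and the per-partition counter can be recovered from the generated value with a shift and a mask:

    import org.apache.spark.sql.functions.{col, monotonically_increasing_id, shiftRight}

    val withId = df.withColumn("id", monotonically_increasing_id())
    val partId = shiftRight(col("id"), 33)                 // upper 31 bits
    val offset = col("id").bitwiseAND((1L << 33) - 1)      // lower 33 bits
    val decoded = withId.withColumn("partition_id", partId).withColumn("offset_in_partition", offset)
    decoded.show(5)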

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mich Talebzadeh
Thanks Mike for this. This is Scala. As expected it adds the id column to the end of the column list, starting from 0. scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id()).show(2)

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread janardhan shetty
Mike, Any suggestions on doing it for consecutive ids? On Aug 5, 2016 9:08 AM, "Tony Lane" wrote: > Mike. > > I have figured how to do this . Thanks for the suggestion. It works > great. I am trying to figure out the performance impact of this. > > thanks again > > >

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Mike. I have figured out how to do this. Thanks for the suggestion. It works great. I am trying to figure out the performance impact of this. thanks again On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane wrote: > @mike - this looks great. How can i do this in java ? what

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
@mike - this looks great. How can I do this in Java? What is the performance implication on a large dataset? @sonal - I can't have a collision in the values. On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger wrote: > You can use the monotonically_increasing_id

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Mike Metzger
You can use the monotonically_increasing_id method to generate guaranteed unique (but not necessarily consecutive) IDs. Calling something like: df.withColumn("id", monotonically_increasing_id()) You don't mention which language you're using but you'll need to pull in the sql.functions
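For the spark-shell, a minimal sketch with the needed import (df stands for whatever DataFrame gets the new column):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    val withId = df.withColumn("id", monotonically_increasing_id())
    withId.show(5)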

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Sonal Goyal
Hi Tony, Would hash on the bid work for you? hash(cols: Column*): Column
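A hedged spark-shell sketch of that suggestion: functions.hash computes a 32-bit Murmur3 hash, so it is compact but, as Tony points out elsewhere in the thread, collisions are possible:

    import org.apache.spark.sql.functions.{col, hash}

    val withHash = df.withColumn("bid_hash", hash(col("bid")))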

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-05 Thread Luciano Resende
On Thursday, August 4, 2016, Sean Owen wrote: > You're looking for http://bahir.apache.org/ > > And we should have a release supporting Spark 2.0 in a few days... Thanks -- Luciano -- Sent from my Mobile device

[Spark 2.0] Error during codegen for Java POJO

2016-08-05 Thread Andy Grove
Hi, I've run into another issue upgrading a Spark example written in Java from Spark 1.6 to 2.0. The code is here: https://github.com/AgilData/apache-spark-examples/blob/spark_2.0/src/main/java/JRankCountiesBySexUsingDataset.java The runtime error is:

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Ayan - basically I have a dataset with this structure, where the bid values are unique strings: bid: String, val: Integer. I need unique int values for these string bids to do some processing in the dataset, like: id: Int (unique integer id for each bid), bid: String, val: Integer -Tony On Fri, Aug 5,

submitting spark job with kerberized Hadoop issue

2016-08-05 Thread Aneela Saleem
Hi all, I'm trying to connect to Kerberized Hadoop cluster using spark job. I have kinit'd from command line. When i run the following job i.e., *./bin/spark-submit --keytab /etc/hadoop/conf/spark.keytab --principal spark/hadoop-master@platalyticsrealm --class com.platalytics.example.spark.App

Re: pyspark on pycharm on WINDOWS

2016-08-05 Thread pseudo oduesp
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). 16/08/05 15:37:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Jacek Laskowski
Hi, I don't understand where the issue is... ➜ spark git:(master) ✗ cat csv-logs/people-1.csv name,city,country,age,alive Jacek,Warszawa,Polska,42,true val df = spark.read.option("header", true).csv("csv-logs/people-1.csv") val nameCityPairs = df.select('name, 'city).as[(String, String)]

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-05 Thread Flavio
That's the workaround for the code as per above: SparkConf conf = new SparkConf().set("spark.sql.warehouse.dir", "file:///C:/Users/marchifl/scalaWorkspace/SparkStreamingApp2/spark-warehouse"); SparkSession spark = SparkSession .builder()

pyspark on pycharm on WINDOWS

2016-08-05 Thread pseudo oduesp
Hi, I configured PyCharm as described on Stack Overflow with SPARK_HOME and HADOOP_CONF_DIR, and downloaded winutils to use it with the prebuilt version of Spark 2.0 (pyspark 2.0), and I get this error. If you can help me find a solution, thanks

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread ayan guha
Hi Can you explain a little further? best Ayan On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane wrote: > I have a row with structure like > > identifier: String > value: int > > All identifier are unique and I want to generate a unique long id for the > data and get a row

Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
I have a row with structure like identifier: String value: int All identifiers are unique and I want to generate a unique long id for the data and get a row object back for further processing. I understand using the zipWithUniqueId function on RDD, but that would mean first converting to RDD and

Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
I need to use a few columns out of a csv. But as there is no option to read only a few columns out of a csv: 1. I am reading the whole CSV using SparkSession.csv() 2. selecting a few of the columns using DataFrame.select() 3. applying a schema using the .as() function of Dataset. I tried to extend

Re: How to set nullable field when create DataFrame using case class

2016-08-05 Thread Jacek Laskowski
Hi, Seems so. It's equivalent to Seq(MyProduct(new Timestamp(0), 10)).toDS.printSchema (and now I'm wondering why I didn't pick this variant) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at
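For completeness, a hedged spark-shell sketch of one way to relax nullability after the fact, since primitive case-class fields come out as nullable = false: rebuild the DataFrame against an edited copy of its schema (MyProduct mirrors the example in the thread):

    import java.sql.Timestamp
    import org.apache.spark.sql.types.StructType
    import spark.implicits._

    case class MyProduct(time: Timestamp, amount: Int)

    val df = Seq(MyProduct(new Timestamp(0), 10)).toDF()
    // Flip every field to nullable = true and re-apply the schema.
    val relaxedSchema = StructType(df.schema.map(_.copy(nullable = true)))
    val relaxed = spark.createDataFrame(df.rdd, relaxedSchema)
    relaxed.printSchema()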

Re: how to run local[k] threads on a single core

2016-08-05 Thread sujeet jog
Thanks, Since I'm running in local mode, I plan to pin down the JVM to a CPU with taskset -cp; hopefully with this all the tasks should operate on the specified CPU cores. Thanks, Sujeet On Thu, Aug 4, 2016 at 8:11 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > You could

Re: What is "Developer API " in spark documentation?

2016-08-05 Thread Ted Yu
See previous discussion : http://search-hadoop.com/m/q3RTtTvrPrc6O2h1=Re+discuss+separate+API+annotation+into+two+components+InterfaceAudience+InterfaceStability > On Aug 5, 2016, at 2:55 AM, Aseem Bansal wrote: > > Hi > > Many of spark documentation say "Developer API".

What is "Developer API " in spark documentation?

2016-08-05 Thread Aseem Bansal
Hi Many of spark documentation say "Developer API". What does that mean?

Re: How to set nullable field when create DataFrame using case class

2016-08-05 Thread Mich Talebzadeh
Hi Jacek, Is this line correct? spark.createDataset(Seq(MyProduct(new Timestamp(0), 10))).printSchema Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: pyspark pickle error when using itertools.groupby

2016-08-05 Thread Eike von Seggern
Hello, `itertools.groupby` is evaluated lazily and the `g`s in your code are generators not lists. This might cause your problem. Casting everything to lists might help here, e.g.: grp2 = [(k, list(g)) for k,g in groupby(grp1, lambda e: e[1])] HTH Eike 2016-08-05 7:31 GMT+02:00 林家銘

Re: How to set nullable field when create DataFrame using case class

2016-08-05 Thread Jacek Laskowski
Hi Michael, Since we're at it, could you please point at the code where the optimization happens? I assume you're talking about Catalyst when doing whole-stage code generation for queries. Is this nullability (NULL value) propagation perhaps? I'd appreciate it (hoping that would improve my understanding of the

Dataframe insertInto(tableName: String): Unit :Failure Scenario

2016-08-05 Thread java bigdata
Hi Team, I'm using the Dataframe.insertInto(tableName: String): Unit api for hive table insertions. I would like to know what happens when, out of 500 records, 100 records are inserted and a failure occurs at the 101st record. Will the 100 records be inserted into Hive or will they be rolled back, and any

Bug: Spark Streaming Application Failure Recovery Failed on Windows

2016-08-05 Thread 张梓轩
Hello Spark Group: I'm in trouble when using Spark & Yarn on *Windows*. Here is a brief summary of my conclusions: - The Spark Streaming application will be dead once it tries to read data from the write-ahead log on Windows. - When the driver recovers from failure, it will read data from

Re: Writing all values for same key to one file

2016-08-05 Thread colzer
In my opinion,"Append to a file" maybe is not good idea. By using `MultipleTextOutputFormat`, you can append all values for a given key to a directory for example: class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] { override def