Re: Spark Testing Library Discussion

2017-04-24 Thread Holden Karau
So 14 people have said they are available on Tuesday the 25th at 1PM Pacific, so we will do this meeting then ( https://doodle.com/poll/69y6yab4pyf7u8bn ). Since Hangouts tends to work OK on the Linux distro I'm running, my default is to host this as a "hangouts-on-air" unless there are alternative

Re: Spark Testing Library Discussion

2017-04-24 Thread Holden Karau
The (tentative) link for those interested is https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue . On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau wrote: > So 14 people have said they are available on Tuesday the 25th at 1PM > pacific so we will do this

Authorizations in thriftserver

2017-04-24 Thread vincent gromakowski
Hi, Can someone confirm authorizations aren't implemented in Spark thriftserver for SQL standard based hive authorizations? https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization If confirmed, any plan to implement it? Thanks

how to create List in pyspark

2017-04-24 Thread Selvam Raman
documentDF = spark.createDataFrame([ ("Hi I heard about Spark".split(" "), ), ("I wish Java could use case classes".split(" "), ), ("Logistic regression models are neat".split(" "), ) ], ["text"]) How can I achieve the same df while I am reading from source? doc =
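
A minimal PySpark sketch of one way to get the same array column when reading from a file, assuming a plain-text source with one sentence per line (the file name is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.getOrCreate()

    # spark.read.text gives a single string column named "value".
    doc = spark.read.text("sentences.txt")

    # Split each line on spaces to get an array<string> column named "text",
    # matching the schema of the hard-coded createDataFrame example above.
    documentDF = doc.select(split(col("value"), " ").alias("text"))
    documentDF.printSchema()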

How to convert DataFrame to JSON String in Java 7

2017-04-24 Thread Devender Yadav
Hi All, How can I convert a DataFrame to a JSON String in Java 7? I am using Spark 1.6.3. I don't want to print it to the console; I need to return the JSON string to another method. Thanks for your attention! Regards, Devender

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Chawla, We hit this issue, too. I worked around it by setting spark.scheduler.executorTaskBlacklistTime=5000. The problem for us was that the scheduler was using locality to select the executor, even though it had already failed there. The executor task blacklist time controls how long the
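
For reference, a sketch of how that workaround conf might be set at application start (Python API); this is the older, pre-2.1 scheduler setting named above, and the value is in milliseconds:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("blacklist-workaround")
            # Keep a failed executor blacklisted for a task for 5 seconds.
            .set("spark.scheduler.executorTaskBlacklistTime", "5000"))
    sc = SparkContext(conf=conf)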

community feedback on RedShift with Spark

2017-04-24 Thread Afshin, Bardia
I wanted to reach out to the community to get an understanding of everyone's experience with maximizing performance, i.e. decreasing load time, when loading multiple large datasets into Redshift. Two approaches: 1. Spark writes files to S3, Redshift COPY from the S3 bucket.

Re: Arraylist is empty after JavaRDD.foreach

2017-04-24 Thread Devender Yadav
Hi Franke, I want to convert DataFrame to JSON String. Regards, Devender From: Jörn Franke Sent: Monday, April 24, 2017 11:15:08 PM To: Devender Yadav Cc: user@spark.apache.org Subject: Re: Arraylist is empty after JavaRDD.foreach I am

Re: Arraylist is empty after JavaRDD.foreach

2017-04-24 Thread Jörn Franke
I am not sure what you are trying to achieve here. You should never use an ArrayList as a global variable the way you use it here (an anti-pattern). Why don't you use the count function of the dataframe? > On 24. Apr 2017, at 19:36, Devender Yadav > wrote: > > Hi All, > >

removing columns from file

2017-04-24 Thread Afshin, Bardia
Hi there, I have a process that downloads thousands of files from an S3 bucket, removes a set of columns from each, and uploads them back to S3. S3 is currently not the bottleneck; having a single-master-node Spark instance is the bottleneck. One approach is to distribute the files on multiple Spark

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
Looking at the code a bit more, it appears that blacklisting is disabled by default. To enable it, set spark.blacklist.enabled=true. The updates in 2.1.0 appear to provide much more fine-grained settings for this, like the number of tasks that can fail before an executor is blacklisted for a
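
A minimal sketch of enabling it on Spark 2.1+ (Python API); the finer-grained limits mentioned above are separate spark.blacklist.* properties and should be checked against the configuration docs for your version:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("blacklist-enabled")
             # Blacklisting is off by default; this turns it on with default thresholds.
             .config("spark.blacklist.enabled", "true")
             .getOrCreate())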

Re: community feedback on RedShift with Spark

2017-04-24 Thread Matt Deaver
Redshift COPY is immensely faster than trying to do insert statements. I did some rough testing of loading data using INSERT and COPY, and COPY is vastly superior, to the point that if speed is at all an issue for your process you shouldn't even consider using INSERT. On Mon, Apr 24, 2017 at 11:07
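
A rough sketch of the "write to S3, then COPY" pattern discussed in this thread; the bucket, table, and IAM role names are placeholders, and the staging format (CSV here) should match what your Redshift cluster supports:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://source-bucket/input/")  # placeholder source

    # 1. Spark stages the data as flat files on S3.
    df.write.mode("overwrite").csv("s3a://staging-bucket/unload/")

    # 2. Redshift loads the staged files with COPY, issued from any SQL client:
    #    COPY analytics.events
    #    FROM 's3://staging-bucket/unload/'
    #    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    #    CSV;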

Re: community feedback on RedShift with Spark

2017-04-24 Thread Aakash Basu
Hey Afshin, Your option 1 is far faster than the latter. It speeds up further if you know how to properly use distKey and sortKey on the tables being loaded. Thanks, Aakash. https://www.linkedin.com/in/aakash-basu-5278b363 On 24-Apr-2017 10:37 PM, "Afshin, Bardia"

Arraylist is empty after JavaRDD.foreach

2017-04-24 Thread Devender Yadav
Hi All, I am using Spark 1.6.2 and Java 7. Sample json (total 100 records): {"name":"dev","salary":1,"occupation":"engg","address":"noida"} {"name":"karthik","salary":2,"occupation":"engg","address":"noida"} Useful code: final List> jsonData = new

Re: How to maintain order of key-value in DataFrame same as JSON?

2017-04-24 Thread Afshin, Bardia
Is there an API available to do this via SparkSession? Sent from my iPhone On Apr 24, 2017, at 6:20 AM, Devender Yadav > wrote: Thanks Hemanth for a quick reply. From: Hemanth Gudela

udf that handles null values

2017-04-24 Thread Zeming Yu
hi all, I tried to write a UDF that handles null values: def getMinutes(hString, minString): if (hString != None) & (minString != None): return int(hString) * 60 + int(minString[:-1]) else: return None flight2 = (flight2.withColumn("duration_minutes", udfGetMinutes("duration_h",
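
A sketch of how the UDF might be written and registered so it tolerates nulls, keeping the original arithmetic and assuming both inputs are string columns; the second column name ("duration_m") is a guess, since the original line is cut off:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def get_minutes(h_string, min_string):
        # Compare to None with "is not", and guard both arguments before converting.
        if h_string is not None and min_string is not None:
            return int(h_string) * 60 + int(min_string[:-1])
        return None

    # Registering with an explicit return type keeps the column a proper integer.
    udf_get_minutes = udf(get_minutes, IntegerType())

    flight2 = flight2.withColumn(
        "duration_minutes",
        udf_get_minutes("duration_h", "duration_m"))  # "duration_m" is assumed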

one hot encode a column of vector

2017-04-24 Thread Zeming Yu
How do I do one-hot encoding on a column of arrays? e.g. ['TG', 'CA'] FYI here's my code for one-hot encoding normal categorical columns. How do I make it work for a column of arrays? from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer indexers =

pyspark vector

2017-04-24 Thread Zeming Yu
Hi all, Beginner question: what does the 3 mean in the (3,[0,1,2],[1.0,1.0,1.0])? https://spark.apache.org/docs/2.1.0/ml-features.html id | texts | vector |-|--- 0 | Array("a", "b", "c")|
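
For what it's worth, (3,[0,1,2],[1.0,1.0,1.0]) is the printed form of a sparse vector, and the 3 is the vector's length (the vocabulary size in the CountVectorizer example from those docs). A small sketch showing the equivalence:

    from pyspark.ml.linalg import Vectors

    # Length 3, with value 1.0 at indices 0, 1 and 2 --
    # the same as the dense vector [1.0, 1.0, 1.0].
    sv = Vectors.sparse(3, [0, 1, 2], [1.0, 1.0, 1.0])
    print(sv)            # (3,[0,1,2],[1.0,1.0,1.0])
    print(sv.toArray())  # [1. 1. 1.]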

Spark-SQL Query Optimization: overlapping ranges

2017-04-24 Thread Lavelle, Shawn
Hello Spark Users! Does the Spark optimization engine reduce overlapping column ranges? If so, should it push this down to a Data Source? Example: Select * from table where col between 3 and 7 OR col between 5 and 9 reduces to: Select * from table where col between 3 and 9
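
One way to see what the optimizer actually produces, and which predicates reach the data source, is to inspect the plans; a small sketch, assuming an existing SparkSession named spark (the source path and column name are placeholders):

    # The PushedFilters section of the physical plan's scan node shows what was
    # pushed down to the data source; the optimized plan shows any simplification.
    df = spark.read.parquet("s3a://bucket/table/")  # placeholder source
    q = df.filter("col BETWEEN 3 AND 7 OR col BETWEEN 5 AND 9")
    q.explain(True)  # prints parsed, analyzed, optimized and physical plans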

Re: udf that handles null values

2017-04-24 Thread cy h
Quoting Python's Coding Style Guidelines - PEP-008 https://www.python.org/dev/peps/pep-0008/#programming-recommendations Comparisons to singletons like None should always be done with is or is not, never the equality operators. Cinyoung 2017. 4. 25. 9:22 AM Zeming Yu

Re: udf that handles null values

2017-04-24 Thread Pushkar.Gujar
Someone had similar issue today at stackoverflow. http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728 Thank you, *Pushkar Gujar* On Mon, Apr 24, 2017 at 8:22 PM, Zeming Yu wrote: > hi

Re: Cannot convert from JavaRDD to Dataframe

2017-04-24 Thread Radhwane Chebaane
Hi, DataTypes is a Scala Array, which corresponds in Java to a Java array. So you must use a String[]. However, since RowFactory.create expects an array of Objects as the column contents, it should be: public Row call(String line){ return RowFactory.create(new String[][]{line.split(" ")}); }

Re: Off heap memory settings and Tungsten

2017-04-24 Thread Saisai Shao
AFAIK, the off-heap memory setting is not enabled automatically; there are two configurations that control Tungsten off-heap memory usage: 1. spark.memory.offHeap.enabled. 2. spark.memory.offHeap.size. On Sat, Apr 22, 2017 at 7:44 PM, geoHeil wrote: > Hi,
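
A minimal sketch of turning both on (Python API); spark.memory.offHeap.size is in bytes, and the 2 GB value here is only an example:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("offheap-example")
             .config("spark.memory.offHeap.enabled", "true")
             .config("spark.memory.offHeap.size", str(2 * 1024 * 1024 * 1024))
             .getOrCreate())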

accessing type signature

2017-04-24 Thread Bulldog20630405
When running Spark from spark-shell, each time a variable is defined the shell prints out the type signature of that variable along with the toString of the instance. How can I programmatically generate the same signature without using the shell (for debugging purposes) from a Spark script or

Re: How to convert Dstream of JsonObject to Dataframe in spark 2.1.0?

2017-04-24 Thread Sam Elamin
You have 2 options: 1) Clean -> Write your own parser to go through each property and create a dataset. 2) Hacky but simple -> Convert to a JSON string then read it in using spark.read.json(jsonString). Please bear in mind the second option is expensive, which is why it is hacky. I wrote my own parser here
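
A PySpark-flavoured sketch of option 2 (the original thread uses Gson in Java, so this only illustrates the shape of the approach); "stream" is an assumed DStream whose records can be serialized with json.dumps:

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def process(rdd):
        if rdd.isEmpty():
            return
        # Serialize each record to a JSON string and let spark.read.json infer the
        # schema -- convenient, but it pays for schema inference on every batch.
        df = spark.read.json(rdd.map(json.dumps))
        df.show()

    # stream.foreachRDD(process)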

How to convert Dstream of JsonObject to Dataframe in spark 2.1.0?

2017-04-24 Thread kant kodali
Hi All, How to convert Dstream of JsonObject to Dataframe in spark 2.1.0? That JsonObject is from Gson Library. Thanks!

Re: How to convert Dstream of JsonObject to Dataframe in spark 2.1.0?

2017-04-24 Thread kant kodali
Thanks sam! On Mon, Apr 24, 2017 at 1:50 AM, Sam Elamin wrote: > you have 2 options > 1 )Clean ->Write your own parser to through each property and create a > dataset > 2) Hacky but simple -> Convert to json string then read in using > spark.read.json(jsonString) > >

Re: Arraylist is empty after JavaRDD.foreach

2017-04-24 Thread Michael Armbrust
Foreach runs on the executors and so is not able to modify an array list that is only present on the driver. You should just call collectAsList on the DataFrame. On Mon, Apr 24, 2017 at 10:36 AM, Devender Yadav < devender.ya...@impetus.co.in> wrote: > Hi All, > > > I am using Spark 1.6.2 and

Re: splitting a huge file

2017-04-24 Thread Steve Loughran
> On 21 Apr 2017, at 19:36, Paul Tremblay wrote: > > We are tasked with loading a big file (possibly 2TB) into a data warehouse. > In order to do this efficiently, we need to split the file into smaller files. > > I don't believe there is a way to do this with Spark,

Re: Questions related to writing data to S3

2017-04-24 Thread Steve Loughran
On 23 Apr 2017, at 19:49, Richard Hanson > wrote: I have a streaming job which writes data to S3. I know there are saveAs functions helping write data to S3. But it bundles all elements then writes out to S3. use Hadoop 2.8.x binaries and

Spark declines Mesos offers

2017-04-24 Thread Pavel Plotnikov
Hi, everyone! I run Spark 2.1.0 jobs on top of a Mesos cluster in coarse-grained mode with dynamic resource allocation. Sometimes the Spark Mesos scheduler declines Mesos offers despite the fact that not all available resources are used (I have fewer workers than the possible maximum) and the

Re: Spark Mlib - java.lang.OutOfMemoryError: Java heap space

2017-04-24 Thread Selvam Raman
This is where the job goes out of memory: 17/04/24 10:09:22 INFO TaskSetManager: Finished task 122.0 in stage 1.0 (TID 356) in 4260 ms on ip-...-45.dev (124/234) 17/04/24 10:09:26 INFO BlockManagerInfo: Removed taskresult_361 on ip-10...-185.dev:36974 in memory (size: 5.2 MB, free: 8.5 GB) 17/04/24

Spark Mlib - java.lang.OutOfMemoryError: Java heap space

2017-04-24 Thread Selvam Raman
Hi, I have 1 master and 4 slave nodes. Input data size is 14 GB. Slave node config: 32 GB RAM, 16 cores. I am trying to train a word-embedding model using Spark. It is going out of memory. To train on 14 GB of data, how much memory do I require? I have given 20 GB per executor but below shows it is using

Spark registered view in "Future" - View changes updated in "Future" are lost in main thread

2017-04-24 Thread Hemanth Gudela
Hi, I’m trying to write a background thread using “Future” which would periodically re-register a view with the latest data from an underlying database table. However, the data changes made in the “Future” thread are lost in the main thread. In the below code, 1. In the beginning, registered view

Re: how to add new column using regular expression within pyspark dataframe

2017-04-24 Thread Yan Facai
Don't use udf, as `minute` and `unix_timestamp` are native methods of spark.sql. scala> df.withColumn("minute", minute(unix_timestamp($"str", "HH'h'mm'm'").cast("timestamp"))).show On Tue, Apr 25, 2017 at 7:55 AM, Zeming Yu wrote: > I tried this, but doesn't seem to
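
For the PySpark side (the original question was PySpark), the equivalent of the Scala line above, assuming a string column named "str" holding values like "2h30m":

    from pyspark.sql.functions import minute, unix_timestamp, col

    df = df.withColumn(
        "minute",
        # Parse the "2h30m"-style string with a custom format, cast to timestamp,
        # then extract the minute component.
        minute(unix_timestamp(col("str"), "HH'h'mm'm'").cast("timestamp")))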

Re: one hot encode a column of vector

2017-04-24 Thread Yan Facai
How about using CountVectorizer? http://spark.apache.org/docs/latest/ml-features.html#countvectorizer On Tue, Apr 25, 2017 at 9:31 AM, Zeming Yu wrote: > how do I do one hot encode on a column of array? e.g. ['TG', 'CA'] > > > FYI here's my code for one hot encoding
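
A small sketch of CountVectorizer on an array-of-strings column; "stops" is an assumed column name holding values like ['TG', 'CA'], and binary=True makes the output behave like a multi-hot encoding rather than raw counts:

    from pyspark.ml.feature import CountVectorizer

    cv = CountVectorizer(inputCol="stops", outputCol="stops_vec", binary=True)
    model = cv.fit(df)                 # learns the vocabulary over all arrays
    encoded = model.transform(df)      # adds a sparse vector column "stops_vec"
    encoded.select("stops", "stops_vec").show(truncate=False)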

How to maintain order of key-value in DataFrame same as JSON?

2017-04-24 Thread Devender Yadav
Hi All, Sample JSON data: {"name": "dev","salary": 100,"occupation": "engg","address": "noida"} {"name": "karthik","salary": 200,"occupation": "engg","address": "blore"} Spark Java code: DataFrame df = sqlContext.read().json(jsonPath); df.printSchema(); df.show(false); Output: root |--

Re: How to maintain order of key-value in DataFrame same as JSON?

2017-04-24 Thread Hemanth Gudela
Hi, One option, if it works for you, is to force df to use the schema order you prefer, like this: DataFrame df = sqlContext.read().json(jsonPath).select("name","salary","occupation","address") -Hemanth From: Devender Yadav Date: Monday, 24 April 2017 at 15.45 To:

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-24 Thread Gene Pang
As Vincent mentioned, Alluxio helps with sharing data across different Spark contexts. This blog post about Spark dataframes and Alluxio discusses that use case. Thanks, Gene On Sat, Apr 22, 2017 at 2:14 AM, vincent gromakowski <

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-24 Thread Hemanth Gudela
Hello Gene, Thanks, but Alluxio did not solve my Spark streaming use case because my source parquet files in Alluxio in-memory are not ”appended” but are periodically being ”overwritten” due to the nature of the business need. Spark jobs fail when trying to read parquet files at the same time when

Re: How to maintain order of key-value in DataFrame same as JSON?

2017-04-24 Thread Devender Yadav
Thanks Hemanth for a quick reply. From: Hemanth Gudela Sent: Monday, April 24, 2017 6:37:48 PM To: Devender Yadav; user@spark.apache.org Subject: Re: How to maintain order of key-value in DataFrame same as JSON? Hi, One option to use

Re: Spark declines Mesos offers

2017-04-24 Thread Michael Gummelt
Have you run with debug logging? There are some hints in the debug logs: https://github.com/apache/spark/blob/branch-2.1/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L316 On Mon, Apr 24, 2017 at 4:53 AM, Pavel Plotnikov <

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-24 Thread vincent gromakowski
Look at Spark jobserver's namedRDDs, which are supposed to be thread safe... 2017-04-24 16:01 GMT+02:00 Hemanth Gudela : > Hello Gene, > > > > Thanks, but Alluxio did not solve my spark streaming use case because my > source parquet files in Alluxio in-memory are not

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Dongjin Lee
Sumit, I think the post below describes exactly your case. https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/ Regards, Dongjin -- Dongjin Lee Software developer in Line+. So interested in massive-scale machine learning. facebook:

Re: pyspark vector

2017-04-24 Thread Peyman Mohajerian
setVocabSize On Mon, Apr 24, 2017 at 5:36 PM, Zeming Yu wrote: > Hi all, > > Beginner question: > > what does the 3 mean in the (3,[0,1,2],[1.0,1.0,1.0])? > > https://spark.apache.org/docs/2.1.0/ml-features.html > > id | texts | vector >

Re: udf that handles null values

2017-04-24 Thread Zeming Yu
Thank you both! Here's the code that's working now. It's a bit hard to read due to so many functions. Any idea how I can improve the readability? from pyspark.sql.functions import trim, when, from_unixtime, unix_timestamp, minute, hour duration_test = flight2.select("stop_duration1")
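
One way to make such a chain easier to follow is to name the intermediate column expressions before using them; a sketch under assumed formats and column names (only "stop_duration1" and "flight2" come from the snippet above):

    from pyspark.sql.functions import trim, unix_timestamp, hour, minute, col

    # Name the pieces so each withColumn stays short and self-describing.
    stop_ts = unix_timestamp(trim(col("stop_duration1")), "HH'h'mm'm'").cast("timestamp")
    stop_minutes = hour(stop_ts) * 60 + minute(stop_ts)

    duration_test = (flight2
                     .withColumn("stop_minutes", stop_minutes)
                     .select("stop_duration1", "stop_minutes"))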

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Chawla,Sumit
Thanks a lot @Dongjin, @Ryan. I am using Spark 1.6. I agree with your assessment, Ryan. Further investigation seemed to suggest that our cluster was probably at 100% capacity at that point in time. Though tasks were failing on that slave, it was still accepting tasks, and task retries