Re: Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
If anyone wants to watch the recording: https://www.youtube.com/watch?v=lugG_2QU6YU I'll do one next week as well - March 16th @ 11am - https://www.youtube.com/watch?v=pXzVtEUjrLc

Re: Upgrades of streaming jobs

2018-03-09 Thread Tathagata Das
Yes, all checkpoints are forward compatible. However, you do need to restart the query if you want to update the code of the query. This downtime can be less than a second (if you just restart the query without stopping the application/Spark driver) or 10s of seconds (if you have to stop the
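
A minimal sketch of restarting just the query while keeping the driver up; the Kafka source, topic, and paths are assumptions, and the restarted query recovers offsets/state from the same checkpoint directory:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder.appName("query-restart-sketch").getOrCreate()

def startQuery(): StreamingQuery = spark.readStream
  .format("kafka")                               // assumed source
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/data/out")                   // hypothetical output path
  .option("checkpointLocation", "/chk/events")   // reused across restarts
  .start()

var query = startQuery()

// later, to pick up updated query logic without stopping the driver:
query.stop()          // stops only this query, not the SparkSession
query = startQuery()  // restarts and recovers from the checkpoint
```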

Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
Hi folks, If you're curious about learning more about how Spark is developed, I'm going to experiment with doing a live code review where folks can watch and see how that part of our process works. I have two volunteers already for having their PRs looked at live, and if you have a Spark PR you're working on

Issue with using Generalized Linear Regression for Logistic Regression modeling

2018-03-09 Thread FireFly
The Logistic Regression (LR) offered by Spark has rather limited model statistics output. I would like to have access to q-value, AIC, standard error, etc. Generalized Linear Regression (GLR) does offer these statistics in the model output, and can be used as an LR if one specifies
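
A minimal sketch of using GLR as a logistic regression, assuming a training DataFrame `train` with a `features` vector column and a binary `label` column (both names are assumptions):

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// binomial family + logit link behaves like logistic regression
val glr = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setLink("logit")
  .setMaxIter(25)

val model = glr.fit(train)

// the training summary exposes the extra statistics
val summary = model.summary
println(summary.aic)
println(summary.coefficientStandardErrors.mkString(", "))
println(summary.pValues.mkString(", "))
```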

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Vadim Semenov
But overall, I think the original approach is not correct. If you end up with a single file in the tens of GB, the approach probably needs to be reworked. I don't see why you can't just write multiple CSV files using Spark, and then concatenate them without Spark.
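
A rough sketch of that approach, with hypothetical paths; the concatenation step happens outside Spark:

```scala
// write many part-*.csv files in parallel
df.write
  .option("header", "false")   // headers would repeat per part file if enabled
  .csv("/data/out_parts")      // hypothetical path

// then concatenate outside Spark, e.g. from a shell:
//   hadoop fs -getmerge /data/out_parts /local/single.csv
// or on a local filesystem:
//   cat /data/out_parts/part-*.csv > /local/single.csv
```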

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Silvio Fiorito
Given that you start with ~250MB but end up with 58GB, it seems like you're generating quite a bit of data. Whether you use coalesce or repartition, writing out 58GB with one core is still going to take a while. Using Spark to do pre-processing but output a single file is not going to be very

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Vadim Semenov
You can use `.checkpoint` for that. `df.sort(…).coalesce(1).write...`: `coalesce` will make `sort` run with only one partition, so sorting will take a lot of time. `df.sort(…).repartition(1).write...`: `repartition` will add an explicit stage, but the sort order will be lost, since it's a repartition
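
A minimal sketch of the `.checkpoint` variant, assuming a checkpoint directory can be set and with hypothetical column and path names:

```scala
import org.apache.spark.sql.functions.col

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical dir

val sorted = df.sort(col("someColumn"))  // sort runs with full parallelism
  .checkpoint()                          // materializes the sorted data, breaking the lineage

sorted.coalesce(1)                       // only the final write runs as a single task
  .write
  .option("header", "true")
  .csv("/data/final_csv")                // hypothetical path
```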

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Deepak Sharma
I would suggest repartitioning it to a reasonable number of partitions, maybe 500, and saving it to some intermediate working directory. Finally, read all the files from this working dir, then coalesce to 1 and save to the final location. Thanks, Deepak
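
A sketch of that suggestion, with hypothetical paths and 500 as the example partition count:

```scala
// expensive work runs with full parallelism and is written out once
df.repartition(500)
  .write.option("header", "true")
  .csv("/data/intermediate_csv")        // hypothetical intermediate dir

// the re-read plus single-file write is the only single-threaded part
spark.read.option("header", "true")
  .csv("/data/intermediate_csv")
  .coalesce(1)
  .write.option("header", "true")
  .csv("/data/final_csv")               // hypothetical final location
```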

Contextual bandits

2018-03-09 Thread ey-chih chow
Hi, Does Spark MLLIB support Contextual Bandit? How can we use Spark MLLIB to implement Contextual Bandit? Thanks. Best regards, Ey-Chih

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Vadim Semenov
That's because `coalesce` gets propagated further up the DAG into the last stage, so your last stage only has one task. You need to break your DAG so that your expensive operations end up in a previous stage, before the stage with `.coalesce(1)`.
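
One way (not necessarily the one intended here) to break the DAG along these lines, sketched with a hypothetical `transformedDf` standing in for the expensive transformations and a hypothetical output path:

```scala
val prepared = transformedDf.cache()
prepared.count()                     // forces the expensive work to run with full parallelism

prepared.coalesce(1)                 // the one-task stage now only reads cached partitions and writes
  .write.option("header", "true")
  .csv("/data/out_csv")              // hypothetical path
```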

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Md. Rezaul Karim
Hi All, Thanks for the prompt response. Really appreciated! Here's some more info: 1. Spark version: 2.3.0, 2. vCores: 8, 3. RAM: 32GB, 4. Deploy mode: Spark standalone. *Operation performed:* I did transformations using StringIndexer on some columns and null imputation. That's all. *Why writing back
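
A minimal sketch of that kind of pre-processing, with hypothetical column names and fill values:

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexed = new StringIndexer()
  .setInputCol("category")          // hypothetical column
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")
  .fit(df)
  .transform(df)

// simple null imputation for numeric and string columns
val imputed = indexed.na.fill(0.0).na.fill("unknown")
```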

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Teemu Heikkilä
Sounds like you're doing something other than just writing the same file back to disk; what does your preprocessing consist of? Sometimes you can save lots of space by using other formats, but now we're talking about an over 200x increase in file size, so depending on the transformations for the data you might

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Gourav Dutta
Which version of Spark are you using? The reason for asking is that from Spark 2.x, CSV is a built-in library, so there is no need to save it with the com.databricks.spark.csv package. Moreover, the time taken for this simple task depends very much on your cluster health. Could you please provide
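
A sketch of the built-in CSV reader/writer in Spark 2.x, with hypothetical paths, instead of the external com.databricks.spark.csv package:

```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/input.csv")          // hypothetical path

df.write
  .option("header", "true")
  .csv("/data/output_csv")         // instead of .format("com.databricks.spark.csv")
```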

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Matteo Durighetto
Hello, try using the Parquet format with compression (like snappy or lz4) so the produced files will be smaller and generate less I/O. Moreover, Parquet is normally faster than CSV to read for further operations. Another possible format is ORC. Kind regards, Matteo
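
A sketch of writing Parquet (and ORC) with compression, with hypothetical output paths:

```scala
df.write
  .option("compression", "snappy")   // or "lz4", "gzip", ...
  .parquet("/data/output_parquet")   // hypothetical path

// ORC is a similar columnar alternative:
df.write
  .option("compression", "snappy")
  .orc("/data/output_orc")           // hypothetical path
```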

Connection SparkStreaming with SchemaRegistry

2018-03-09 Thread Guillermo Ortiz
I'm trying to integrate Schema Registry and Spark Streaming. For the moment I want to use GenericRecords. It seems that my producer works and new schemas are published to the _schemas topic. When I try to read with my consumer, I'm not able to deserialize the data. How could I tell Spark that
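
One commonly used approach (not necessarily what the original setup looks like) is Confluent's KafkaAvroDeserializer with the spark-streaming-kafka-0-10 direct stream; the broker address, registry URL, topic name, and the existing StreamingContext `ssc` are all assumptions in this sketch:

```scala
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"   -> "broker:9092",                 // hypothetical
  "schema.registry.url" -> "http://schema-registry:8081", // hypothetical
  "key.deserializer"    -> classOf[StringDeserializer],
  "value.deserializer"  -> classOf[KafkaAvroDeserializer], // resolves the writer schema from the registry
  "group.id"            -> "spark-consumer",
  "auto.offset.reset"   -> "latest"
)

// `ssc` is an existing StreamingContext
val stream = KafkaUtils.createDirectStream[String, GenericRecord](
  ssc, PreferConsistent, Subscribe[String, GenericRecord](Seq("my-topic"), kafkaParams)
)

stream.map(record => record.value())  // each value arrives as a GenericRecord
```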

Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Md. Rezaul Karim
Dear All, I have a tiny CSV file, which is around 250MB. There are only 30 columns in the DataFrame. Now I'm trying to save the pre-processed DataFrame as another CSV file on disk for later use. However, I'm getting pissed off because writing the resultant DataFrame is taking too long, which is