Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
Spark but you'll probably have a better time handling duplicates in the service that reads from Kafka. On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart wrote: Hi, we are trying to build a spark streaming solution that su

Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi, we are trying to build a spark streaming solution that subscribes and pushes to kafka. But we are running into the problem of duplicate events. Right now, I am doing a “forEachRdd”, looping over the messages of each partition, and sending those messages to kafka. Is there any good way of solving that
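A minimal sketch of the per-partition pattern described above, assuming the kafka-python client, a hypothetical topic name "events", and a hypothetical broker address (none of these come from the thread):

    from kafka import KafkaProducer  # assumption: kafka-python is the client in use

    def send_partition(messages):
        # Create the producer inside the closure so it is instantiated on
        # the executor rather than serialized from the driver.
        producer = KafkaProducer(bootstrap_servers="broker:9092")  # hypothetical broker
        for msg in messages:
            producer.send("events", msg.encode("utf-8"))  # hypothetical topic
        producer.flush()
        producer.close()

    stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

Note that this pattern is at-least-once by construction: a retried task resends every message in its partition, which is exactly the duplicate problem raised here, so deduplicating in the downstream consumer (e.g., by message key) is the usual complement.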

.tar.bz2 in spark

2016-12-08 Thread Maurin Lenglart
Hi, I am trying to load a json file compressed in .tar.bz2 but spark throws an error. I am using pyspark with spark 1.6.2 (Cloudera 5.9). What would be the best way to handle that? I don’t want to have a non-spark job that will just uncompress the data… thanks
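Spark reads plain .json.bz2 files natively, but a .tar archive inside the bz2 stream needs manual unpacking. One possible sketch, using binaryFiles plus Python's tarfile module (the path and archive layout are assumptions):

    import io
    import json
    import tarfile

    def extract_json_records(pair):
        path, data = pair  # binaryFiles yields (filename, whole-file bytes)
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as tar:
            for member in tar.getmembers():
                if member.isfile():
                    # Assumes each member is newline-delimited JSON.
                    for line in tar.extractfile(member):
                        yield json.loads(line.decode("utf-8"))

    records = sc.binaryFiles("/data/*.tar.bz2").flatMap(extract_json_records)

The caveat is that each archive is read into executor memory whole, so this only suits archives that fit comfortably in a task's heap.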

SizeEstimator for python

2016-08-15 Thread Maurin Lenglart
Hi, Is there a way to estimate the size of a dataframe in python? Something similar to https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html ? thanks
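There is no Python binding for SizeEstimator; a crude sample-and-extrapolate sketch is below (it measures Python object sizes, not serialized or on-disk size, so treat it as an order-of-magnitude estimate only):

    import sys

    def estimate_df_size_bytes(df, fraction=0.01):
        # Extrapolate from the in-memory size of a small sample's rows.
        total_rows = df.count()
        sample = df.sample(False, fraction).collect()
        if not sample:
            return 0
        avg_row = sum(sys.getsizeof(str(r)) for r in sample) / float(len(sample))
        return int(avg_row * total_rows)

Another common workaround is to persist the dataframe and read its size off the Storage tab of the Spark UI.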

dynamic coalesce to pick file size

2016-07-26 Thread Maurin Lenglart
Hi, I am doing a Sql query that returns a Dataframe. Then I am writing the result of the query using “df.write”, but the result gets written in a lot of small files (~100 files of 200 KB each). So now I am doing a “.coalesce(2)” before the write. But the number “2” that I picked is static; is there
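One sketch for deriving the coalesce factor dynamically instead of hard-coding 2: sum the input size through the Hadoop FileSystem API (reached via the unofficial _jvm/_jsc handles) and divide by a target file size. The paths and the 128 MB target are assumptions:

    TARGET_FILE_BYTES = 128 * 1024 * 1024  # desired output file size

    def num_output_partitions(sc, input_path):
        hadoop = sc._jvm.org.apache.hadoop
        fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        # Sum the sizes of all files directly under the input path.
        total_bytes = sum(f.getLen() for f in fs.listStatus(hadoop.fs.Path(input_path)))
        return max(1, int(total_bytes / TARGET_FILE_BYTES))

    df.coalesce(num_output_partitions(sc, "/path/to/input")).write.parquet("/path/to/output")

Since compression and filtering change the output size relative to the input, the target usually needs an empirical fudge factor.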

Re: Spark Website

2016-07-13 Thread Maurin Lenglart
Same here From: Benjamin Kim Date: Wednesday, July 13, 2016 at 11:47 AM To: manish ranjan Cc: user Subject: Re: Spark Website It takes me to the directories instead of the webpage. On Jul 13, 2016, at 11:45 AM, manish ranjan <cse1.man...@gmail.com> wrote: working for me. What do you

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2”) take 8 seconds. thanks From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 17, 2016 at 2:52 PM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark"

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 17, 2016 at 2:22 PM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: orc vs parquet aggregation, orc is really slow Hi Maurin, Have you tried to create

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
Date: Saturday, April 16, 2016 at 4:14 AM To: maurin lenglart <mau...@cuberonlabs.com>, "user @spark" <user@spark.apache.org> Subject: Re: orc vs parquet aggregation, orc is really slow Apologies, that should read desc formatted. Example for table dummy hive

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
using the latest release of cloudera and I didn’t modify any version. Do you think that I should try to manually update hive? thanks From: Jörn Franke <jornfra...@gmail.com> Date: Saturday, April 16, 2016 at 1:02 AM To: maurin lenglart <mau...@cuberonlabs.com> Cc:

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Thank you for your answer. From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Saturday, April 16, 2016 at 12:32 AM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: orc vs parquet aggregation, orc

orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Maurin Lenglart
Hi, I am executing one query: “SELECT `event_date` as `event_date`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2” My table was created something like: CREATE TABLE
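A sketch of how the two formats in this thread could be compared side by side, saving the same data as Parquet and as ORC and timing the aggregation against each (the table and column names are from the thread; the derived table names are assumptions, and ORC support in 1.6 requires a HiveContext):

    import time

    df = sqlContext.table("myTable")
    df.write.format("parquet").saveAsTable("myTable_parquet")
    df.write.format("orc").saveAsTable("myTable_orc")

    query = ("SELECT `event_date`, sum(`bookings`) AS `bookings`, "
             "sum(`dealviews`) AS `dealviews` FROM {t} "
             "WHERE `event_date` >= '2016-01-06' AND `event_date` <= '2016-04-02' "
             "GROUP BY `event_date` LIMIT 2")

    for t in ("myTable_parquet", "myTable_orc"):
        start = time.time()
        sqlContext.sql(query.format(t=t)).collect()
        print(t, time.time() - start)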

Re: alter table add columns alternatives or hive refresh

2016-04-15 Thread Maurin Lenglart
the df * Then I use df.insertInto myTable. I also migrated from parquet to ORC, not sure if this has an impact or not. Thank you for your help. From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 10, 2016 at 11:54 PM To: maurin lenglart <mau...@cuberonlabs.com>

Re: alter table add columns alternatives or hive refresh

2016-04-11 Thread Maurin Lenglart
I will try that next weekend. Thank you for your answers. From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 10, 2016 at 11:54 PM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user @spark" <user@spark.apache.org>

Re: alter table add columns alternatives or hive refresh

2016-04-10 Thread Maurin Lenglart
s Hive. You can add new rows including values for the new column but cannot update the null values. Will this work for you? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: alter table add columns alternatives or hive refresh

2016-04-10 Thread Maurin Lenglart
options that will allow me not to move TB of data every day? Thanks for your answer From: Mich Talebzadeh <mich.talebza...@gmail.com> Date: Sunday, April 10, 2016 at 3:41 AM To: maurin lenglart <mau...@cuberonlabs.com> Cc: "user@spark.apache.org"

alter table add columns alternatives or hive refresh

2016-04-09 Thread Maurin Lenglart
Hi, I am trying to add columns to a table that I created with the “saveAsTable” api. I update the columns using sqlContext.sql(‘alter table myTable add columns (mycol string)’). The next time I create a df and save it in the same table, with the new columns, I get a: “ParquetRelation requires that
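One hedged workaround for this mismatch: Parquet can merge the old and new schemas on read, after which the table can be re-registered with the union schema (the warehouse path and the new table name are hypothetical; mergeSchema is a standard Parquet read option):

    # Read all existing files, reconciling old and new column sets.
    merged = (sqlContext.read
              .option("mergeSchema", "true")
              .parquet("/user/hive/warehouse/mytable"))  # hypothetical path
    merged.write.mode("overwrite").saveAsTable("myTable_v2")

This sidesteps ALTER TABLE entirely, at the cost of rewriting the table once.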

Re: Sample sql query using pyspark

2016-03-01 Thread Maurin Lenglart
1) Result = sample.groupBy("Category").agg(sum("bookings"), sum("dealviews")) Thanks for your answer. From: James Barney <jamesbarne...@gmail.com> Date: Tuesday, March 1, 2016 at 7:01 AM To: maurin lenglart <mau...@cuberonlabs.com> Cc:

Sample sql query using pyspark

2016-03-01 Thread Maurin Lenglart
Hi, I am trying to get a sample of a sql query in order to make the query run faster. My query looks like this: SELECT `Category` as `Category`,sum(`bookings`) as `bookings`,sum(`dealviews`) as `dealviews` FROM groupon_dropbox WHERE `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19' GROUP BY
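A sketch of the sampling step before the aggregation (the 10% fraction and seed are arbitrary; the table and column names are from the thread):

    # Sample without replacement, then run the same aggregation on the sample.
    sample = sqlContext.table("groupon_dropbox").sample(False, 0.1, seed=42)
    sample.registerTempTable("groupon_dropbox_sample")
    result = sqlContext.sql(
        "SELECT `Category`, sum(`bookings`) AS `bookings`, "
        "sum(`dealviews`) AS `dealviews` FROM groupon_dropbox_sample "
        "WHERE `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19' "
        "GROUP BY `Category`")

Sums computed on a sample understate the true totals by roughly the sampling fraction, so multiplying by 1/0.1 gives the usual rough estimate of the full-table result.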

_metadata file throwing a "GC overhead limit exceeded" after a write

2016-02-12 Thread Maurin Lenglart
Hi, I am currently using spark in python. I have my master, worker and driver on the same machine in different dockers. I am using spark 1.6. The configuration that I am using looks like this: CONFIG["spark.executor.memory"] = "100g" CONFIG["spark.executor.cores"] = "11" CONFIG["spark.cores.max"
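The _metadata summary file is assembled from the footers of every Parquet part-file, which is a known source of memory pressure on large writes. A commonly suggested mitigation, sketched here, is to disable summary-file generation (the setting name is the standard parquet-mr one; verify it against the Spark/Parquet versions in play):

    # Disable Parquet summary-metadata generation before writing.
    sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

With the summary files off, readers fall back to reading footers from the part-files themselves.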