park but you'll probably have a
> better time handling duplicates in the service that reads from Kafka.
>
> On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart
> wrote:
Hi,
we are trying to build a Spark Streaming solution that subscribes to and pushes to
Kafka.
But we are running into the problem of duplicate events.
Right now, I am doing a “foreachRDD”, looping over the messages of each
partition, and sending those messages to Kafka.
Is there any good way of solving that?
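The reply above suggests handling duplicates in the service that reads from Kafka. A minimal pure-Python sketch of that idea — an idempotent consumer that drops events whose key it has already seen (the `event_id` field name is illustrative, not from the thread):

```python
def dedupe(events, seen=None):
    """Yield each event at most once, keyed on its unique id.

    `events` is any iterable of dicts carrying an "event_id" field
    (hypothetical name). `seen` can be passed in to share state
    across batches.
    """
    seen = set() if seen is None else seen
    for event in events:
        key = event["event_id"]
        if key not in seen:
            seen.add(key)
            yield event
```

In a real deployment the `seen` set would live in an external store (or the sink would be a keyed upsert), since an in-process set does not survive restarts.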
Hi,
I am trying to load a JSON file compressed in .tar.bz2, but Spark throws an error.
I am using PySpark with Spark 1.6.2 (Cloudera 5.9).
What would be the best way to handle that?
I don’t want to have a non-Spark job that just uncompresses the data…
thanks
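Spark has no built-in reader for tar archives (the .bz2 compression is not the problem; the tar container is). One workaround that stays inside the driver process is to unpack the archive with the stdlib `tarfile` module and hand the parsed records to Spark afterwards (e.g. via `sc.parallelize`). A sketch, assuming the archive contains newline-delimited JSON members:

```python
import json
import tarfile

def json_records_from_tar(path):
    """Yield parsed records from every *.json member of a .tar.bz2 archive.

    The resulting records could then be handed to Spark with
    sc.parallelize(list(json_records_from_tar(path))).
    """
    with tarfile.open(path, "r:bz2") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".json"):
                f = tar.extractfile(member)
                for line in f:
                    yield json.loads(line.decode("utf-8"))
```

This reads the whole archive on one machine, so it only makes sense for archives that fit there; for large data, repacking to plain .json.bz2 files (which Spark can read directly) is the usual route.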
Hi,
Is there a way to estimate the size of a DataFrame in Python?
Something similar to
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html
?
thanks
Hi,
I am doing a SQL query that returns a DataFrame. Then I am writing the result of
the query using “df.write”, but the result gets written in a lot of different
small files (~100 files of 200 KB). So now I am doing a “.coalesce(2)” before the
write.
But the number “2” that I picked is static; is there
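Instead of a hardcoded `coalesce(2)`, the partition count can be derived from an estimated output size and a target file size (~128 MB is a common HDFS-friendly target; the names below are illustrative):

```python
import math

def partitions_for(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """How many output files to aim for, given an estimated total size.

    The result could feed df.coalesce(n) instead of a hardcoded constant.
    Always returns at least 1.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))
```

The estimate for `total_bytes` could come from a sampling approach like the one discussed earlier in this thread list.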
Same here
From: Benjamin Kim
Date: Wednesday, July 13, 2016 at 11:47 AM
To: manish ranjan
Cc: user
Subject: Re: Spark Website
It takes me to the directories instead of the webpage.
On Jul 13, 2016, at 11:45 AM, manish ranjan
<cse1.man...@gmail.com> wrote:
working for me. What do you
7; AND `event_date`
<= '2016-04-02' GROUP BY `event_date` LIMIT 2”) takes 8 seconds.
thanks
From: Mich Talebzadeh
<mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:52 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark"
eh
<mich.talebza...@gmail.com>
Date: Sunday, April 17, 2016 at 2:22 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow
Hi Maurin,
Have you tried to create
;
Date: Saturday, April 16, 2016 at 4:14 AM
To: maurin lenglart <mau...@cuberonlabs.com>,
"user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc is really slow
Apologies that should read
desc formatted
Example for table dummy
hive
using the latest release of Cloudera and I didn’t modify any version. Do
you think that I should try to manually update Hive?
thanks
From: Jörn Franke <jornfra...@gmail.com>
Date: Saturday, April 16, 2016 at 1:02 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc:
u for your answer.
From: Mich Talebzadeh
<mich.talebza...@gmail.com>
Date: Saturday, April 16, 2016 at 12:32 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: orc vs parquet aggregation, orc
Hi,
I am executing one query:
“SELECT `event_date` as `event_date`,sum(`bookings`) as
`bookings`,sum(`dealviews`) as `dealviews` FROM myTable WHERE `event_date` >=
'2016-01-06' AND `event_date` <= '2016-04-02' GROUP BY `event_date` LIMIT 2”
My table was created with something like:
CREATE TA
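For reference, the shape of the aggregation above can be reproduced on a tiny stand-in table. The sketch below uses in-memory SQLite purely to illustrate the query (stand-in data; it says nothing about ORC vs Parquet performance in Spark/Hive):

```python
import sqlite3

# Stand-in table with the same columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (event_date TEXT, bookings REAL, dealviews REAL)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [
        ("2016-01-06", 2.0, 10.0),
        ("2016-01-06", 3.0, 5.0),
        ("2016-01-07", 1.0, 4.0),
    ],
)
# Same filter / group-by / limit structure as the query in the message.
rows = conn.execute(
    "SELECT event_date, SUM(bookings), SUM(dealviews) FROM myTable "
    "WHERE event_date >= '2016-01-06' AND event_date <= '2016-04-02' "
    "GROUP BY event_date LIMIT 2"
).fetchall()
```

Since both the filter and the GROUP BY are on `event_date`, partitioning the real table by that column would let the engine prune partitions before aggregating — likely relevant to the slowness discussed in this thread.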
he df
* Then I use df.insertInto myTable
I also migrated from Parquet to ORC, not sure if this has an impact or not.
Thank you for your help.
From: Mich Talebzadeh
<mich.talebza...@gmail.com>
Date: Sunday, April 10, 2016 at 11:54 PM
To: maurin lenglart <mau...@cuberonlabs.com>
I will try that during the next weekend.
Thank you for your answers.
From: Mich Talebzadeh
<mich.talebza...@gmail.com>
Date: Sunday, April 10, 2016 at 11:54 PM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user @spark" <user@spark.apache.org>
s Hive. You can add new rows including values for the new column but
cannot update the null values. Will this work for you?
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
options that will allow me not to move TBs of data every day?
Thanks for your answer
From: Mich Talebzadeh
<mich.talebza...@gmail.com>
Date: Sunday, April 10, 2016 at 3:41 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: "user@spark.apache.org"
Hi,
I am trying to add columns to a table that I created with the “saveAsTable” API.
I update the columns using sqlContext.sql(‘alter table myTable add columns
(mycol string)’).
The next time I create a df and save it in the same table with the new columns,
I get a:
“ParquetRelation
requires that
1)
Result = sample.groupBy("Category").agg(sum("bookings"), sum("dealviews"))
Thanks for your answer.
From: James Barney <jamesbarne...@gmail.com>
Date: Tuesday, March 1, 2016 at 7:01 AM
To: maurin lenglart <mau...@cuberonlabs.com>
Cc: &qu
Hi,
I am trying to take a sample of a SQL query to make the query run faster.
My query looks like this:
SELECT `Category` as `Category`,sum(`bookings`) as `bookings`,sum(`dealviews`)
as `dealviews` FROM groupon_dropbox WHERE `event_date` >= '2015-11-14' AND
`event_date` <= '2016-02-19' GROUP B
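One way to make such an aggregation cheaper, as discussed in the reply above, is to aggregate over a sample and scale the sums back up by the inverse sampling fraction. A deterministic pure-Python sketch of that idea (every name here is illustrative; in PySpark the sample would come from `df.sample(fraction=1.0/k)` and the scaling logic is the same):

```python
def approx_sums(rows, key, fields, k=10):
    """Approximate per-key sums by aggregating every k-th row and
    scaling the result by k.

    A deterministic stand-in for random sampling: `rows` is a list of
    dicts, `key` the grouping column, `fields` the columns to sum.
    """
    totals = {}
    for row in rows[::k]:
        bucket = totals.setdefault(row[key], {f: 0 for f in fields})
        for f in fields:
            bucket[f] += row[f]
    # Scale sampled sums back up to estimate the full-data sums.
    return {g: {f: v * k for f, v in sums.items()} for g, sums in totals.items()}
```

Note that sums and counts scale this way; averages and distinct counts do not, so the trick only fits additive aggregates like the `sum(...)` calls in the query above.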
Hi,
I am currently using Spark in Python. I have my master, worker and driver on
the same machine in different Docker containers. I am using Spark 1.6.
The configuration that I am using looks like this:
CONFIG["spark.executor.memory"] = "100g"
CONFIG["spark.executor.cores"] = "11"
CONFIG["spark.cores.max"