Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Xiao Li
Try to clear your browsing data or use a different web browser.

Enjoy it,

Xiao

On Thu, Nov 8, 2018 at 4:15 PM Reynold Xin  wrote:

> Do you have a cached copy? I see it here
>
> http://spark.apache.org/downloads.html
>
>
>
> On Thu, Nov 8, 2018 at 4:12 PM Li Gao  wrote:
>
>> this is wonderful !
>> I noticed the official spark download site does not have 2.4 download
>> links yet.
>>
>> On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde wrote:
>>
>>> Great news.. thank you very much!
>>>
>>> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:
>>>
 Awesome!

 On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji 
 wrote:

> Indeed!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
> wrote:
>
> Finally, thank you all. Especially, thanks to the release manager,
> Wenchen!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan 
> wrote:
>
>> + user list
>>
>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan 
>> wrote:
>>
>>> resend
>>>
>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>>> wrote:
>>>


 -- Forwarded message -
 From: Wenchen Fan 
 Date: Thu, Nov 8, 2018 at 10:55 PM
 Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
 To: Spark dev list 


 Hi all,

 Apache Spark 2.4.0 is the fifth release in the 2.x line. This
 release adds Barrier Execution Mode for better integration with deep
 learning frameworks, introduces 30+ built-in and higher-order functions
 to make working with complex data types easier, improves the Kubernetes
 (K8s) integration, and adds experimental Scala 2.12 support. Other major
 updates include a built-in Avro data source, an image data source,
 flexible streaming sinks, elimination of the 2GB block-size limitation
 during transfer, and Pandas UDF improvements. In addition, this release
 continues to focus on usability, stability, and polish while resolving
 around 1100 tickets.

 We'd like to thank our contributors and users for their contributions
 and early feedback to this release. This release would not have been
 possible without you.

 To download Spark 2.4.0, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-2-4-0.html

 Thanks,
 Wenchen

 PS: If you see any issues with the release notes, webpage or
 published artifacts, please contact me directly off-list.




Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Reynold Xin
Do you have a cached copy? I see it here

http://spark.apache.org/downloads.html



On Thu, Nov 8, 2018 at 4:12 PM Li Gao  wrote:

> this is wonderful !
> I noticed the official spark download site does not have 2.4 download
> links yet.


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Li Gao
this is wonderful !
I noticed the official spark download site does not have 2.4 download links
yet.

On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde wrote:

> Great news.. thank you very much!


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Swapnil Shinde
Great news.. thank you very much!

On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> Awesome!


Spark event logging with s3a

2018-11-08 Thread David Hesson
We are trying to use spark event logging with s3a as a destination for event 
data.

We added these settings to our spark-submit invocations:

spark.eventLog.dir s3a://ourbucket/sparkHistoryServer/eventLogs
spark.eventLog.enabled true
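
For reference, the same settings can also be passed directly on the command
line. A minimal sketch of such an invocation (the application class and jar
names are hypothetical placeholders, not from our actual jobs):

```shell
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://ourbucket/sparkHistoryServer/eventLogs \
  --class com.example.OurJob \
  our-job.jar
```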

Everything works fine with smaller jobs, and we can see the history data in
the history server, which also uses s3a. However, when we tried a job with a
few hundred gigs of data that goes through multiple stages, it died with an
OOM exception (the same job works fine with spark.eventLog.enabled set to
false):

18/10/22 23:07:22 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)

Full stack trace: 
https://gist.github.com/davidhesson/bd64a25f04c6bb241ec398f5383d671c

Does anyone have any insight or experience with using the Spark history server
with s3a? Is this problem perhaps being caused by something else in our
configs? Any help would be appreciated.


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Stavros Kontopoulos
Awesome!

On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:

> Indeed!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
> wrote:
>
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>
>> + user list
>>
>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>
>>> resend
>>>
>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>>


 -- Forwarded message -
 From: Wenchen Fan 
 Date: Thu, Nov 8, 2018 at 10:55 PM
 Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
 To: Spark dev list 


 Hi all,

 Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
 adds Barrier Execution Mode for better integration with deep learning
 frameworks, introduces 30+ built-in and higher-order functions to deal with
 complex data type easier, improves the K8s integration, along with
 experimental Scala 2.12 support. Other major updates include the built-in
 Avro data source, Image data source, flexible streaming sinks, elimination
 of the 2GB block size limitation during transfer, Pandas UDF improvements.
 In addition, this release continues to focus on usability, stability, and
 polish while resolving around 1100 tickets.

 We'd like to thank our contributors and users for their contributions
 and early feedback to this release. This release would not have been
 possible without you.

 To download Spark 2.4.0, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes: https://spark.apache.org/
 releases/spark-release-2-4-0.html

 Thanks,
 Wenchen

 PS: If you see any issues with the release notes, webpage or published
 artifacts, please contact me directly off-list.

>>>


Is dataframe write blocking? what can be done for fair scheduler?

2018-11-08 Thread ramannan...@gmail.com
Hi,

I have noticed that in a fair-scheduler setting, if I block on a DataFrame
write to complete using Await.result, the API call ends up returning before
the write is done, which is not what I intend, as it can cause inconsistencies
later in the pipeline. Is there a way to make the DataFrame write call
blocking?


Regards,
Ramandeep Singh



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to increase the parallelism of Spark Streaming application?

2018-11-08 Thread JF Chen
Yes, I have now allocated 100 cores and 8 Kafka partitions, and I then
repartition the stream to 100 to feed the 100 cores. In the following stage I
have a map operation; will it also cause a slowdown?
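
The repartition-then-map pattern under discussion can be sketched roughly as
follows (the broker address, topic, group id, batch interval, and output path
are hypothetical placeholders). Note that a narrow transformation such as map
simply inherits the 100 partitions produced by repartition, so it runs with
the same task parallelism rather than adding another shuffle:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-sketch")
    val ssc = new StreamingContext(conf, Seconds(60)) // illustrative batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",             // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group")                    // hypothetical group id

    // The direct stream gets one Spark partition per Kafka partition (8 here).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream
      .map(_.value)
      .repartition(100)                   // one shuffle to use all 100 cores
      .map(_.toUpperCase)                 // runs as 100 tasks; no extra shuffle
      .saveAsTextFiles("hdfs:///tmp/out") // hypothetical output prefix

    ssc.start()
    ssc.awaitTermination()
  }
}
```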

Regard,
Junfeng Chen


On Thu, Nov 8, 2018 at 12:34 AM Shahbaz  wrote:

> Hi,
>
>    - Do you have adequate CPU cores allocated to handle the increased
>    partitions? Generally, if you have Kafka partitions >= total CPU cores
>    (number of executor instances * cores per executor), you get increased
>    task parallelism for the reader phase.
>    - However, if you have too many partitions but not enough cores, it
>    would eventually slow down the reader (e.g., 100 partitions and only 20
>    total cores).
>    - Additionally, the next set of transformations will have their own
>    partitions; if a shuffle is involved, spark.sql.shuffle.partitions then
>    defines the next level of parallelism. If you do not have any data skew,
>    you should get good performance.
>
>
> Regards,
> Shahbaz
>
> On Wed, Nov 7, 2018 at 12:58 PM JF Chen  wrote:
>
>> I have a Spark Streaming application which reads data from Kafka and saves
>> the transformation result to HDFS.
>> The original partition number of my Kafka topic is 8, and I repartition the
>> data to 100 to increase the parallelism of the Spark job.
>> Now I am wondering: if I increase the Kafka partition number to 100 instead
>> of repartitioning to 100, will the performance be enhanced? (I know the
>> repartition action costs a lot of CPU resource.)
>> If I set the Kafka partition number to 100, does it have any negative
>> effects?
>> I just have one production environment, so it's not convenient for me to
>> run the test.
>>
>> Thanks!
>>
>> Regard,
>> Junfeng Chen
>>
>


Is Dataframe write blocking?

2018-11-08 Thread Ramandeep Singh Nanda
HI,

I have some futures set up to operate in stages, where I expect one stage to
complete before another begins. I was hoping that the DataFrame write call is
blocking, whereas the behavior I see is that the call returns before the data
is persisted. This can cause unintended consequences. I am also using the
fair scheduler so that independent jobs can run in parallel.
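
A minimal sketch of the pattern in question (Spark 2.3.x-style API; the paths
and data are hypothetical placeholders). The write itself blocks the thread
that calls it; it only appears asynchronous when wrapped in a Future, in which
case the caller has to await that future before starting the next stage:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession

object WriteStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-stages").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id")

    // Stage 1: the write runs inside a Future so other jobs can proceed.
    val stage1: Future[Unit] = Future {
      df.write.mode("overwrite").parquet("/tmp/stage1-output") // hypothetical path
    }

    // Explicitly block until stage 1 has finished before starting stage 2.
    Await.result(stage1, Duration.Inf)

    // Stage 2 can now safely read what stage 1 persisted.
    spark.read.parquet("/tmp/stage1-output").count()
    spark.stop()
  }
}
```

With the fair scheduler, jobs submitted from different threads run in
parallel; ordering between dependent stages still has to be enforced by
awaiting the corresponding futures.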

Spark 2.3.1.

-- 
Regards,
Ramandeep Singh
http://orastack.com


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Jules Damji
Indeed! 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun  wrote:
> 
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
> 
> Bests,
> Dongjoon.
> 


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Dongjoon Hyun
Finally, thank you all. Especially, thanks to the release manager, Wenchen!

Bests,
Dongjoon.


On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:

> + user list


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
+ user list

On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:

> resend
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Marcelo Vanzin
+user@




-- 
Marcelo




[no subject]

2018-11-08 Thread JF Chen
I am working on a Spark Streaming application, and I want it to re-read its
configuration from MongoDB every hour, where the batch interval is 10 minutes.
Is this practicable? As far as I know, Spark Streaming batches are tied to the
DStream; how can I implement this function, which seems unrelated to the
DStream data?
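
One common pattern is to cache the configuration on the driver and refresh it
when it is more than an hour old, from inside foreachRDD (whose body runs on
the driver once per batch). A minimal sketch; the MongoDB lookup is stubbed
out as a hypothetical loadConfig(), since the actual client code is not shown:

```scala
import org.apache.spark.streaming.dstream.DStream

object ConfigRefresh {
  private val refreshIntervalMs = 60 * 60 * 1000L // one hour
  @volatile private var cachedConfig: Map[String, String] = Map.empty
  @volatile private var lastLoadedMs = 0L

  // Hypothetical placeholder for the real MongoDB query.
  private def loadConfig(): Map[String, String] = Map("threshold" -> "10")

  // Returns the cached config, reloading it at most once per hour.
  def currentConfig(): Map[String, String] = synchronized {
    val now = System.currentTimeMillis()
    if (now - lastLoadedMs > refreshIntervalMs) {
      cachedConfig = loadConfig()
      lastLoadedMs = now
    }
    cachedConfig
  }

  def process(stream: DStream[String]): Unit = {
    stream.foreachRDD { rdd =>
      // Runs on the driver every batch (e.g. every 10 minutes);
      // the reload itself happens at most once per hour.
      val config = currentConfig()
      rdd.foreachPartition { records =>
        records.foreach { record => /* apply `config` to each record */ }
      }
    }
  }
}
```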


Regard,
Junfeng Chen


StorageLevel: OffHeap

2018-11-08 Thread Jack Kolokasis

Hello everyone,
    I am running a simple word count in Spark, and I persist my RDDs using
StorageLevel.OFF_HEAP. While the application is running, I see through the
Spark Web UI that they are persisted on disk. Why does this happen?

Can anyone tell me how the off-heap storage level works?
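
For the OFF_HEAP storage level to actually place blocks in off-heap memory,
off-heap allocation has to be enabled and sized in the Spark configuration. A
minimal sketch (the size and input path are illustrative assumptions, not
recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OffHeapWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("offheap-wordcount")
      // Off-heap storage requires both of these settings; without them,
      // blocks persisted with OFF_HEAP may land elsewhere (e.g. on disk).
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "1g") // illustrative size

    val sc = new SparkContext(conf)
    val counts = sc.textFile("input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.OFF_HEAP)

    counts.count() // materialize so the persisted blocks show up in the Web UI
    sc.stop()
  }
}
```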

Thanks for your help,
--Iacovos Kolokasis




Re: How to increase the parallelism of Spark Streaming application?

2018-11-08 Thread JF Chen
Hi,
I have tested it in my production environment, and I found a strange thing.
After I set the Kafka partition count to 100, some tasks execute very fast,
but some are slow. The slow ones take about double the time of the fast ones
(from the event timeline). However, I have checked the consumer offsets, and
the data amount for each task should be similar, so there should be no
imbalance problem. Does anyone have a good idea?

Regard,
Junfeng Chen


On Thu, Nov 8, 2018 at 12:34 AM Shahbaz  wrote:

> Hi ,
>
>- Do you have adequate CPU cores allocated to handle increased
>partitions ,generally if you have Kafka partitions >=(greater than or equal
>to) CPU Cores Total (Number of Executor Instances * Per Executor Core)
>,gives increased task parallelism for reader phase.
>- However if you have too many partitions but not enough cores ,it
>would eventually slow down the reader (Ex: 100 Partitions and only 20 Total
>Cores).
>- Additionally ,the next set of transformation will have there own
>partitions ,if its involving  shuffle ,sq.shuffle.partitions then defines
>next level of parallelism ,if you are not having any data skew,then you
>should get good performance.
>
>
> Regards,
> Shahbaz
>
> On Wed, Nov 7, 2018 at 12:58 PM JF Chen  wrote:
>
>> I have a Spark Streaming application which reads data from kafka and save
>> the the transformation result to hdfs.
>> My original partition number of kafka topic is 8, and repartition the
>> data to 100 to increase the parallelism of spark job.
>> Now I am wondering if I increase the kafka partition number to 100
>> instead of setting repartition to 100, will the performance be enhanced? (I
>> know repartition action cost a lot cpu resource)
>> If I set the kafka partition number to 100, does it have any negative
>> efficiency?
>> I just have one production environment so it's not convenient for me to
>> do the test
>>
>> Thanks!
>>
>> Regard,
>> Junfeng Chen
>>
>


Re: How to increase the parallelism of Spark Streaming application?

2018-11-08 Thread JF Chen
Memory is not a big problem for me... so are there any other bad effects?

Regard,
Junfeng Chen


On Wed, Nov 7, 2018 at 4:51 PM Michael Shtelma  wrote:

> If you configure too many Kafka partitions, you can run into memory issues.
> This will increase the memory requirements of the Spark job a lot.
>
> Best,
> Michael
>
>