OutOfDirectMemoryError for Spark 2.2

2018-03-05 Thread Chawla,Sumit
Hi All

I have a job which processes a large dataset.  All items in the dataset are
unrelated.  To save on cluster resources, I process these items in chunks.
Since the chunks are independent of each other, I start and shut down the
Spark context for each chunk.  This allows me to keep the DAG smaller and to
avoid retrying the entire DAG in case of failures.  This mechanism used to
work fine with Spark 1.6.  Now, as we have moved to 2.2, the job has started
failing with an OutOfDirectMemoryError.
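
For context, here is a minimal sketch of the per-chunk pattern described
above (the chunk list, paths, and per-item work are hypothetical, not the
actual job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical chunk identifiers; in the real job each chunk is an
    // independent slice of the large dataset.
    val chunks: Seq[String] = Seq("chunk-001", "chunk-002", "chunk-003")

    chunks.foreach { chunk =>
      // A fresh context per chunk keeps the DAG small and limits retry scope.
      // Master and other settings are assumed to come from spark-submit.
      val sc = new SparkContext(new SparkConf().setAppName(s"process-$chunk"))
      try {
        val result = sc.textFile(s"/data/$chunk")   // hypothetical input path
          .map(_.length)                            // placeholder per-item work
          .sum()
        println(s"$chunk -> $result")
      } finally {
        sc.stop()                                   // shut down before the next chunk
      }
    }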

2018-03-03 22:00:59,687 WARN  [rpc-server-48-1] server.TransportChannelHandler
(TransportChannelHandler.java:exceptionCaught(78)) - Exception in connection
from /10.66.73.27:60374

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 8388608
byte(s) of direct memory (used: 1023410176, max: 1029177344)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:506)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:460)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:690)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
    at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
    at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)

I got some clues about what is causing this from
https://github.com/netty/netty/issues/6343.  However, I am not able to make
the numbers add up to explain what is filling 1 GB of direct memory.

Output from jmap (class histogram: rank, #instances, #bytes, class name):

    7:  22230  1422720  io.netty.buffer.PoolSubpage
   12:   1370   804640  io.netty.buffer.PoolSubpage[]
   41:   3600   144000  io.netty.buffer.PoolChunkList
   98:   1440    46080  io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache
  113:    300    40800  io.netty.buffer.PoolArena$HeapArena
  114:    300    40800  io.netty.buffer.PoolArena$DirectArena
  192:    198    15840  io.netty.buffer.PoolChunk
  274:    120     8320  io.netty.buffer.PoolThreadCache$MemoryRegionCache[]
  406:    120     3840  io.netty.buffer.PoolThreadCache$NormalMemoryRegionCache
  422:     72     3552  io.netty.buffer.PoolArena[]
  458:     30     2640  io.netty.buffer.PooledUnsafeDirectByteBuf
  500:     36     2016  io.netty.buffer.PooledByteBufAllocator
  529:     32     1792  io.netty.buffer.UnpooledUnsafeHeapByteBuf
  589:     20     1440  io.netty.buffer.PoolThreadCache
  630:     37     1184  io.netty.buffer.EmptyByteBuf
  703:     36      864  io.netty.buffer.PooledByteBufAllocator$PoolThreadLocalCache
  852:     22      528  io.netty.buffer.AdvancedLeakAwareByteBuf
  889:     10      480  io.netty.buffer.SlicedAbstractByteBuf
  917:      8      448  io.netty.buffer.UnpooledHeapByteBuf
 1018:     20      320  io.netty.buffer.PoolThreadCache$1
 1305:      4      128  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
 1404:      1       80  io.netty.buffer.PooledUnsafeHeapByteBuf
 1473:      3       72  io.netty.buffer.PoolArena$SizeClass
 1529:      1       64  io.netty.buffer.AdvancedLeakAwareCompositeByteBuf
 1541:      2       64  io.netty.buffer.CompositeByteBuf$Component
 1568:      1       56  io.netty.buffer.CompositeByteBuf
 1896:      1       32  io.netty.buffer.PoolArena$SizeClass[]
 2042:      1       24  io.netty.buffer.PooledUnsafeDirectByteBuf$1
 2046:      1       24  io.netty.buffer.UnpooledByteBufAllocator
 2051:      1       24  io.netty.buffer.PoolThreadCache$MemoryRegionCache$1
 2078:      1       24  io.netty.buffer.PooledHeapByteBuf$1
 2135:      1       24  io.netty.buffer.PooledUnsafeHeapByteBuf$1
 2302:      1       16  io.netty.buffer.ByteBufUtil$1
 2769:      1       16  io.netty.util.internal.__matchers__.io.netty.buffer.ByteBufMatcher



My driver machine has 32 CPUs, and as of now I have 15 machines in my
cluster.  Currently the error happens while processing the 5th or 6th chunk.
I suspect the error depends on the number of executors and would happen
earlier if we added more executors.


I am trying to come up with an explanation of what is filling up the direct
memory and how to quantify it as a factor of the number of executors.  Ours
is a shared cluster, and we need to understand how much driver memory to
allocate for most of the jobs.
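
For reference, a sketch of the standard knobs for capping or avoiding
Netty's direct buffers (the values are illustrative; this is a sketch, not a
diagnosis of the failure above):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Raise the JVM's direct-memory cap on the driver; when unset it
      // roughly tracks the max heap size, which matches the ~1 GB limit in
      // the error above. Driver JVM options normally need to be supplied at
      // launch, e.g. via spark-submit --conf or spark-defaults.conf.
      .set("spark.driver.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
      // Documented Spark option (default true): ask the Netty-based shuffle
      // transport to prefer on-heap buffers instead of direct buffers.
      .set("spark.shuffle.io.preferDirectBufs", "false")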





Regards
Sumit Chawla


Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
Oh, I didn't know about that. I think that will do the trick.

Would you happen to know what setting I need? I'm looking here, but it's a
bit overwhelming. I'm basically looking for a way to set the overall Ivy log
level to WARN or higher.

Nick

On Mon, Mar 5, 2018 at 2:11 PM Bryan Cutler  wrote:

> Hi Nick,
>
> Not sure about changing the default to warnings only because I think some
> might find the resolution output useful, but you can specify your own ivy
> settings file with "spark.jars.ivySettings" to point to your
> ivysettings.xml file.  Would that work for you to configure it there?
>
> Bryan
>
> On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I couldn’t get an answer anywhere else, so I thought I’d ask here.
>>
>> Is there a way to silence the messages that come from Ivy when you call
>> spark-submit with --packages? (For the record, I asked this question on
>> Stack Overflow.)
>>
>> Would it be a good idea to configure Ivy by default to only output
>> warnings or errors?
>>
>> Nick
>> ​
>>
>
>


Re: Welcoming some new committers

2018-03-05 Thread Seth Hendrickson
Thanks all! :D

On Mon, Mar 5, 2018 at 9:01 AM, Bryan Cutler  wrote:

> Thanks everyone, this is very exciting!  I'm looking forward to working
> with you all and helping out more in the future.  Also, congrats to the
> other committers as well!!
>


Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-05 Thread Reynold Xin
Rather than using a separate thread pool, perhaps we can just move the prep
code to the call site thread?


On Sun, Mar 4, 2018 at 11:15 PM, Ajith shetty 
wrote:

> DAGScheduler becomes a bottleneck in the cluster when multiple JobSubmitted
> events have to be processed, as DAGSchedulerEventProcessLoop is
> single-threaded and blocks other events in the queue, such as TaskCompletion.
>
> The JobSubmitted event can be time-consuming depending on the nature of the
> job (for example, calculating parent stage dependencies, shuffle
> dependencies, and partitions), and thus it blocks all other events from
> being processed.
>
>
>
> I see multiple JIRAs referring to this behavior:
>
> https://issues.apache.org/jira/browse/SPARK-2647
>
> https://issues.apache.org/jira/browse/SPARK-4961
>
>
>
> Similarly, in my cluster the partition calculation for some jobs is time
> consuming (similar to the stack in SPARK-2647), which slows down the
> DAGSchedulerEventProcessLoop and causes user jobs to slow down even if their
> tasks finish within seconds, because TaskCompletion events are processed at
> a slower rate due to the blockage.
>
>
>
> I think we can split a JobSubmitted event into two events:
>
> Step 1. JobSubmittedPreparation - runs in a separate thread on job
> submission; this would cover steps such as
> org.apache.spark.scheduler.DAGScheduler#createResultStage.
>
> Step 2. JobSubmittedExecution - if Step 1 succeeds, fire an event to
> DAGSchedulerEventProcessLoop and let it process the output of
> org.apache.spark.scheduler.DAGScheduler#createResultStage.
>
>
>
> One effect of doing this may be that job submissions are no longer FIFO,
> depending on how much time Step 1 above takes.
>
>
>
> Does the above solution suffice for the problem described? Are there any
> other side effects of this solution?
>
>
>
> Regards
>
> Ajith
>
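
For discussion, a minimal sketch (with hypothetical stand-in types, not the
actual Spark internals) of the two-step split proposed above: the expensive
preparation runs on a separate pool, and only its result is handed back to
the single-threaded event loop:

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // Hypothetical stand-ins for the real scheduler types.
    case class ResultStage(id: Int)
    sealed trait DAGSchedulerEvent
    case class JobSubmittedExecution(stage: ResultStage) extends DAGSchedulerEvent

    class EventLoopSketch {
      def post(event: DAGSchedulerEvent): Unit = println(s"posted: $event")
    }

    object JobSubmissionSketch {
      // Step 1 runs on a separate pool so the single-threaded event loop is
      // not blocked while parent/shuffle dependencies and partitions are
      // computed.
      private val prepPool =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

      def submit(jobId: Int, eventLoop: EventLoopSketch): Unit = {
        Future {
          // Stand-in for DAGScheduler#createResultStage and friends.
          ResultStage(jobId)
        }(prepPool).foreach { stage =>
          // Step 2: hand the prepared stage back to the event loop.
          eventLoop.post(JobSubmittedExecution(stage))
        }(prepPool)
      }
    }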


Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Bryan Cutler
Hi Nick,

Not sure about changing the default to warnings only because I think some
might find the resolution output useful, but you can specify your own ivy
settings file with "spark.jars.ivySettings" to point to your
ivysettings.xml file.  Would that work for you to configure it there?

Bryan

On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas  wrote:

> I couldn’t get an answer anywhere else, so I thought I’d ask here.
>
> Is there a way to silence the messages that come from Ivy when you call
> spark-submit with --packages? (For the record, I asked this question on
> Stack Overflow.)
>
> Would it be a good idea to configure Ivy by default to only output
> warnings or errors?
>
> Nick
> ​
>
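
For reference, a minimal sketch (the path and package coordinates are
illustrative) of naming the setting Bryan mentions from code. Note that
spark.jars.ivySettings and spark.jars.packages are consumed at submit time,
so in practice they are usually passed to spark-submit via --conf and
--packages, and the ivysettings.xml itself still has to define resolvers and
any logging behaviour:

    import org.apache.spark.sql.SparkSession

    // Sketch only: /etc/spark/ivysettings.xml is a hypothetical path.
    val spark = SparkSession.builder()
      .appName("ivy-settings-example")
      .config("spark.jars.ivySettings", "/etc/spark/ivysettings.xml")
      .config("spark.jars.packages", "org.apache.commons:commons-lang3:3.5")
      .getOrCreate()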


Spark+AI Summit 2018 - San Francisco June 4-6, 2018

2018-03-05 Thread Scott walent
Early Bird pricing ends on Friday.  Book now to save $200+

Full agenda is available: www.databricks.com/sparkaisummit


Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Anthony May
We use sbt for easy cross-project dependencies with multiple Scala versions
in a mono-repo, for which it is pretty good, albeit with some quirks. As our
projects have matured and change less, we have moved away from cross-project
dependencies, but it was extremely useful early in the projects. We knew
that a lot of this was possible in Maven/Gradle but did not want to go
through the hackery required to get it working.

On Mon, 5 Mar 2018 at 09:49 Sean Owen  wrote:

> Spark uses Maven as the primary build, but SBT works as well. It reads the
> Maven build to some extent.
>
> Zinc incremental compilation works with Maven (with the Scala plugin for
> Maven).
>
> Myself, I prefer Maven, for some of the reasons it is the main build in
> Spark: declarative builds end up being a good thing. You want builds very
> standard. I think the flexibility of writing code to express your build
> just gives a lot of rope to hang yourself with, and recalls the old days of
> Ant builds, where no two builds you'd encounter looked alike when doing the
> same thing.
>
> If by cross publishing you mean handling different scala versions, yeah
> SBT is more aware of that. The Spark Maven build manages to handle that
> with some hacking.
>
>
> On Mon, Mar 5, 2018 at 9:56 AM Jörn Franke  wrote:
>
>> I think most of the scala development in Spark happens with sbt - in the
>> open source world.
>>
>>  However, you can do it with Gradle and Maven as well. It depends on your
>> organization and what your standard is.
>>
>> Some things might be more cumbersome to reach in non-sbt Scala
>> scenarios, but this is improving more and more.
>>
>> > On 5. Mar 2018, at 16:47, Swapnil Shinde 
>> wrote:
>> >
>> > Hello
>> >    SBT's incremental compilation has been a huge plus for building Spark
>> > + Scala applications for some time. It seems Maven can also support
>> > incremental compilation with the Zinc server. Considering that, I am
>> > interested to know the community's experience:
>> >
>> > 1. The Spark documentation says SBT is used by many contributors for
>> > day-to-day development, mainly because of incremental compilation. Given
>> > that Maven supports incremental compilation through Zinc, do contributors
>> > prefer to change from SBT to Maven?
>> >
>> > 2. Any issues / learning experiences with Maven + Zinc?
>> >
>> > 3. Any other reasons to use SBT over Maven for Scala development?
>> >
>> > I understand SBT has many other advantages over Maven, like cross-version
>> > publishing, but incremental compilation is the major need for us. I am
>> > more interested to know why Spark contributors/committers prefer SBT for
>> > day-to-day development.
>> >
>> > Any help and advice would help us direct our evaluation in the right
>> > direction.
>> >
>> > Thanks
>> > Swapnil
>>


Re: Welcoming some new committers

2018-03-05 Thread Bryan Cutler
Thanks everyone, this is very exciting!  I'm looking forward to working
with you all and helping out more in the future.  Also, congrats to the
other committers as well!!


Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Sean Owen
Spark uses Maven as the primary build, but SBT works as well. It reads the
Maven build to some extent.

Zinc incremental compilation works with Maven (with the Scala plugin for
Maven).

Myself, I prefer Maven, for some of the reasons it is the main build in
Spark: declarative builds end up being a good thing. You want builds very
standard. I think the flexibility of writing code to express your build
just gives a lot of rope to hang yourself with, and recalls the old days of
Ant builds, where no two builds you'd encounter looked alike when doing the
same thing.

If by cross publishing you mean handling different scala versions, yeah SBT
is more aware of that. The Spark Maven build manages to handle that with
some hacking.


On Mon, Mar 5, 2018 at 9:56 AM Jörn Franke  wrote:

> I think most of the scala development in Spark happens with sbt - in the
> open source world.
>
>  However, you can do it with Gradle and Maven as well. It depends on your
> organization and what your standard is.
>
> Some things might be more cumbersome to reach in non-sbt Scala scenarios,
> but this is improving more and more.
>
> > On 5. Mar 2018, at 16:47, Swapnil Shinde 
> wrote:
> >
> > Hello
> >    SBT's incremental compilation has been a huge plus for building Spark +
> > Scala applications for some time. It seems Maven can also support
> > incremental compilation with the Zinc server. Considering that, I am
> > interested to know the community's experience:
> >
> > 1. The Spark documentation says SBT is used by many contributors for
> > day-to-day development, mainly because of incremental compilation. Given
> > that Maven supports incremental compilation through Zinc, do contributors
> > prefer to change from SBT to Maven?
> >
> > 2. Any issues / learning experiences with Maven + Zinc?
> >
> > 3. Any other reasons to use SBT over Maven for Scala development?
> >
> > I understand SBT has many other advantages over Maven, like cross-version
> > publishing, but incremental compilation is the major need for us. I am more
> > interested to know why Spark contributors/committers prefer SBT for
> > day-to-day development.
> >
> > Any help and advice would help us direct our evaluation in the right
> > direction.
> >
> > Thanks
> > Swapnil
>
>
>
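
For reference, a minimal build.sbt sketch of the cross-version handling
discussed above (versions are illustrative):

    // build.sbt: cross-building one module against two Scala versions.
    name := "example-app"

    scalaVersion := "2.11.12"
    crossScalaVersions := Seq("2.11.12", "2.12.4")

    // Dependencies declared with %% resolve the matching _2.11 / _2.12 artifact.
    libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.6"

    // Prefixing a task with '+' (e.g. `sbt +compile`, `sbt +publishLocal`)
    // runs it once per entry in crossScalaVersions.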


Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
I couldn’t get an answer anywhere else, so I thought I’d ask here.

Is there a way to silence the messages that come from Ivy when you call
spark-submit with --packages? (For the record, I asked this question on
Stack Overflow.)

Would it be a good idea to configure Ivy by default to only output warnings
or errors?

Nick
​


Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Jörn Franke
I think most of the Scala development in Spark happens with sbt, in the open
source world.

However, you can do it with Gradle and Maven as well. It depends on your
organization and what your standard is.

Some things might be more cumbersome to reach in non-sbt Scala scenarios, but
this is improving more and more.

> On 5. Mar 2018, at 16:47, Swapnil Shinde  wrote:
> 
> Hello
>    SBT's incremental compilation has been a huge plus for building Spark +
> Scala applications for some time. It seems Maven can also support
> incremental compilation with the Zinc server. Considering that, I am
> interested to know the community's experience:
> 
> 1. The Spark documentation says SBT is used by many contributors for
> day-to-day development, mainly because of incremental compilation. Given
> that Maven supports incremental compilation through Zinc, do contributors
> prefer to change from SBT to Maven?
> 
> 2. Any issues / learning experiences with Maven + Zinc?
> 
> 3. Any other reasons to use SBT over Maven for Scala development?
> 
> I understand SBT has many other advantages over Maven, like cross-version
> publishing, but incremental compilation is the major need for us. I am more
> interested to know why Spark contributors/committers prefer SBT for
> day-to-day development.
> 
> Any help and advice would help us direct our evaluation in the right
> direction.
> 
> Thanks
> Swapnil




CSV reader 2.2.0 issue

2018-03-05 Thread SNEHASISH DUTTA
Hi,

I am using the Spark 2.2 CSV reader.

I have data in the following format:

123|123|"abc"||""|"xyz"

The requirement is that || has to be treated as null, and "" has to be
treated as a blank string of length 0.

I was using the option sep as pipe and the option quote as "". I parsed the
data and, using a regex, was able to fulfill all the mentioned conditions.
It started failing when column values themselves contained the separator,
e.g. "|": since the separator itself became a column value, the Spark CSV
reader split on it and produced extra columns.

After this I used the escape option on "|", but the results were similar.

I then tried a Dataset with split on "\\|", which had a similar outcome.

Is there any way to resolve this with the CSV reader?


Thanks and Regards,
Snehasish
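
For comparison, a minimal sketch of the reader options being described (the
input path is hypothetical); whether a quoted field that contains the
separator, e.g. "|", survives intact is exactly the behaviour in question
here:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pipe-csv-sketch").getOrCreate()

    val df = spark.read
      .option("sep", "|")           // pipe as the field separator
      .option("quote", "\"")        // fields wrapped in double quotes
      .option("escape", "\\")       // escape character inside quoted fields
      .option("nullValue", "")      // assumption: unquoted empty (||) -> null
      .csv("/data/pipe_delimited.txt")  // hypothetical input path

    df.show(truncate = false)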