Re: Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All,

It appears that the bottleneck in my job was the EBS volumes: very high I/O
wait times across the cluster. I was only using 1 volume; increasing to 4
made it faster.

Thanks,
Pradeep

On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota 
wrote:

> Hi All,
>
> I have a simple ETL job that reads some data, shuffles it and writes it
> back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.
>
> After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
> in the job. The CPU usage is low on the cluster, as is the network I/O.
> From the Spark Stats, I see large values for the Shuffle Read Blocked Time.
> As an example, one of my tasks completed in 18 minutes, but spent 15
> minutes waiting for remote reads.
>
> I'm not sure why the shuffle is so slow. Are there things I can do to
> increase the performance of the shuffle?
>
> Thanks,
> Pradeep
>


Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All,

I have a simple ETL job that reads some data, shuffles it and writes it
back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0.

After Stage 0 completes and the job starts Stage 1, I see a huge slowdown
in the job. The CPU usage is low on the cluster, as is the network I/O.
From the Spark Stats, I see large values for the Shuffle Read Blocked Time.
As an example, one of my tasks completed in 18 minutes, but spent 15
minutes waiting for remote reads.

I'm not sure why the shuffle is so slow. Are there things I can do to
increase the performance of the shuffle?

Thanks,
Pradeep
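
For reference, remote shuffle reads in Spark 2.1 can be tuned through a few
standard properties. A minimal sketch, assuming a Scala driver; the property
names are real Spark settings, but the values shown are illustrative
assumptions rather than recommendations from this thread (the fix reported
above was adding more EBS volumes):

import org.apache.spark.SparkConf

// Illustrative shuffle-read knobs; the values below are assumptions, not tuned numbers.
val conf = new SparkConf()
  .setAppName("shuffle-read-tuning-sketch")
  // Allow each reduce task to fetch more data in flight (default 48m).
  .set("spark.reducer.maxSizeInFlight", "96m")
  // Open more connections to each remote shuffle server (default 1).
  .set("spark.shuffle.io.numConnectionsPerPeer", "2")
  // Retry transient fetch failures a few more times (default 3).
  .set("spark.shuffle.io.maxRetries", "6")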


Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
Usually this kind of thing can be done at a lower level in the InputFormat,
typically by specifying the max split size. Have you looked into that
possibility with your InputFormat?
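
As a concrete illustration of the max-split-size approach: with mapreduce-API
FileInputFormats the cap can be set through the Hadoop configuration. A minimal
sketch, assuming sc is an existing SparkContext and the input is plain text;
the 64 MB value and the path are arbitrary examples:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Cap each input split at 64 MB so large files are divided into more partitions.
// (The key applies to mapreduce-API FileInputFormats; the value is just an example.)
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (64 * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/input")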

On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu  wrote:

> Hi Jasbir,
>
> Yes, you are right. Do you have any idea about my question?
>
> Thanks,
> Fei
>
> On Mon, Jan 16, 2017 at 12:37 AM,  wrote:
>
>> Hi,
>>
>>
>>
>> Coalesce is used to decrease the number of partitions. If you give a
>> value of numPartitions greater than the current number of partitions, I
>> don't think the RDD's number of partitions will be increased.
>>
>>
>>
>> Thanks,
>>
>> Jasbir
>>
>>
>>
>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>> *Sent:* Sunday, January 15, 2017 10:10 PM
>> *To:* zouz...@cs.toronto.edu
>> *Cc:* user @spark ; d...@spark.apache.org
>> *Subject:* Re: Equally split a RDD partition into two partition at the
>> same node
>>
>>
>>
>> Hi Anastasios,
>>
>>
>>
>> Thanks for your reply. If I just double the numPartitions, how does
>> coalesce(numPartitions: Int, shuffle: Boolean = false) preserve the data
>> locality? Do I need to define my own Partitioner?
>>
>>
>>
>> Thanks,
>>
>> Fei
>>
>>
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
>> wrote:
>>
>> Hi Fei,
>>
>>
>>
>> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>>
>>
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>>
>>
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>>
>>
>> Best,
>>
>> Anastasios
>>
>>
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>>
>> Dear all,
>>
>>
>>
>> I want to equally divide an RDD partition into two partitions: the first
>> half of the elements in the partition will form one new partition, and the
>> second half will form another new partition. But the two new partitions are
>> required to be on the same node as their parent partition, which helps
>> achieve high data locality.
>>
>>
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>>
>>
>> Thanks in advance,
>>
>> Fei
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> -- Anastasios Zouzias
>>
>>
>>
>
>


Executors under utilized

2016-10-06 Thread Pradeep Gollakota
I'm running a job that has one stage with about 60k tasks. The stage was going
pretty well until around 35k tasks had finished, at which point many of the
executors stopped running any tasks. It got to the point where only 4 executors
were working on data, and all 4 were running on the same host. With about
25k tasks still pending, running on only 32 cores (4x8) is quite a huge
slowdown (the full cluster has 288 cores).

My input format is a simple CombineFileInputFormat. I'm not sure what would
cause this behavior to occur. Any thoughts on how I can figure out what is
happening?

Thanks,
Pradeep
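
One thing worth ruling out in a situation like this is delay scheduling: if the
remaining splits all report preferred locations on one host, the scheduler can
keep waiting for local slots there instead of using the rest of the cluster. A
minimal sketch of relaxing the locality wait as a diagnostic, offered as an
assumption rather than a confirmed cause in this thread:

import org.apache.spark.SparkConf

// Shorten how long the scheduler waits for a data-local slot before falling
// back to less-local executors (the default is 3s per locality level).
val conf = new SparkConf()
  .setAppName("locality-wait-sketch")
  .set("spark.locality.wait", "1s")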


Re: Spark Website

2016-07-13 Thread Pradeep Gollakota
It works for me if I go to https://spark.apache.org/site/ but not
https://spark.apache.org

On Wed, Jul 13, 2016 at 11:48 AM, Maurin Lenglart 
wrote:

> Same here
>
>
>
> *From: *Benjamin Kim 
> *Date: *Wednesday, July 13, 2016 at 11:47 AM
> *To: *manish ranjan 
> *Cc: *user 
> *Subject: *Re: Spark Website
>
>
>
> It takes me to the directories instead of the webpage.
>
>
>
> On Jul 13, 2016, at 11:45 AM, manish ranjan  wrote:
>
>
>
> working for me. What do you mean 'as supposed to'?
>
>
> ~Manish
>
>
>
> On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim  wrote:
>
> Has anyone noticed that the spark.apache.org is not working as supposed
> to?
>
>
>
>
>
>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/
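
For the record, a sketch of the union-based workaround discussed below, which
sidesteps the comma-separated path question entirely. It assumes sc is an
existing SparkContext, and the paths are hypothetical examples;
SparkContext#union keeps the lineage flat even for many inputs:

// One RDD per input path, combined with SparkContext#union to avoid the
// long lineage chain that repeated RDD#union calls would build up.
val paths = Seq("hdfs:///data/file1", "hdfs:///data/file2", "hdfs:///data/file3")
val combined = sc.union(paths.map(p => sc.textFile(p)))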

On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang  wrote:

> Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on
> the Spark side we just need to provide an API for that.
>
> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota 
> wrote:
>
>> IIRC, TextInputFormat supports an input path that is a comma separated
>> list. I haven't tried this, but I think you should just be able to do
>> sc.textFile("file1,file2,...")
>>
>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:
>>
>>> I know these workarounds, but wouldn't it be more convenient and
>>> straightforward to use SparkContext#textFiles?
>>>
>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
>>> wrote:
>>>
>>>> For more than a small number of files, you'd be better off using
>>>> SparkContext#union instead of RDD#union.  That will avoid building up a
>>>> lengthy lineage.
>>>>
>>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>>>> wrote:
>>>>
>>>>> Hey Jeff,
>>>>> Do you mean reading from multiple text files? In that case, as a
>>>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>>>> multiple rdds. For example:
>>>>>
>>>>> val lines1 = sc.textFile("file1")
>>>>> val lines2 = sc.textFile("file2")
>>>>>
>>>>> val rdd = lines1 union lines2
>>>>>
>>>>> regards,
>>>>> --Jakob
>>>>>
>>>>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>>>>
>>>>>> Although users can use the HDFS glob syntax to specify multiple inputs,
>>>>>> it is sometimes not convenient to do that. I'm not sure why there's no
>>>>>> SparkContext#textFiles API; it should be easy to implement. I'd love to
>>>>>> create a ticket and contribute it if there's no other consideration that
>>>>>> I'm not aware of.
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma separated
list. I haven't tried this, but I think you should just be able to do
sc.textFile("file1,file2,...")

On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:

> I know these workarounds, but wouldn't it be more convenient and
> straightforward to use SparkContext#textFiles?
>
> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
> wrote:
>
>> For more than a small number of files, you'd be better off using
>> SparkContext#union instead of RDD#union.  That will avoid building up a
>> lengthy lineage.
>>
>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>> wrote:
>>
>>> Hey Jeff,
>>> Do you mean reading from multiple text files? In that case, as a
>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>> multiple rdds. For example:
>>>
>>> val lines1 = sc.textFile("file1")
>>> val lines2 = sc.textFile("file2")
>>>
>>> val rdd = lines1 union lines2
>>>
>>> regards,
>>> --Jakob
>>>
>>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>>
 Although users can use the HDFS glob syntax to specify multiple inputs, it
 is sometimes not convenient to do that. I'm not sure why there's no
 SparkContext#textFiles API; it should be easy to implement. I'd love to
 create a ticket and contribute it if there's no other consideration that
 I'm not aware of.

 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>