Re: Is there a limit on the number of tasks in one job?

2016-06-14 Thread Khaled Hammouda
Yes, I checked the Spark UI to follow what’s going on. The job starts a few
tasks fine (8 tasks in my case) out of the ~70k, and then stalls.

I was actually able to get things working by disabling dynamic allocation:
I set the number of executors manually, which turns dynamic allocation off,
and that fixed the problem.
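
In case it helps, the change amounts to roughly the following (a sketch only --
the executor count, cores, and memory below are placeholders picked to match my
cluster; the equivalent --num-executors/--conf flags to spark-submit work too):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Setting a fixed number of executors (spark.executor.instances) means YARN no
# longer scales them dynamically; disabling dynamic allocation explicitly makes
# that intent obvious.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "200")   # placeholder: one per node here
        .set("spark.executor.cores", "8")
        .set("spark.executor.memory", "48g"))     # placeholder value

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)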

My guess is that, when faced with this many backlogged tasks, the dynamic
allocator has trouble launching executors, or something along those lines.
I’m not sure whether this is a bug, but maybe someone familiar with the
internals of dynamic allocation can say whether it’s worth filing one.

I’m using YARN as the resource manager.

Khaled 

> On Jun 13, 2016, at 6:24 PM, Mich Talebzadeh wrote:
> 
> Have you looked at the Spark GUI to see what it is waiting for? Is it
> available memory? What is the resource manager you are using?



Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Takeshi Yamamuro
Hi,

You can control the initial number of partitions (tasks) in v2.0:
https://www.mail-archive.com/user@spark.apache.org/msg51603.html
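
(Assuming that thread is about the new file-source options -- my assumption here --
the knobs in 2.0 look roughly like this; larger values mean fewer, larger input
partitions. "spark" below is the 2.0 SparkSession:)

# Sketch: tune how the 2.0 file-based reader packs small files into partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)  # max bytes per partition
spark.conf.set("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)      # estimated cost of opening each file
df = spark.read.json("hdfs:///user/hadoop/data/*/*")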

// maropu




-- 
---
Takeshi Yamamuro


Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Mich Talebzadeh
Have you looked at the Spark GUI to see what it is waiting for? Is it
available memory? What is the resource manager you are using?

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Khaled Hammouda
Hi Michael,

Thanks for the suggestion to try the Spark 2.0 preview. I just downloaded it
and tried it, but I’m running into the exact same issue.

Khaled




Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Michael Armbrust
You might try the Spark 2.0 preview. We spent a bunch of time
improving the handling of many small files.



Is there a limit on the number of tasks in one job?

2016-06-13 Thread khaled.hammouda
I'm trying to use Spark SQL to load JSON data that is split across about 70k
files in 24 directories in HDFS, using
sqlContext.read.json("hdfs:///user/hadoop/data/*/*").

This doesn't seem to work; I get timeout errors like the following:

---
16/06/13 15:46:31 ERROR TransportChannelHandler: Connection to ip-172-31-31-114.ec2.internal/172.31.31.114:46028 has been quiet for 12 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/06/13 15:46:31 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-31-114.ec2.internal/172.31.31.114:46028 is closed
...
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
---
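
For reference, raising the timeouts those messages point at would presumably
look like this (a sketch via SparkConf, with placeholder values):

from pyspark import SparkConf

# Both settings accept durations such as "600s"; per the log messages above,
# spark.network.timeout covers the quiet-connection check and
# spark.rpc.askTimeout covers the RpcTimeoutException.
conf = (SparkConf()
        .set("spark.network.timeout", "600s")
        .set("spark.rpc.askTimeout", "600s"))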

I don't want to start tinkering with increasing timeouts yet, though. I tried
loading just one sub-directory, which contains around 4k files, and that seems
to work fine. So I thought of writing a loop that loads the JSON files from
each sub-directory and unionAlls each new dataframe with the accumulated one.
However, this also fails, because the JSON files apparently don't all have
exactly the same schema, which causes this error:

---
Traceback (most recent call last):
  File "/home/hadoop/load_json.py", line 65, in <module>
    df = df.unionAll(hrdf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 998, in unionAll
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u"unresolved operator 'Union;"
---
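
For reference, the loop is roughly the following (a sketch of what load_json.py
does; "subdirs" here is just a stand-in for the list of the 24 directory names):

# Load each sub-directory separately and union the results together.
df = None
for d in subdirs:
    hrdf = sqlContext.read.json("hdfs:///user/hadoop/data/%s/*" % d)
    df = hrdf if df is None else df.unionAll(hrdf)  # this unionAll is what fails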

I'd like to know what's preventing Spark from loading 70k files the same way
it loads 4k files.

To give you some idea about my setup and data:
- ~70k files across 24 directories in HDFS
- Each directory contains 3k files on average
- Cluster: 200-node EMR cluster; each node has 53 GB of memory and 8 cores
available to YARN
- Spark 1.6.1

Thanks.


