Thanks TD.

BTW - If I have an input file of ~250 GB, is there any guideline on whether to use:

        * a single input file (250 GB) (in this case, is there any maximum upper bound?), or 

        * 1000 files of 250 MB each (the HDFS block size is 250 MB), or 

        * files sized at a multiple of the HDFS block size?
Mans





On Friday, July 11, 2014 4:38 PM, Tathagata Das <tathagata.das1...@gmail.com> 
wrote:
 


The model for the file stream is to pick up and process new files that are written 
atomically (i.e., by moving them) into a directory. So your file was processed in a 
single batch, and the stream is now waiting for any new files to be written into that 
directory. 
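
For example, a minimal sketch of this pattern (the directory paths, app name, and 
batch interval below are just placeholders, not from this thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Monitor an HDFS directory for new files, one batch every 30 seconds
    val conf = new SparkConf().setAppName("FileStreamExample")
    val ssc = new StreamingContext(conf, Seconds(30))
    val lines = ssc.textFileStream("hdfs:///data/input")
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()

    // Make new files visible atomically: write them to a staging directory first,
    // then move them into the monitored directory:
    //   hdfs dfs -put local.txt /data/staging/part-001.txt
    //   hdfs dfs -mv /data/staging/part-001.txt /data/input/part-001.txt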

TD



On Fri, Jul 11, 2014 at 11:46 AM, M Singh <mans6si...@yahoo.com> wrote:

So, is it expected for the process to generate stages/tasks even after 
processing a file?
>
>
>Also, is there a way to figure out which file is being processed and when 
>its processing is complete?
>
>
>Thanks
>
>
>
>
>On Friday, July 11, 2014 1:51 PM, Tathagata Das <tathagata.das1...@gmail.com> 
>wrote:
> 
>
>
>Whenever you need to do a shuffle-based operation like reduceByKey, 
>groupByKey, join, etc., the system is essentially redistributing the data 
>across the cluster and it needs to know how many parts it should divide the 
>data into. That's where the default parallelism is used. 
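>
>For illustration, a minimal sketch (the count of 8 is just an example value, and 
>pairs stands in for an existing key-value DStream/RDD):
>
>    // Either set the default parallelism used by shuffle operations...
>    val conf = new SparkConf().set("spark.default.parallelism", "8")
>
>    // ...or pass the partition count explicitly to the operation itself:
>    val counts = pairs.reduceByKey(_ + _, 8)  // 8 reduce tasks instead of the default 2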
>
>
>TD
>
>
>
>On Fri, Jul 11, 2014 at 3:16 AM, M Singh <mans6si...@yahoo.com> wrote:
>
>Hi TD:
>>
>>
>>The input file is on HDFS.  
>>
>>
>>The file is approximately 2.7 GB and when the process starts, there are 11 tasks 
>>(since the HDFS block size is 256 MB) for processing and 2 tasks for reduceByKey. 
>> After the file has been processed, I see new stages with 2 tasks that 
>>continue to be generated. I understand this value (2) is the default value 
>>for spark.default.parallelism, but I don't quite understand how the value is 
>>determined for generating tasks for reduceByKey, how it is used besides 
>>reduceByKey, and what the optimal value for it should be. 
>>
>>
>>Thanks.
>>
>>
>>
>>On Thursday, July 10, 2014 7:24 PM, Tathagata Das 
>><tathagata.das1...@gmail.com> wrote:
>> 
>>
>>
>>How are you supplying the text file? 
>>
>>
>>
>>On Wed, Jul 9, 2014 at 11:51 AM, M Singh <mans6si...@yahoo.com> wrote:
>>
>>Hi Folks:
>>>
>>>
>>>
>>>I am working on an application which uses Spark Streaming (version 1.1.0 
>>>snapshot on a standalone cluster) to process a text file and save counters in 
>>>Cassandra based on fields in each row.  I am testing the application in two 
>>>modes:  
>>>
>>>     * Process each row and save the counter in Cassandra.  In this scenario, 
>>> after the text file has been consumed, no tasks/stages are seen in the 
>>> Spark UI.
>>>
>>>     * If instead I use reduceByKey before saving to Cassandra, the Spark 
>>> UI shows continuous generation of tasks/stages even after processing of the 
>>> file has completed. 
>>>
>>>I believe this is because reduceByKey requires merging of data from 
>>>different partitions.  But I was wondering if anyone has any 
>>>insights/pointers for understanding this difference in behavior and how to 
>>>avoid generating tasks/stages when there is no data (new file) available.
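>>>
>>>For reference, a rough sketch of the two modes (keyOf and the save functions are 
>>>placeholders for my actual logic, and lines is the input DStream):
>>>
>>>    // Mode 1: save each row's counter directly -- no shuffle involved
>>>    lines.foreachRDD { rdd => rdd.foreach(row => saveCounter(row)) }
>>>
>>>    // Mode 2: reduceByKey before saving -- a shuffle-based operation
>>>    lines.map(row => (keyOf(row), 1L))
>>>         .reduceByKey(_ + _)
>>>         .foreachRDD { rdd => rdd.foreach { case (k, c) => saveCount(k, c) } }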
>>>
>>>
>>>Thanks
>>>
>>>Mans
>>
>>
>>
>
>
>
