Hi Akhil:

Thanks for your response.

Mans



On Thursday, July 3, 2014 9:16 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

Hi Singh!

For this use case it's better to have a StreamingContext listening to the HDFS
directory where the files are being dropped. Set the batch interval to 15
minutes and let the driver program run continuously, so new files are picked up
for processing as each batch fires. That way you don't have to worry about old
files unless you are about to restart the driver program. Another approach is to
simply move the processed files to another directory after each batch completes.
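A minimal sketch of the streaming approach, assuming a 15-minute batch interval; the HDFS path and the counter-update body are placeholders, not from this thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object HdfsDirStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsDirStream")
    // 15-minute batches: each batch picks up files that appeared since the last one
    val ssc = new StreamingContext(conf, Minutes(15))

    // textFileStream monitors the directory and only sees files created
    // after the stream starts, so pre-existing (old) files are ignored
    val lines = ssc.textFileStream("hdfs:///data/incoming") // hypothetical path

    lines.foreachRDD { rdd =>
      // Update the Cassandra counters here. Since the update is
      // non-idempotent, make sure each file is processed exactly once.
      rdd.foreach { line =>
        // placeholder for the counter update
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that a new file must be moved or renamed atomically into the watched directory so the stream does not pick up a partially written file.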


Thanks
Best Regards


On Thu, Jul 3, 2014 at 6:34 PM, M Singh <mans6si...@yahoo.com> wrote:

> Hi:
>
> I am working on a project where a few thousand text files (~20M in size) will
> be dropped in an HDFS directory every 15 minutes.  Data from the files will be
> used to update counters in Cassandra (a non-idempotent operation).  I was
> wondering what is the best way to deal with this:
>       * Use text streaming and process the files as they are added to the
>         directory
>       * Use non-streaming text input and launch a Spark driver every 15
>         minutes to process files from a specified directory (a new directory
>         for every 15 minutes)
>       * Use a message queue to ingest data from the files and then read data
>         from the queue
> Also, is there a way to find which text file is being processed and when a
> file has been processed, for both the streaming and non-streaming RDDs?  I
> believe the filename is available in WholeTextFileInputFormat, but is it
> available in standard or streaming text RDDs?
>
> Thanks
>
> Mans
