Hi Singh!

For this use case it's better to have a StreamingContext listening to the
HDFS directory where the files are being dropped. Set the streaming batch
interval to 15 minutes and let the driver program run continuously; as soon
as new files arrive, they are picked up for processing in the next
15-minute batch. This way you don't have to worry about old files unless
you are about to restart the driver program. Another option is, after each
batch is processed, to simply move the processed files to another
directory.
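A minimal sketch of this approach (assuming Spark 1.x; the paths, app name, and the `updateCounter` callback are placeholders, not real APIs beyond the Spark ones shown):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object FileStreamDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamDriver")
    // 15-minute batch interval: each batch picks up only the files that
    // appeared in the monitored directory since the previous batch.
    val ssc = new StreamingContext(conf, Minutes(15))

    // Monitors the directory for newly created files (placeholder path).
    val lines = ssc.textFileStream("hdfs://namenode:8020/incoming")

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Update the Cassandra counters here. Because the operation is
        // non-idempotent, you may also want to move or mark the source
        // files once the batch succeeds, so a restart does not re-count.
        partition.foreach { line =>
          // updateCounter(line)  // hypothetical application callback
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that `textFileStream` only sees files created after the context starts, which is why old files are not a concern while the driver keeps running.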

Thanks
Best Regards


On Thu, Jul 3, 2014 at 6:34 PM, M Singh <mans6si...@yahoo.com> wrote:

> Hi:
>
> I am working on a project where a few thousand text files (~20M in size)
> will be dropped in an hdfs directory every 15 minutes.  Data from the files
> will be used to update counters in cassandra (a non-idempotent operation).
> I was wondering what is the best way to deal with this:
>
>    - Use text streaming and process the files as they are added to the
>    directory
>    - Use non-streaming text input and launch a spark driver every 15
>    minutes to process files from a specified directory (new directory for
>    every 15 minutes).
>    - Use message queue to ingest data from the files and then read data
>    from the queue.
>
> Also, is there a way to find which text file is being processed and
> when a file has been processed, for both the streaming and non-streaming
> RDDs?  I believe the filename is available in the WholeTextFileInputFormat,
> but is it available in standard or streaming text RDDs?
>
> Thanks
>
> Mans
>
