Hi:

I am working on a project where a few thousand text files (~20 MB each) will 
be dropped into an HDFS directory every 15 minutes.  Data from the files will 
be used to update counters in Cassandra (a non-idempotent operation).  I was 
wondering what is the best way to deal with this:
        * Use text file streaming and process the files as they are added to 
the directory.
        * Use non-streaming text input and launch a Spark driver every 15 
minutes to process files from a specified directory (a new directory for every 
15-minute window).
        * Use a message queue to ingest data from the files and then read the 
data from the queue.
Also, is there a way to find out which text file is being processed, and when a 
file has been fully processed, for both the streaming and non-streaming RDDs?  
I believe the filename is available with WholeTextFileInputFormat, but is it 
available in standard or streaming text RDDs?
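To illustrate the non-streaming case, this is the kind of thing I have in mind; a rough sketch only (the path and the length-counting logic are just placeholders, not my real job), relying on wholeTextFiles returning (path, content) pairs so the filename is explicit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FileNameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filename-sketch"))

    // wholeTextFiles yields an RDD[(String, String)] of (file path, file content),
    // so each record carries the name of the file it came from
    val withNames = sc.wholeTextFiles("hdfs:///data/incoming")

    withNames
      .map { case (path, content) => (path, content.length) } // placeholder per-file work
      .collect()
      .foreach { case (path, len) => println(s"$path -> $len chars") }

    sc.stop()
  }
}
```

But I don't see an equivalent way to recover the path from sc.textFile or from a streaming text source, which is what prompted the question.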

Thanks

Mans 
