Hi: I am working on a project where a few thousand text files (~20M in size each) will be dropped into an HDFS directory every 15 minutes. Data from the files will be used to update counters in Cassandra (a non-idempotent operation). I was wondering what the best way to deal with this is:

* Use streaming text input and process the files as they are added to the directory.
* Use non-streaming text input and launch a Spark driver every 15 minutes to process files from a specified directory (a new directory for every 15-minute window).
* Use a message queue to ingest data from the files and then read the data from the queue.

Also, is there a way to find which text file is being processed, and when a file has been processed, for both streaming and non-streaming RDDs? I believe the filename is available in WholeTextFileInputFormat, but is it available in standard or streaming text RDDs?
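To make the tracking requirement concrete, here is a rough non-Spark sketch of the bookkeeping I have in mind, in plain Python with the local filesystem standing in for HDFS. The `handle` callback is hypothetical and stands in for the Cassandra counter update; since that update is non-idempotent, the point is that each file is handed to it at most once across repeated scans:

```python
import os

def process_new_files(directory, processed, handle):
    """Scan `directory` and invoke `handle(name, contents)` on each file
    not yet seen.

    `processed` is a set of filenames already handled, so each file is
    processed at most once even if the scan runs every 15 minutes.
    Returns the list of filenames handled in this pass.
    """
    handled = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name in processed or not os.path.isfile(path):
            continue
        with open(path, "r") as f:
            # The non-idempotent work (e.g. updating Cassandra counters)
            # would happen inside `handle`; `name` tells us which file
            # is currently being processed.
            handle(name, f.read())
        processed.add(name)
        handled.append(name)
    return handled
```

A second scan over the same directory returns an empty list, since every filename is already in the `processed` set; that is the exactly-once behaviour I need from whichever Spark approach I end up using.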
Thanks,
Mans