Hi folks, 
I am writing to ask how to filter and partition a set of files through Spark. 
The situation is that I have N big files (too big to fit on a single machine), 
and each line starts with a category (say Sport, Food, etc.); there are fewer 
than 100 categories in total. I need a program to scan the file set, group the 
lines by category, and save each category separately in its own folder, with 
sensible partitioning.
For instance, I want the program to generate a Sport folder which contains all 
lines whose category is Sport. I would also rather not put everything into a 
single file, which might be too big.
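
For concreteness, here is a rough sketch of the parsing step I have in mind, 
assuming Spark 2.x and a simple format where the category is the first 
comma-separated field of each line (the path and column names are just 
placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SplitByCategory").getOrCreate()
    import spark.implicits._

    // Read every input file as plain text lines, distributed across the cluster.
    val lines = spark.read.textFile("hdfs:///data/input/*")

    // Split each line into its category prefix and the rest of the line.
    // (Assumes every line is well formed and contains a comma.)
    val byCategory = lines.map { line =>
      val sep = line.indexOf(',')
      (line.substring(0, sep), line.substring(sep + 1))
    }.toDF("category", "payload")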
Any ideas on how to implement this logic efficiently in Spark? I believe 
groupBy is not acceptable, since even the data belonging to a single category 
is too big to fit on a single machine.
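
One idea I had is to skip groupBy entirely and let the DataFrame writer lay out 
the folders, continuing from the sketch above. As I understand it, partitionBy 
here only routes records at write time, so no whole category is ever collected 
on one machine, and each folder ends up with one part file per task:

    // Write one sub-folder per category (e.g. .../output/category=Sport/).
    // Each task writes its own part file into each folder it encounters,
    // so no single output file has to hold an entire category.
    byCategory.write
      .partitionBy("category")
      .text("hdfs:///data/output")

But I am not sure how well this behaves when one task sees many categories 
(many output files open at once), hence my question.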
Regards,
Yunsima
