Is there a way to store all the results in one file and keep the file rollover separate from the Spark Streaming batch interval?
On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote:

> In Spark Streaming you can simply check whether your RDD contains any
> records, and if it does, save them (e.g. via FileOutputStream):
>
> DStream.foreachRDD(t => { val count = t.count(); if (count > 0) { /* SAVE YOUR STUFF */ } })
>
> This will not create unnecessary files of 0 bytes.
>
> On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> Currently, Spark Streaming creates a new directory for every batch and
>> stores the data in it (whether it has anything or not). There is no
>> direct append call as of now, but you can achieve this either with
>> FileUtil.copyMerge
>> <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
>> or have a separate program do the cleanup for you.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>
>>> Spark Streaming seems to be creating 0-byte files even when there is
>>> no data. Also, I have 2 concerns here:
>>>
>>> 1) Extra unnecessary files are being created in the output.
>>> 2) Hadoop doesn't work really well with too many files, and I see that
>>> it is creating a directory with a timestamp every 1 second. Is there a
>>> better way of writing a file, maybe using some kind of append mechanism
>>> where one doesn't have to change the batch interval?
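For reference, a minimal sketch of how the two suggestions above could be combined: skip empty batches (Umesh's point) and merge each batch's part files into a single file with FileUtil.copyMerge (Akhil's point). This assumes Hadoop 2.x, where FileUtil.copyMerge still exists (it was removed in Hadoop 3); the `lines` DStream and the /tmp/stream-out paths are hypothetical placeholders, not from the thread.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.streaming.dstream.DStream

def saveMerged(lines: DStream[String]): Unit = {
  lines.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {  // skip empty batches, so no 0-byte files
      // Hypothetical output layout: one directory of part files per batch.
      val batchDir = s"/tmp/stream-out/batch-${time.milliseconds}"
      rdd.saveAsTextFile(batchDir)
      val conf = new Configuration()
      val fs = FileSystem.get(conf)
      // Merge the batch's part files into one file, deleting the source
      // directory afterwards (deleteSource = true).
      FileUtil.copyMerge(fs, new Path(batchDir),
        fs, new Path(s"/tmp/stream-out/merged-${time.milliseconds}"),
        true, conf, null)
    }
  }
}

Rolling the output on a size or wall-clock schedule independent of the batch interval would then be a separate compaction pass over the merged-* files, along the lines of the cleanup program Akhil mentions.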