I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext:

* Create a table in Hive stored as an ORC file (I recommend using partitioning too)
* Use SQLContext.sql to insert data into the table
* Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge your many small files into larger files optimized for your HDFS block size
* Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing
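The steps above might look roughly like the following HiveQL, each statement passed to SQLContext.sql (this is only a sketch; the table, column, and partition names are invented for illustration):

```sql
-- 1. Create a partitioned Hive table stored as ORC
CREATE TABLE IF NOT EXISTS events (
  message_id STRING,
  payload    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- 2. Insert each micro-batch into the appropriate partition
--    (each insert produces one or more small ORC files)
INSERT INTO TABLE events PARTITION (event_date = '2015-10-02')
SELECT message_id, payload FROM incoming_batch;

-- 3. Periodically merge the small ORC files of a partition in place
ALTER TABLE events PARTITION (event_date = '2015-10-02') CONCATENATE;
```

Note that CONCATENATE is a Hive feature for ORC (and RCFile) tables, so the table must actually be stored as ORC for step 3 to work.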
Cheers,
Brett

On Fri, Oct 2, 2015 at 3:48 PM, <nib...@free.fr> wrote:
> Hello,
> Yes, but:
> - In the Java API I don't find an API to create an HDFS archive
> - As soon as I receive a message (with a messageID) I need to replace the
> old existing file with the new one (the file name being the messageID); is that
> possible with an archive?
>
> Thanks
> Nicolas
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: nib...@free.fr, "user" <user@spark.apache.org>
> Sent: Monday, September 28, 2015 23:53:56
> Subject: Re: HDFS small file generation problem
>
> Use hadoop archive
>
> On Sun, Sep 27, 2015 at 15:36, <nib...@free.fr> wrote:
>
> Hello,
> I'm still investigating the small file generation problem caused by my
> Spark Streaming jobs.
> My Spark Streaming jobs receive a lot of small events (avg 10 KB), and I
> have to store them in HDFS in order to process them with Pig jobs on demand.
> The problem is that I generate a lot of small files in HDFS (several
> million), and that can be problematic.
> I looked into using HBase or archive files, but in the end I don't want
> to go that way.
> So, what about this solution:
> - Spark Streaming generates several million small files on the fly in HDFS
> - Each night I merge them into one big daily file
> - I run my Pig jobs on this big file
>
> Another question I have:
> - Is it possible to append to a big (daily) file by adding my events on
> the fly?
>
> Thanks a lot
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org