I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext:

* Create a table in Hive stored as an ORC file (I recommend using partitioning too)
* Use SQLContext.sql to insert data into the table
* Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge your many small files into larger files optimized for your HDFS block size
* Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing
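The steps above might look roughly like the following HiveQL, each statement passed to SQLContext.sql (this is only a sketch; the table, column, and partition names are invented for illustration):

```sql
-- 1. Create a partitioned Hive table stored as ORC
CREATE TABLE IF NOT EXISTS events (
  message_id STRING,
  payload    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- 2. Insert each micro-batch into the appropriate partition
--    (each insert produces one or more small ORC files)
INSERT INTO TABLE events PARTITION (event_date = '2015-10-02')
SELECT message_id, payload FROM incoming_batch;

-- 3. Periodically merge the small ORC files of a partition in place
ALTER TABLE events PARTITION (event_date = '2015-10-02') CONCATENATE;
```

Note that CONCATENATE is a Hive feature for ORC (and RCFile) tables, so the table must actually be stored as ORC for step 3 to work.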
Cheers,
Brett

On Fri, Oct 2, 2015 at 3:48 PM, <nib...@free.fr> wrote:
> Hello,
> Yes, but:
> - In the Java API I don't find an API to create an HDFS archive
> - As soon as I receive a message (with a messageID) I need to replace the
> old existing file with the new one (the file name being the messageID); is that
> possible with an archive?
>
> Thanks
> Nicolas
>
> ----- Original Message -----
> From: "Jörn Franke" <jornfra...@gmail.com>
> To: nib...@free.fr, "user" <user@spark.apache.org>
> Sent: Monday, September 28, 2015 23:53:56
> Subject: Re: HDFS small file generation problem
>
> Use hadoop archive
>
> On Sun, Sep 27, 2015 at 15:36, <nib...@free.fr> wrote:
>
> Hello,
> I'm still investigating the small file generation problem caused by my
> Spark Streaming jobs.
> My Spark Streaming jobs receive a lot of small events (avg 10 KB), and I
> have to store them in HDFS in order to process them with Pig jobs on demand.
> The problem is that I generate a lot of small files in HDFS (several
> million), and that can be problematic.
> I looked into using HBase or archive files, but in the end I don't want
> to go that way.
> So, what about this solution:
> - Spark Streaming generates several million small files on the fly in HDFS
> - Each night I merge them into one big daily file
> - I run my Pig jobs on this big file
>
> Another question I have:
> - Is it possible to append to a big (daily) file by adding my events on
> the fly?
>
> Thanks a lot
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org