On Wed, Jul 7, 2010 at 9:11 PM, Todd Lee <[email protected]> wrote:
> thanks. but is it going to create 1 big file in HDFS? I am currently
> considering writing my own cascading job for this.
> thx,
> T
>
> On Wed, Jul 7, 2010 at 6:06 PM, Sarah Sproehnle <[email protected]> wrote:
>>
>> Hi Todd,
>>
>> Are you planning to use Sqoop to do this import? If not, you should.
>> :) It will do a parallel import, using MapReduce, to load the table
>> into Hadoop. With the --hive-import option, it will also create the
>> Hive table definition.
>>
>> Cheers,
>> Sarah
>>
>> On Wed, Jul 7, 2010 at 5:51 PM, Todd Lee <[email protected]> wrote:
>> > Hi,
>> > I am new to Hive and Hadoop in general. I have a table in Oracle that
>> > has millions of rows and I'd like to export it into HDFS so that I can
>> > run some Hive queries. My first question is, is it recommended to
>> > export the entire table as a single file (possibly 5GB), or more files
>> > with smaller sizes (10 files each 500mb)? also, does it matter if I put
>> > the files under different sub-directories before I do the data load in
>> > Hive? or everything has to be under the same folder?
>> > Thanks,
>> > T
>> > p.s. I am sorry if this post is submitted twice.
>>
>>
>>
>> --
>> Sarah Sproehnle
>> Educational Services
>> Cloudera, Inc
>> http://www.cloudera.com/training
>
>
Hadoop does not handle many small files well; look up the "Hadoop small files problem" if you're curious (each file adds metadata on the NameNode and typically gets its own map task). Performance-wise, you should try to have as few files as possible, but you should notice no difference in runtime between 1, 5, or even 500 files when your data is as big as 5 GB.
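
If you do go the Sqoop route Sarah suggested, the number of output files simply falls out of the number of map tasks, which you control with -m. A rough sketch of the kind of invocation I mean (the connect string, credentials, table name, split column and mapper count are placeholders for your setup; double-check the flags against the docs for your Sqoop version):

  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username scott -P \
    --table MY_TABLE \
    --split-by ID \
    -m 8 \
    --hive-import

With -m 8 you would get 8 roughly equal files for the table, which is comfortably in the "few larger files" range for 5 GB of data, and --hive-import takes care of creating the Hive table definition for you.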
