> Thanks Gopal. Does ORC conversion have to see the entire data before it can
>  write output? My source table is fairly large (~70 TB) which I am trying to
>  convert to ORC. 

Only for partitioned/bucketed tables.

In case of partitioned tables, you do not want each task opening new files 
(considering there are 70,000 tasks there) in each partition.

The data load is shuffled to bring partitions together, to avoid ending up with 
a few million files per-TB of data.

There are always faster ways to do this, if you control the inputs (insert 1 
day at a time or 1 week) or can make assumptions about them (each file contains 
1 day).

But until you can confirm which of the YARN directories are big, I wouldn't 
start on attempting that yet.

Cheers,
Gopal



Reply via email to