Youjun, what are you trying to do?

If you have existing tables in an incompatible format, you may just want to
leave them as they are for historical data. It depends on why you want to
use Iceberg. If you want to be able to query larger ranges of that data
because you've clustered across files by filter columns, then you'd want to
build the Iceberg metadata. But if you have a lot of historical data that
hasn't been clustered and is unlikely to be rewritten, then keeping old
tables in RCFile and doing new work in Iceberg could be a better option.

You may also want to check how much savings you get out of using Iceberg
with Parquet files vs RCFile. If you find that you can cluster your data
for better queries and that ends up making your dataset considerably
smaller, then maybe it's worth the conversion that Russell suggested.
RCFile is pretty old, so I think there's a good chance you'd save a lot of
space; just updating from an older compression codec to a more modern one,
like moving from snappy to lz4 or from gzip to zstd, could be a big win.
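
For reference, the conversion Russell suggested can be sketched as a Spark
SQL CTAS. The catalog and table names below are placeholders, and the exact
table properties depend on your Iceberg and Spark versions:

```sql
-- Sketch: rewrite an RCFile Hive table as an Iceberg table backed by
-- zstd-compressed Parquet (names are illustrative)
CREATE TABLE iceberg_catalog.db.events
USING iceberg
TBLPROPERTIES (
  'write.format.default' = 'parquet',
  'write.parquet.compression-codec' = 'zstd'
)
AS SELECT * FROM hive_catalog.db.events_rcfile;
```

Comparing total file sizes before and after a conversion like this should
give you a concrete number for the savings.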

Ryan

On Wed, Sep 29, 2021 at 8:49 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Within Iceberg it would take a bit of effort; we would need custom readers
> at a minimum, even if we only wanted read-only support. I think the
> main complexity would be designing the specific readers for the platform
> you want to use, like Spark or Flink; the actual metadata handling and
> such would probably be pretty straightforward. I would definitely size it
> as at least a several-week project, and I'm not sure we would want to
> support it in OSS Iceberg.
>
> On Wed, Sep 29, 2021 at 10:40 AM 袁尤军 <wdyuanyou...@163.com> wrote:
>
>> Thanks for the suggestion. We need to evaluate the cost of converting the
>> format, as those Hive tables have been there for many years, so petabytes
>> of data would need to be reformatted.
>>
>> Also, do you think it is possible to develop support for a new file
>> format? How costly would it be?
>>
>> Sent from my iPhone
>>
>> > On Sep 29, 2021, at 9:34 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >
>> > There is no plan I am aware of to use RCFiles directly in Iceberg.
>> > While we could work to support other file formats, I don't think RCFile
>> > is very widely used compared to ORC and Parquet (Iceberg has native
>> > support for those formats).
>> >
>> > My suggestion for conversion would be to do a CTAS statement in Spark
>> > and have the table completely converted over to Parquet (or ORC). This
>> > is probably the simplest way.
>> >
>> >> On Sep 29, 2021, at 7:01 AM, yuan youjun <yuanyou...@gmail.com> wrote:
>> >>
>> >> Hi community,
>> >>
>> >> I am exploring ways to evolve existing Hive tables (RCFile) into a
>> >> data lake. However, I found that Iceberg (and Hudi and Delta Lake)
>> >> does not support RCFile. So my questions are:
>> >> 1. Is there any plan (or is it possible) to support RCFile in the
>> >> future, so we can manage those existing data files without re-formatting?
>> >> 2. If there is no such plan, do you have any suggestion for migrating
>> >> RCFiles into Iceberg?
>> >>
>> >> Thanks
>> >> Youjun
>>
>>
>>

-- 
Ryan Blue
Tabular
