Hello,

Here are some quick responses. TL;DR: a few hundred columns shouldn't be a
problem relative to storing the data in other formats, but an order of
magnitude more columns probably means an order of magnitude more data, so
there's of course going to be a big difference there (unless the new
columns are very often null).

>> What is the overhead of updating the schema?
Well, technically none: you'll be writing new parquet files with new data
rather than updating files in place, and you'll probably keep the old files
as they are.

When you evolve a schema, if you want to be able to read your old parquet
files alongside your new ones, you need to evolve the schema in a
backwards-compatible way -- e.g. by only adding new, optional fields.
(Adding a new required field means your old data can't satisfy that
required constraint.)

>> When scaling to hundreds of columns, do the performances decrease
noticeably?

More columns usually means more data, unless the columns are frequently
null. More data means more work in the write path and more storage on
disk, though the read path may not degrade if you project to just the
columns you need (the read path should scale with the number of columns
selected). But this isn't really a degradation specific to Parquet -- it's
true of pretty much any storage format.

A little more detail on this:

1) The memory footprint in the write path scales with the number of
columns, and with how often those columns have data in them (columns of
mostly nulls are cheaper than columns that are often not null). The time it
takes to write also scales with the number of columns and how often they
aren't null, as you'd imagine. On the plus side, Parquet can encode
millions of consecutive nulls in a few bytes, which is nice.

2) That said, we have some schemas with about 4,000 columns and we're able
to write them without an insane amount of heap space (3 to 6 GB, I think),
though this may rely on the fact that many of those columns are frequently
null. Hundreds of columns should be fine.

3) In the read path, read performance scales with the number of columns
you load. Parquet can actually perform worse than record-oriented storage
formats if you load all of your data's columns. Parquet has the benefit of
compressing better than row-oriented storage, which means better read
performance, but it also has to do work to re-assemble the columns into
records, so these two effects compete when you load many columns.

>> Do the performance of reading from/writing to Parquet files depend on
the processing system used like: Impala, Hive, Spark, etc.?

Hopefully Impala is much faster, as it's written in C++ with a focus on
speed. As for Hive, Spark, plain MapReduce, Scalding, Cascading, etc.,
those all share the same core (Java) implementation of Parquet, so they
are probably relatively similar in speed. Where they might differ is in
the object model you use -- Hive uses its own object model, while with
Scalding/Cascading you might use Thrift, Avro, or Protobuf, and those
implementations probably have differences in performance (though that
probably falls under "don't worry about it until it's a problem").



On Fri, May 8, 2015 at 11:26 AM, Mohamed Nadjib MAMI <[email protected]>
wrote:

> Dear all,
>
> I'm in a decisive moment where I should have answers to few questions. So
> any (prompt, even incomplete) answer is highly appreciated.
>
> - What is the overhead of updating the schema? in my case, only adding new
> columns; knowing that the schema could evolve from few tens to few hundreds
> of columns.
> - When scaling to hundreds of columns, do the performances decrease
> noticeably?
> - Do the performance of reading from/writing to Parquet files depend on
> the processing system used like: Impala, Hive, Spark, etc.?
>
> Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي.
> Mohamed Nadjib Mami
> PhD Student - EIS Group- Bonn University (Germany).
> About me! <http://www.strikingly.com/mohamed-nadjib-mami> LinkedIn <
> http://fr.linkedin.com/in/mohamednadjibmami/>
>



-- 
Alex Levenson
@THISWILLWORK
