Thanks for the response.
My intention is to have many unrelated datasets (not, if I understand
you correctly, a collection of totally different objects). The datasets
can be extremely wide (1000s of columns), very deep (billions of
rows), and very denormalized (single table), and I need to do quick
aggregations of column data - hence why I thought Parquet/HDFS/Spark was
my best current choice.
If ALL I had to do were aggregations, I'd pick a column-oriented DB like
Vertica or HANA (or maybe Druid), but I also need to run various machine
learning routines, so the combination of Spark/HDFS/Parquet looked like
one solution for both problems.
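For context, the aggregation side of it looks roughly like this (just a
sketch against the Spark SQL API in the current 1.x line; the path, table,
and column names are placeholders, not our real schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object QuickAggregation {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("quick-aggregation"))
        val sqlContext = new SQLContext(sc)

        // Load one wide, denormalized Parquet dataset (path is a placeholder)
        val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
        events.registerTempTable("events")  // registerAsTable on Spark 1.0.x

        // A quick column aggregation over the full table
        val byRegion = sqlContext.sql(
          "SELECT region, COUNT(*) AS n, AVG(latency_ms) AS avg_latency " +
          "FROM events GROUP BY region")
        byRegion.collect().foreach(println)

        sc.stop()
      }
    }

The same RDDs could then feed MLlib for the machine learning side, which
is why one stack for both problems looked attractive.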
Of course, I'm open to other suggestions.
The example you sent looks like what I'm looking for. Thanks!
Jim
On 08/26/2014 02:30 PM, Dmitriy Ryaboy wrote:
1) you don't have to shell out to a compiler to generate code... but that's
complicated :).
2) Avro can be dynamic. I haven't played with that side of the world, but
this tutorial might help get you started (there's also a rough
GenericRecord sketch after point 3):
https://github.com/AndreSchumacher/avro-parquet-spark-example
3) Do note that you should have 1 schema per dataset (maybe a schema you
didn't know until you started writing the dataset, but a schema
nonetheless). If your notion is to have a collection of totally different
objects, parquet is a bad choice.
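Something along these lines should work for the dynamic case in point 2
(untested sketch; the schema, field names, and output path are made up,
and the writer class may live under org.apache.parquet.avro in newer
releases):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import parquet.avro.AvroParquetWriter

    object DynamicParquetWrite {
      def main(args: Array[String]): Unit = {
        // Build the Avro schema at runtime from JSON -- no IDL file, no codegen, no classloading
        val schemaJson =
          """{"type": "record", "name": "IncomingRow", "fields": [
            |  {"name": "id", "type": "string"},
            |  {"name": "value", "type": "double"}
            |]}""".stripMargin
        val schema = new Schema.Parser().parse(schemaJson)

        // Write GenericRecords instead of generated POJOs
        val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/incoming.parquet"), schema)
        try {
          val rec = new GenericData.Record(schema)
          rec.put("id", "row-1")
          rec.put("value", 42.0)
          writer.write(rec)
        } finally {
          writer.close()
        }
      }
    }

Reading works the same way with AvroParquetReader and GenericRecord, and
Spark can load the resulting files directly as Parquet.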
D
On Tue, Aug 26, 2014 at 11:14 AM, Jim <[email protected]> wrote:
Hello all,
I couldn't find a user list so my apologies if this falls in the wrong
place. I'm looking for a little guidance. I'm a newbie with respect to
Parquet.
We have a use case where we don't want concrete POJOs to represent data in
our store. It's dynamic in that each dataset is unique and we need to
handle incoming datasets at runtime.
Examples of how to write to Parquet are sparse, and all of the ones I could
find assume a Thrift/Avro/Protobuf IDL with generated schemas and POJOs. I
don't want to dynamically generate an IDL, shell out to a compiler, and
classload the results in order to use Parquet. Is there an example that
does what I'm looking for?
Thanks
Jim