Funny you should mention that. I tried that first. It failed in saveAsParquetFile with a cryptic error:

java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField( ... 500 columns worth of the same...) [1.7784] failure: `,' expected but `A' found"

I assumed this had to do with not including a schema.
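
For context, the explicit-schema route I had in mind looks roughly like this
(only a sketch against the Spark 1.x SchemaRDD API; the field names and paths
are placeholders, the import location of the StructType classes differs
between 1.x releases, and the jsonRDD overload that takes a schema may not be
in every build):

import org.apache.spark.sql._

// Assuming an existing SparkContext `sc`, as in the spark-shell.
val sqlContext = new SQLContext(sc)

// Placeholder schema -- in practice this would be built at runtime from
// whatever metadata describes the incoming dataset.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Apply the schema while parsing instead of letting Spark infer it.
val json = sc.textFile("hdfs:///data/events.json")
val events = sqlContext.jsonRDD(json, schema)
events.saveAsParquetFile("hdfs:///data/events.parquet")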

On 08/26/2014 03:31 PM, Dmitriy Ryaboy wrote:
Nice -- using Spark to infer the JSON schema. That's also a good way to do it.
Does it handle nesting and everything?


On Tue, Aug 26, 2014 at 12:16 PM, Michael Armbrust <[email protected]> wrote:

A common use case we have been seeing for Spark SQL/Parquet is to take
semi-structured JSON data and transcode it to Parquet.  Queries can then be
run over the Parquet data with a huge speedup.  The nice thing about using
JSON is that it doesn't require you to create POJOs: Spark SQL will
automatically infer the schema for you and create the equivalent Parquet
metadata.


https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
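
In rough sketch form (the paths and table name below are just placeholders,
and the method names follow the 1.x SchemaRDD API, which shifts a bit
between releases):

// Assuming an existing SparkContext `sc`.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read the semi-structured JSON; the schema is inferred automatically.
val raw = sqlContext.jsonFile("hdfs:///raw/events.json")
raw.printSchema()  // inspect what was inferred

// Transcode to Parquet once...
raw.saveAsParquetFile("hdfs:///warehouse/events.parquet")

// ...and run subsequent queries against the columnar copy.
val events = sqlContext.parquetFile("hdfs:///warehouse/events.parquet")
events.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").collect()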


On Tue, Aug 26, 2014 at 11:38 AM, Jim <[email protected]> wrote:

Thanks for the response.

My intention is to have many unrelated datasets (not, if I understand you
correctly, a collection of totally different objects). The datasets can be
extremely wide (1000s of columns), very deep (billions of rows), and very
denormalized (single table), and I need to do quick aggregations of column
data - hence why I thought Parquet/HDFS/Spark was my best current choice.

If ALL I had to do were aggregations I'd pick a column-oriented DB like
Vertica or Hana (or maybe Druid), but I also need to run various machine
learning routines, so the combination of Spark/HDFS/Parquet looked like one
solution for both problems.

Of course, I'm open to other suggestions.

The example you sent looks like what I'm looking for. Thanks!
Jim


On 08/26/2014 02:30 PM, Dmitriy Ryaboy wrote:

1) You don't have to shell out to a compiler to generate code... but that's
complicated :).

2) Avro can be dynamic. I haven't played with that side of the world, but
this tutorial might help get you started (a rough sketch follows after this
list):
https://github.com/AndreSchumacher/avro-parquet-spark-example

3) Do note that you should have 1 schema per dataset (maybe a schema you
didn't know until you started writing the dataset, but a schema nonetheless).
If your notion is to have a collection of totally different objects, Parquet
is a bad choice.
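
Roughly what the dynamic Avro route looks like (a sketch only; the schema,
path, and field names are made up, and the `parquet.avro` package name is
the pre-Apache-rename artifact that was current around this time):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

// Build the Avro schema from a JSON string you only know at runtime --
// no generated classes anywhere.
val schemaJson =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"value","type":["null","string"],"default":null}
    |]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

// Write GenericRecords straight to a Parquet file using that schema.
val writer = new AvroParquetWriter[GenericRecord](
  new Path("/tmp/events.parquet"), schema)
try {
  val rec = new GenericData.Record(schema)
  rec.put("id", 1L)
  rec.put("value", "hello")
  writer.write(rec)
} finally {
  writer.close()
}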

D


On Tue, Aug 26, 2014 at 11:14 AM, Jim <[email protected]> wrote:

Hello all,
I couldn't find a user list, so my apologies if this falls in the wrong
place. I'm looking for a little guidance. I'm a newbie with respect to
Parquet.

We have a use case where we don't want concrete POJOs to represent the data
in our store. Each dataset is unique, its shape isn't known ahead of time,
and we need to handle incoming datasets at runtime.

Examples of how to write to Parquet are sparse, and all of the ones I could
find assume a Thrift/Avro/Protobuf IDL and generated schema and POJOs. I
don't want to dynamically generate an IDL, shell out to a compiler, and
classload the results just to use Parquet. Is there an example that does
what I'm looking for?

Thanks
Jim



