Jakub,

You're right that Spark currently doesn't use the vectorized read path for
nested data, but I'm not sure that's the problem here. With 50k elements in
the f1 array, the large speed-up you see when dropping that column could
easily come from simply not reading or materializing it. The non-vectorized
path is slower, but a slowdown of this magnitude more likely points to the
volume of data in that column than to the read path.

I'd be happy to see vectorization for nested Parquet data move forward, but
before investing in it you should get an idea of how much it would actually
help. Can you use Impala to test whether vectorization makes a difference
here?
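
Even without Impala, a rough way to separate the two effects in Spark
itself is to time the scan with and without the nested column (a sketch
only; the input path and the struct column name "f" below are
placeholders):

  val df = spark.read.parquet("/path/to/day.parquet")
    .where("entity_id = 42 AND timestamp BETWEEN 0 AND 100")

  // Primitive columns only: stays on the vectorized path and lets the
  // scan skip the array column entirely.
  spark.time(df.select("entity_id", "timestamp").count())

  // Forces reading and materializing the nested column: this goes
  // through parquet-mr and decodes every 50k-element array.
  spark.time(df.selectExpr("sum(size(f.f1))").show())

If the second query dominates by orders of magnitude, the cost is mostly
decoding the data itself, and vectorization alone may not close the gap.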

rb



On Mon, Jun 11, 2018 at 6:16 AM, Jakub Wozniak <jakub.wozn...@cern.ch>
wrote:

> Hello,
>
> We have stumbled upon quite degraded performance when reading columns of
> complex (struct, array) types stored in Parquet.
> The Parquet file is around 600 MB (snappy-compressed) with ~400k rows and a
> field of a complex type { f1: array of ints, f2: array of ints }, where the
> f1 array is 50k elements long.
> There are also primitive fields such as entity_id: long and timestamp: long.
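>
> (In Spark schema terms, roughly; "f" is a placeholder name for the
> complex column:)
>
>   import org.apache.spark.sql.types._
>
>   val schema = StructType(Seq(
>     StructField("entity_id", LongType),
>     StructField("timestamp", LongType),
>     StructField("f", StructType(Seq(
>       StructField("f1", ArrayType(IntegerType)),
>       StructField("f2", ArrayType(IntegerType)))))))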
>
> A simple query that selects rows using the predicates entity_id = X and
> timestamp >= T1 and timestamp <= T2, followed by ds.show(), takes 17
> minutes to execute.
> If we remove the complex-type columns from the query, it executes in
> sub-second time.
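>
> (Sketched as code; the literal values and the column name "f" are
> illustrative:)
>
>   val ds = spark.read.parquet("/path/to/file.parquet")
>     .where("entity_id = 42 AND timestamp >= 0 AND timestamp <= 100")
>   ds.show()              // ~17 minutes
>   ds.drop("f").show()    // sub-second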
>
> Now, looking at the implementation of the Parquet data source, the
> Vectorized* classes are used only if all the read types are primitive;
> otherwise the code falls back to the default parquet-mr implementation.
> In VectorizedParquetRecordReader there is a TODO to handle complex types,
> which "should be efficient & easy with codegen".
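>
> The gate looks roughly like this (paraphrasing
> ParquetFileFormat.supportBatch in Spark 2.2, not a verbatim copy):
>
>   def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
>     val conf = sparkSession.sessionState.conf
>     // Every top-level column must be an atomic (non-nested) type; a
>     // single struct/array/map column disables the vectorized reader
>     // for the whole scan.
>     conf.parquetVectorizedReaderEnabled && conf.wholeStageEnabled &&
>       schema.length <= conf.wholeStageMaxNumFields &&
>       schema.forall(_.dataType.isInstanceOf[AtomicType])
>   }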
>
> For our Spark usage at CERN the current execution times are pretty much
> prohibitive, as a lot of our data is stored as arrays / complex types…
> The 600 MB file represents one day of measurements, and our data
> scientists would like to process months or even years of them.
>
> Could you please let me know if anybody is currently working on this, or
> whether it is on the roadmap for the future?
> Or could you give me some suggestions on how to avoid or resolve this
> problem? I'm using Spark 2.2.1.
>
> Best regards,
> Jakub Wozniak
>


-- 
Ryan Blue
Software Engineer
Netflix
