Personally, I'd love to see some kind of pluggability or configurability in the JSON schema parsing, maybe as an option on the DataFrameReader. Perhaps you can propose an API?
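To make that concrete, here is a rough sketch of the kind of hook I have in mind; note that the "schemaInference" option and the `JsonSchemaInference` trait below are invented for illustration and are not existing Spark APIs:

```scala
// Purely illustrative: the "schemaInference" option and the JsonSchemaInference
// trait do not exist in Spark today; they show what a pluggable hook could look like.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// A user-supplied strategy for turning sampled JSON records into a schema,
// used instead of the built-in inference.
trait JsonSchemaInference extends Serializable {
  def infer(sampledRecords: Iterator[String]): StructType
}

object PluggableJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pluggable-json").getOrCreate()

    // Hypothetical usage: the strategy is registered by class name, the same
    // way existing reader options such as "samplingRatio" are passed today.
    val logs = spark.read
      .option("schemaInference", "com.example.MyLogSchemaInference")
      .option("samplingRatio", "0.0001")
      .json("s3a://bucket/logs/2017-01-01.sz")

    logs.printSchema()
  }
}
```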
> On Jan 18, 2017, at 5:51 AM, Brian Hong <sungjinh...@devsisters.com> wrote:
>
> I work for a mobile game company. I'm solving a simple question: "Can we
> efficiently/cheaply query for the log of a particular user within a given
> date period?"
>
> I've created a special JSON text-based file format that has these traits:
> - Snappy compressed, saved in AWS S3
> - Partitioned by date, i.e. 2017-01-01.sz, 2017-01-02.sz, ...
> - Sorted by a primary key (log_type) and a secondary key (user_id), Snappy
>   block compressed in 5 MB blocks
> - Blocks are indexed by primary/secondary key in the file 2017-01-01.json
> - Efficient block-based random access on the primary key (log_type) and
>   secondary key (user_id) using the index
>
> I've created a Spark SQL DataFrame relation that can query this file format.
> Since the schema of each log type is fairly consistent, I've reused the
> `InferSchema.inferSchema` method and `JacksonParser` in the Spark SQL code to
> support structured querying. I've also implemented filter push-down to
> optimize file access.
>
> It is very fast when querying for a single user or for a single log type,
> with a sampling ratio of 10000 to 1 compared to the Parquet file format.
> (We do use Parquet for some log types when we need batch analysis.)
>
> One of the problems we face is that the methods mentioned above are private
> APIs, so we are forced to use hacks to call them (things like copying the
> code or using the org.apache.spark.sql package namespace).
>
> I've been following the Spark SQL code since 1.4, and the JSON schema
> inference code and JacksonParser seem to have been relatively stable
> recently. Can the core devs make these APIs public?
>
> We are willing to open-source this file format because it is excellent for
> archiving user-related logs in S3. The dependency on private Spark SQL APIs
> is the main hurdle in making this a reality.
>
> Thank you for reading!
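For reference, the public side of the relation described above can already be expressed through the Data Sources API; a minimal sketch, assuming hypothetical `readBlocksFor` and `parseLine` helpers for the block-index lookup and the per-line JSON parsing (the parsing step being exactly where the private APIs come in), might look like this:

```scala
// Rough sketch only: readBlocksFor and parseLine are hypothetical stand-ins
// for the block-index lookup and for the JSON parsing that currently relies
// on the private InferSchema/JacksonParser code.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

class IndexedJsonRelation(path: String, dataSchema: StructType)
                         (@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = dataSchema

  // Equality filters on the indexed keys are pushed down so that only the
  // matching Snappy blocks are read from S3.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val logType = filters.collectFirst { case EqualTo("log_type", v) => v.toString }
    val userId  = filters.collectFirst { case EqualTo("user_id", v)  => v.toString }

    // Hypothetical: consult the per-day .json block index and read only the
    // byte ranges that can contain the requested keys.
    val jsonLines: RDD[String] = readBlocksFor(path, logType, userId)

    // Each JSON line is then converted to a Row against `schema`; this is the
    // step where the private JacksonParser is currently being reused.
    jsonLines.map(line => parseLine(line, schema, requiredColumns))
  }

  private def readBlocksFor(path: String,
                            logType: Option[String],
                            userId: Option[String]): RDD[String] = ???

  private def parseLine(line: String,
                        schema: StructType,
                        columns: Array[String]): Row = ???
}
```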