That is internal, but it isn't much code. Can you just copy the relevant classes over to your project?
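If copying is the route you take, the package-namespace trick mentioned below is usually all it needs. A minimal sketch, assuming the Spark 2.0-era location and signature of the schema-inference helper (both the import paths and the `infer` signature have moved between releases, so treat them as assumptions, not a stable API):

    // Sketch only: compiling this file in your own project under the
    // org.apache.spark.sql package makes private[sql] members visible.
    // Assumption: the import paths and the InferSchema.infer signature
    // follow the Spark 2.0-era source tree; they have moved since.
    package org.apache.spark.sql

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.execution.datasources.json.{InferSchema, JSONOptions}
    import org.apache.spark.sql.types.StructType

    object JsonInternals {
      // Forward to the (private) JSON schema inference helper.
      def inferJsonSchema(lines: RDD[String], options: JSONOptions): StructType =
        InferSchema.infer(lines, "_corrupt_record", options)
    }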
On Wed, Jan 18, 2017 at 5:52 AM Brian Hong <sungjinh...@devsisters.com> wrote:
> I work for a mobile game company. I'm solving a simple question: "Can we
> efficiently/cheaply query the logs of a particular user within a given
> date range?"
>
> I've created a special JSON text-based file format with these traits:
> - Snappy compressed, stored in AWS S3
> - Partitioned by date, i.e. 2017-01-01.sz, 2017-01-02.sz, ...
> - Sorted by a primary key (log_type) and a secondary key (user_id), and
>   Snappy block compressed in 5 MB blocks
> - Blocks are indexed by primary/secondary key in the file 2017-01-01.json
> - Efficient block-based random access on the primary key (log_type) and
>   the secondary key (user_id) using the index
>
> I've created a Spark SQL DataFrame relation that can query this file
> format. Since the schema of each log type is fairly consistent, I've
> reused the `InferSchema.inferSchema` method and `JacksonParser` in the
> Spark SQL code to support structured querying. I've also implemented
> filter push-down to optimize file access.
>
> Querying for a single user, or for a single log type with a sampling
> ratio of 10000 to 1, is very fast compared to the Parquet file format.
> (We do use Parquet for some log types when we need batch analysis.)
>
> One of the problems we face is that the methods above are private APIs,
> so we are forced to rely on hacks to use them (things like copying the
> code or placing our classes in the org.apache.spark.sql package
> namespace).
>
> I've been following the Spark SQL code since 1.4, and the JSON schema
> inference code and JacksonParser have been relatively stable recently.
> Could the core devs make these APIs public?
>
> We are willing to open-source this file format because it works very
> well for archiving user-related logs in S3. The dependency on private
> Spark SQL APIs is the main hurdle to making this a reality.
>
> Thank you for reading!
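Worth noting for the thread: the filter push-down half of this doesn't require private APIs; the public data source API (org.apache.spark.sql.sources) has supported pruned, filtered scans since 1.3. A rough sketch against the format described above; `UserLogRelation` and `readBlocks` are hypothetical names, only the trait and `Filter` plumbing are real Spark API:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    // Hypothetical relation over the date-partitioned, block-indexed
    // format described above; only the push-down plumbing is shown.
    class UserLogRelation(path: String,
                          val schema: StructType,
                          val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // Keys Spark managed to push down; anything it couldn't push
        // down is re-evaluated by Spark after the scan, so it is safe
        // to ignore unrecognized filters here.
        val logType = filters.collectFirst { case EqualTo("log_type", v) => v.toString }
        val userId  = filters.collectFirst { case EqualTo("user_id", v)  => v.toString }
        readBlocks(logType, userId, requiredColumns)
      }

      // Hypothetical: consult the per-day .json block index and
      // decompress only the 5 MB Snappy blocks whose key range matches.
      private def readBlocks(logType: Option[String], userId: Option[String],
                             columns: Array[String]): RDD[Row] = ???
    }

Schema inference is the part with no public equivalent, which seems like the stronger case for opening the API up.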