That is internal, but it isn't a lot of code. Can you just copy the
relevant classes over to your project?

On Wed, Jan 18, 2017 at 5:52 AM Brian Hong <sungjinh...@devsisters.com>
wrote:

> I work for a mobile game company. I'm solving a simple question: "Can we
> efficiently/cheaply query for the logs of a particular user within a given
> date period?"
>
> I've created a special JSON text-based file format with these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date, i.e. 2017-01-01.sz, 2017-01-02.sz, ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id), and
> Snappy block compressed in 5 MB blocks
>  - Blocks are indexed by primary/secondary key in a companion file, e.g.
> 2017-01-01.json
>  - Efficient block-based random access on the primary key (log_type) and
> secondary key (user_id) using the index (sketched below)
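>
> Roughly, block-level random access works by consulting the index first and
> reading only the matching Snappy blocks.  The sketch below shows the idea;
> the entry fields (firstUserId, blockOffset, etc.) are illustrative and not
> our exact on-disk layout:
>
>   case class BlockIndexEntry(
>       logType: String,      // primary key covered by the block
>       firstUserId: String,  // secondary key range covered by the block
>       lastUserId: String,
>       blockOffset: Long,    // byte offset of the Snappy block in the .sz file
>       blockLength: Long)    // compressed length of the block
>
>   /** Keep only the blocks that can contain the requested keys. */
>   def candidateBlocks(
>       index: Seq[BlockIndexEntry],
>       logType: String,
>       userId: Option[String]): Seq[BlockIndexEntry] =
>     index.filter { e =>
>       e.logType == logType &&
>         userId.forall(id => id >= e.firstUserId && id <= e.lastUserId)
>     }
>
> Only the byte ranges of the selected blocks then need to be read from S3,
> which is what makes single-user queries cheap.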
>
> I've created a Spark SQL DataFrame relation that can query this file
> format.  Since the schema of each log type is fairly consistent, I've
> reused the `InferSchema.inferSchema` method and `JacksonParser` in the Spark
> SQL code to support structured querying.  I've also implemented filter
> push-down to optimize file access.
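>
> For reference, the relation sits on the public data sources API
> (BaseRelation with PrunedFilteredScan); the simplified sketch below shows
> the overall shape, with the internals elided:
>
>   import org.apache.spark.rdd.RDD
>   import org.apache.spark.sql.{Row, SQLContext}
>   import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter,
>     PrunedFilteredScan}
>   import org.apache.spark.sql.types.StructType
>
>   class UserLogRelation(path: String, val sqlContext: SQLContext)
>       extends BaseRelation with PrunedFilteredScan {
>
>     // In our code the schema comes from InferSchema.inferSchema over a
>     // sample of the JSON records; elided here to keep the sketch short.
>     override def schema: StructType = ???
>
>     // Equality filters on log_type/user_id are pushed down and turned into
>     // index lookups, so only the matching Snappy blocks are read from S3;
>     // each JSON record is then parsed with JacksonParser.
>     override def buildScan(
>         requiredColumns: Array[String],
>         filters: Array[Filter]): RDD[Row] = {
>       val logTypes = filters.collect { case EqualTo("log_type", v: String) => v }
>       val userIds  = filters.collect { case EqualTo("user_id", v: String) => v }
>       ???  // read the selected blocks and parse them
>     }
>   }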
>
> It is very fast when querying for a single user, or for a single log type
> with a sampling ratio of 10000 to 1, compared to the Parquet file format.
> (We do use Parquet for some log types when we need batch analysis.)
>
> One of the problems we face is that the methods mentioned above are private
> APIs, so we are forced to resort to hacks to use them (things like copying
> the code or placing our classes inside the org.apache.spark.sql package
> namespace).
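>
> To make the hack concrete: a file declared inside the org.apache.spark.sql
> package tree can see private[sql] members.  A minimal sketch (the internal
> package path is from the 1.6-era code base and may differ between versions):
>
>   // Declaring our file in Spark's own package grants access to private[sql]
>   // (and, for this exact package, private[json]) members.
>   package org.apache.spark.sql.execution.datasources.json
>
>   object InternalJsonAccess {
>     // Re-export the internal object so the rest of our code base touches
>     // Spark internals in exactly one place.
>     val inferSchema = InferSchema
>   }
>
> This compiles, but it breaks whenever the internals move, which is why we
> would prefer a public API.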
>
> I've been following the Spark SQL code since 1.4, and the JSON schema
> inference code and JacksonParser have been relatively stable recently.
> Could the core devs make these APIs public?
>
> We are willing to open-source this file format because it works very well
> for archiving user-related logs in S3.  The dependency on private Spark SQL
> APIs is the main hurdle in making this a reality.
>
> Thank you for reading!
>
