Personally, I'd love to see some kind of pluggability and configurability in 
the JSON schema parsing, maybe as an option on the DataFrameReader. Perhaps 
you can propose an API?
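
For concreteness, here is a strawman sketch of what such an option could look 
like; the trait name, option key, and wiring below are made up for discussion, 
not an existing Spark API:

    import org.apache.spark.sql.types.StructType

    // Strawman hook: a user-supplied strategy for turning sampled raw
    // JSON records into the schema the reader should use.
    trait JsonSchemaInference {
      def infer(sample: Iterator[String]): StructType
    }

    // Hypothetical wiring on DataFrameReader, e.g.:
    //   spark.read
    //     .option("schemaInference", classOf[MySchemaInference].getName)
    //     .json("s3://bucket/logs/2017-01-01.sz")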

> On Jan 18, 2017, at 5:51 AM, Brian Hong <sungjinh...@devsisters.com> wrote:
> 
> I work for a mobile game company. I'm solving a simple question: "Can we 
> efficiently/cheaply query for the log of a particular user within given date 
> period?"
> 
> I've created a special JSON text-based file format that has these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date, i.e. 2017-01-01.sz, 2017-01-02.sz, ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id); Snappy 
> block-compressed in 5MB blocks
>  - Blocks are indexed with primary/secondary key in file 2017-01-01.json
>  - Efficient block-based random access on the primary key (log_type) and 
> secondary key (user_id) using the index (a lookup sketch follows this list)
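> 
> To make the index lookup concrete, here is a minimal sketch; the entry 
> fields and the lookup below are illustrative guesses, not the actual 
> on-disk format:
> 
>     // One entry per 5MB Snappy block in 2017-01-01.sz, keyed by the
>     // first (log_type, user_id) pair stored in that block.
>     case class BlockIndexEntry(logType: String, userId: String,
>                                offset: Long, length: Long)
> 
>     // Blocks are sorted by (log_type, user_id), so block i can contain
>     // the target key only if its first key is <= the target and the
>     // next block's first key is >= it.
>     def blocksFor(index: Vector[BlockIndexEntry],
>                   logType: String, userId: String): Vector[BlockIndexEntry] = {
>       val ord = Ordering[(String, String)]
>       val target = (logType, userId)
>       index.indices.filter { i =>
>         ord.lteq((index(i).logType, index(i).userId), target) &&
>         (i == index.size - 1 ||
>          ord.gteq((index(i + 1).logType, index(i + 1).userId), target))
>       }.map(index).toVector
>     }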
> 
> I've created a Spark SQL DataFrame relation that can query this file format.  
> Since the schema of each log type is fairly consistent, I've reused the 
> `InferSchema.inferSchema` method and `JacksonParser` in the Spark SQL code to 
> support structured querying.  I've also implemented filter push-down to 
> optimize the file access.
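> 
> For reference, the filter push-down half is reachable through the public 
> data source API; a minimal sketch (the class name and block-selection logic 
> here are illustrative, not our actual implementation):
> 
>     import org.apache.spark.rdd.RDD
>     import org.apache.spark.sql.{Row, SQLContext}
>     import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
>     import org.apache.spark.sql.types.StructType
> 
>     // A relation that receives pushed-down filters; an EqualTo on
>     // user_id or log_type can be turned into index-guided block reads.
>     class IndexedJsonRelation(path: String, override val schema: StructType)
>                              (@transient val sqlContext: SQLContext)
>       extends BaseRelation with PrunedFilteredScan {
> 
>       override def buildScan(requiredColumns: Array[String],
>                              filters: Array[Filter]): RDD[Row] = {
>         val userIds = filters.collect { case EqualTo("user_id", v) => v.toString }
>         // ... look up the matching 5MB blocks in the index for `path`,
>         // decompress only those, and parse them into Rows ...
>         sqlContext.sparkContext.emptyRDD[Row] // placeholder body
>       }
>     }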
> 
> It is very fast compared to the Parquet file format when querying for a 
> single user, or for a single log type with a 10000-to-1 sampling ratio.  
> (We do use Parquet for some log types when we need batch analysis.)
> 
> One of the problems we face is that the methods above are private APIs, so 
> we are forced to rely on hacks such as copying the code or declaring our own 
> classes inside the org.apache.spark.sql package namespace.
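> 
> To illustrate the package namespace hack, this is roughly what it looks 
> like; the exact private entry points vary by Spark version, so treat the 
> names as placeholders:
> 
>     // Lives in OUR jar, not Spark's; the package declaration alone is
>     // what defeats private[sql]. Fragile across Spark upgrades.
>     package org.apache.spark.sql
> 
>     import org.apache.spark.rdd.RDD
>     import org.apache.spark.sql.types.StructType
> 
>     object PrivateApiBridge {
>       // Hypothetical forwarder; the real call, e.g.
>       // InferSchema.inferSchema(...), differs between versions.
>       def inferJsonSchema(json: RDD[String]): StructType = ???
>     }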
> 
> I've been following the Spark SQL code since 1.4, and the JSON schema 
> inference code and JacksonParser have been relatively stable recently.  Can 
> the core devs make these APIs public?
> 
> We are willing to open-source this file format because it works very well 
> for archiving user-related logs in S3.  Its dependency on private Spark SQL 
> APIs is the main hurdle to making this a reality.
> 
> Thank you for reading!
> 
