Your best bet might be to use a map<string,string> in SQL and make the keys longer paths (e.g. params_param1 and params_param2). I don't think you can have a map in some records but not in others.
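A minimal sketch of what that flattening could look like with the Spark 1.3-era types API follows below; the column names, the flatten helper, and the choice to stringify every value are illustrative assumptions, not anything prescribed here:

import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema with a flat map column; nested "params" keys get folded into
// longer path keys such as "params_arbitrary-param-1".
val schema = StructType(Seq(
  StructField("timestamp", TimestampType, nullable = false),
  StructField("data", MapType(StringType, StringType), nullable = true)))

// Recursively flatten an already-parsed Map[String, Any] into a
// Map[String, String], joining nested keys with "_".
def flatten(prefix: String, value: Any): Map[String, String] = value match {
  case m: Map[_, _] =>
    m.toSeq.flatMap { case (k, v) =>
      val key = if (prefix.isEmpty) k.toString else s"${prefix}_$k"
      flatten(key, v)
    }.toMap
  case other => Map(prefix -> s"$other")
}

// Example record shaped like the "purchase" event below.
val data: Map[String, Any] = Map(
  "event" -> "purchase",
  "sku" -> "123456789",
  "quantity" -> 1,
  "params" -> Map("arbitrary-param-1" -> "blah", "arbitrary-param-2" -> 123456))

val row = Row(Timestamp.valueOf("2015-01-01 08:00:00"), flatten("", data))

// With a SQLContext in scope (Spark 1.3+), rows like this can then be
// turned into a DataFrame and queried with plain SQL, e.g.:
//   val df = sqlContext.createDataFrame(sc.parallelize(Seq(row)), schema)
//   df.registerTempTable("events")
//   sqlContext.sql("SELECT data['params_arbitrary-param-1'] FROM events")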
Matei

> On May 28, 2015, at 3:48 PM, Jeremy Lucas <jeremyalu...@gmail.com> wrote:
>
> Hey Reynold,
>
> Thanks for the suggestion. Maybe a better definition of what I mean by a
> "recursive" data structure is rather what might resemble (in Scala) the type
> Map[String, Any]. With a type like this, the keys are well-defined as strings
> (as this is JSON) but the values can be basically any arbitrary value,
> including another Map[String, Any].
>
> For example, in the below "stream" of JSON records:
>
> {
>   "timestamp": "2015-01-01T00:00:00Z",
>   "data": {
>     "event": "click",
>     "url": "http://mywebsite.com"
>   }
> }
> ...
> {
>   "timestamp": "2015-01-01T08:00:00Z",
>   "data": {
>     "event": "purchase",
>     "sku": "123456789",
>     "quantity": 1,
>     "params": {
>       "arbitrary-param-1": "blah",
>       "arbitrary-param-2": 123456
>     }
>   }
> }
>
> I am trying to figure out a way to run Spark SQL over the above JSON records.
> My inclination would be to define the "timestamp" field as a well-defined
> DateType, but the "data" field is way more free-form.
>
> Also, any pointers on where to look for how data types are evaluated and
> serialized/deserialized would be super helpful as well.
>
> Thanks
>
>
> On Thu, May 28, 2015 at 12:30 AM Reynold Xin <r...@databricks.com> wrote:
>
> I think it is fairly hard to support recursive data types. What I've seen in
> one other proprietary system in the past is to let the user define the depth
> of the nested data types, and then just expand the struct/map/list definition
> to the maximum level of depth.
>
> Would this solve your problem?
>
>
> On Wed, May 20, 2015 at 6:07 PM, Jeremy Lucas <jeremyalu...@gmail.com> wrote:
>
> Hey Rakesh,
>
> To clarify, what I was referring to is when doing something like this:
>
> sqlContext.applySchema(rdd, mySchema)
>
> mySchema must be a well-defined StructType, which presently does not allow
> for a recursive type.
>
>
> On Wed, May 20, 2015 at 5:39 PM Rakesh Chalasani <vnit.rak...@gmail.com> wrote:
>
> Hi Jeremy:
>
> Row is a collection of 'Any', so it can be used as a recursive data type. Is
> this what you were looking for?
>
> Example:
> val x = sc.parallelize(Array.range(0, 10)).map(x => Row(Row(x), Row(x.toString)))
>
> Rakesh
>
>
> On Wed, May 20, 2015 at 7:23 PM Jeremy Lucas <jeremyalu...@gmail.com> wrote:
>
> Spark SQL has proven to be quite useful in applying a partial schema to large
> JSON logs and being able to write plain SQL to perform a wide variety of
> operations over this data. However, one small thing that keeps coming back to
> haunt me is the lack of support for recursive data types, whereby a member of
> a complex/struct value can be of the same type as the complex/struct value
> itself.
>
> I am hoping someone may be able to point me in the right direction of where
> to start to build out such capabilities, as I'd be happy to contribute, but
> am very new to this particular component of the Spark project.
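For reference, Reynold's "expand to the maximum level of depth" suggestion above could be sketched roughly as below when building a StructType by hand. The field names, the depth cap of 3, and the map<string,string> leaf are illustrative assumptions, not anything prescribed in the thread:

import org.apache.spark.sql.types._

// Instead of a truly recursive StructType, expand the nested "params"
// definition to a fixed maximum depth chosen up front; the deepest
// level falls back to a flat map of strings.
def nestedParams(depth: Int): DataType =
  if (depth == 0) MapType(StringType, StringType)
  else StructType(Seq(
    StructField("value", StringType, nullable = true),
    StructField("params", nestedParams(depth - 1), nullable = true)))

val schema = StructType(Seq(
  StructField("timestamp", TimestampType, nullable = false),
  StructField("data", StructType(Seq(
    StructField("event", StringType, nullable = false),
    StructField("params", nestedParams(3), nullable = true)  // depth capped at 3
  )), nullable = true)))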