RE: Dataframe nested schema inference from Json without type conflicts

2015-10-23 Thread Ewan Leith
Hi all,

It’s taken us a while, but one of my colleagues has made the pull request on 
github for our proposed solution to this,

https://issues.apache.org/jira/browse/SPARK-10947
https://github.com/apache/spark/pull/9249

It adds a parameter to the JSON read options to force all primitives to be 
read as a String type:

val jsonDf = sqlContext.read.option("primitivesAsString", "true").json(sampleJsonFile)

scala> jsonDf.printSchema()
root
|-- bigInteger: string (nullable = true)
|-- boolean: string (nullable = true)
|-- double: string (nullable = true)
|-- integer: string (nullable = true)
|-- long: string (nullable = true)
|-- null: string (nullable = true)
|-- string: string (nullable = true)
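
As an aside, and not part of the PR itself: with everything read as strings,
the columns can be cast back to whatever types are wanted downstream. A minimal
sketch using the column names from the sample schema above:

import org.apache.spark.sql.functions.col

// Cast selected string columns back to typed columns once the structure
// has been picked up by the string-based inference.
val typedDf = jsonDf
  .withColumn("integer", col("integer").cast("int"))
  .withColumn("double", col("double").cast("double"))
  .withColumn("boolean", col("boolean").cast("boolean"))

typedDf.printSchema()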

Thanks,
Ewan

From: Yin Huai [mailto:yh...@databricks.com]
Sent: 01 October 2015 23:54
To: Ewan Leith 
Cc: r...@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts

Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:

We could, but if a client sends some unexpected records in the schema (which
happens more often than I'd like; our schema seems to constantly evolve), it's
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we’re using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it’ll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we’d rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don’t think there’s currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
Thanks Yin, I'll put together a JIRA and a PR tomorrow.


Ewan


-- Original message--

From: Yin Huai

Date: Mon, 5 Oct 2015 17:39

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this? Also, 
sending out a PR will be great. For JSONRelation, I think we can pass all 
user-specific options to it (see 
org.apache.spark.sql.execution.datasources.json.DefaultSource's createRelation) 
just like what we do for ParquetRelation. Then, inside JSONRelation, we figure
out what kinds of options have been specified.

Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith 
mailto:ewan.le...@realitymine.com>> wrote:
I've done some digging today and, as a quick and ugly fix, altering the case 
statement of the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

to have

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | 
VALUE_FALSE => StringType

rather than the separate rules for each type, works as we'd want.

If we were to wrap this up in a configuration setting in JSONRelation like the 
samplingRatio setting, with the default being to behave as it currently works, 
does anyone think a pull request would plausibly get into the Spark main 
codebase?

Thanks,
Ewan



From: Ewan Leith [mailto:ewan.le...@realitymine.com]
Sent: 02 October 2015 01:57
To: yh...@databricks.com

Cc: r...@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: r...@databricks.com; dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:

We could, but if a client sends some unexpected records in the schema (which
happens more often than I'd like; our schema seems to constantly evolve), it's
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan





Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Yin Huai
Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this?
Also, sending out a PR will be great. For JSONRelation, I think we can pass
all user-specific options to it (see
org.apache.spark.sql.execution.datasources.json.DefaultSource's
createRelation) just like what we do for ParquetRelation. Then, inside
JSONRelation, we figure out what kinds of options have been specified.
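
As a rough illustration of that (the helper object and defaults below are just
assumptions for the sketch, not the actual implementation; only the
samplingRatio and primitivesAsString option names come from this thread):

// Hypothetical helpers showing how JSONRelation could interpret a
// pass-through options map; names and defaults are illustrative only.
object JsonReadOptions {
  def samplingRatio(parameters: Map[String, String]): Double =
    parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)

  def primitivesAsString(parameters: Map[String, String]): Boolean =
    parameters.get("primitivesAsString").exists(_.toBoolean)
}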

Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith 
wrote:

> I’ve done some digging today and, as a quick and ugly fix, altering the
> case statement of the JSON inferField function in InferSchema.scala
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
>
>
>
> to have
>
>
>
> case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE |
> VALUE_FALSE => StringType
>
>
>
> rather than the separate rules for each type, works as we’d want.
>
>
>
> If we were to wrap this up in a configuration setting in JSONRelation like
> the samplingRatio setting, with the default being to behave as it currently
> works, does anyone think a pull request would plausibly get into the Spark
> main codebase?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>
>
>
> *From:* Ewan Leith [mailto:ewan.le...@realitymine.com]
> *Sent:* 02 October 2015 01:57
> *To:* yh...@databricks.com
>
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> Exactly, that's a much better way to put it.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> -- Original message--
>
> *From: *Yin Huai
>
> *Date: *Thu, 1 Oct 2015 23:54
>
> *To: *Ewan Leith;
>
> *Cc: *r...@databricks.com;dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> Hi Ewan,
>
>
>
> For your use case, you only need the schema inference to pick up the
> structure of your data (basically you want spark sql to infer the type of
> complex values like arrays and structs but keep the type of primitive
> values as strings), right?
>
>
>
> Thanks,
>
>
>
> Yin
>
>
>
> On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
> wrote:
>
> We could, but if a client sends some unexpected records in the schema
> (which happens more often than I'd like; our schema seems to constantly
> evolve), it's fantastic how Spark picks up on that data and includes it.
>
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
>
> Thanks,
>
> Ewan
>
>
>
> -- Original message--
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
>
> You can pass the schema into json directly, can't you?
>
>
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
>
> Hi all,
>
>
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we’re using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should use,
> for example if the json in that small batch only contains integer values
> for a String field, it’ll class the field as an Integer type on one
> Streaming batch, then a String on the next one.
>
>
>
> Instead, we’d rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
>
>
> I don’t think there’s currently any simple way to avoid this that I can
> see, but we could add the functionality in the JacksonParser.scala file,
> probably in convertField.
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
>
>
> Does anyone know an easier and cleaner way to do this?
>
>
>
> Thanks,
>
> Ewan
>
>
>
>
>


RE: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
I've done some digging today and, as a quick and ugly fix, altering the case 
statement of the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

to have

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | 
VALUE_FALSE => StringType

rather than the separate rules for each type, works as we'd want.
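
As a standalone illustration of that mapping (a sketch only, not the actual
InferSchema code; it just shows primitive Jackson tokens collapsing to
StringType while structural tokens stay recursive):

import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.core.JsonToken._
import org.apache.spark.sql.types.{DataType, NullType, StringType}

// Illustration only: primitives collapse to StringType; objects and arrays
// would still be inferred recursively as StructType / ArrayType elsewhere.
def primitiveTypeFor(token: JsonToken): Option[DataType] = token match {
  case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT |
       VALUE_TRUE | VALUE_FALSE => Some(StringType)
  case VALUE_NULL => Some(NullType)
  case _ => None // START_OBJECT, START_ARRAY, etc. are handled structurally
}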

If we were to wrap this up in a configuration setting in JSONRelation like the 
samplingRatio setting, with the default being to behave as it currently works, 
does anyone think a pull request would plausibly get into the Spark main 
codebase?

Thanks,
Ewan



From: Ewan Leith [mailto:ewan.le...@realitymine.com]
Sent: 02 October 2015 01:57
To: yh...@databricks.com
Cc: r...@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: r...@databricks.com; dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:

We could, but if a client sends some unexpected records in the schema (which
happens more often than I'd like; our schema seems to constantly evolve), it's
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Exactly, that's a much better way to put it.


Thanks,

Ewan


-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: r...@databricks.com; dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:

We could, but if a client sends some unexpected records in the schema (which
happens more often than I'd like; our schema seems to constantly evolve), it's
fantastic how Spark picks up on that data and includes it.


Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.


Thanks,

Ewan


-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Yin Huai
Hi Ewan,

For your use case, you only need the schema inference to pick up the
structure of your data (basically you want spark sql to infer the type of
complex values like arrays and structs but keep the type of primitive
values as strings), right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
wrote:

> We could, but if a client sends some unexpected records in the schema
> (which happens more often than I'd like; our schema seems to constantly
> evolve), it's fantastic how Spark picks up on that data and includes it.
>
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
>
> Thanks,
>
> Ewan
>
>
> -- Original message--
>
> *From: *Reynold Xin
>
> *Date: *Thu, 1 Oct 2015 22:12
>
> *To: *Ewan Leith;
>
> *Cc: *dev@spark.apache.org;
>
> *Subject:*Re: Dataframe nested schema inference from Json without type
> conflicts
>
>
> You can pass the schema into json directly, can't you?
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
> wrote:
>
>> Hi all,
>>
>>
>>
>> We really like the ability to infer a schema from JSON contained in an
>> RDD, but when we’re using Spark Streaming on small batches of data, we
>> sometimes find that Spark infers a more specific type than it should use,
>> for example if the json in that small batch only contains integer values
>> for a String field, it’ll class the field as an Integer type on one
>> Streaming batch, then a String on the next one.
>>
>>
>>
>> Instead, we’d rather match every value as a String type, then handle any
>> casting to a desired type later in the process.
>>
>>
>>
>> I don’t think there’s currently any simple way to avoid this that I can
>> see, but we could add the functionality in the JacksonParser.scala file,
>> probably in convertField.
>>
>>
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>>
>>
>>
>> Does anyone know an easier and cleaner way to do this?
>>
>>
>>
>> Thanks,
>>
>> Ewan
>>
>
>


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
We could, but if a client sends some unexpected records in the schema (which
happens more often than I'd like; our schema seems to constantly evolve), it's
fantastic how Spark picks up on that data and includes it.


Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.


Thanks,

Ewan


-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan



Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Reynold Xin
You can pass the schema into json directly, can't you?
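
For example, something along these lines; the field names and path below are
just placeholders, the point is only the schema(...) call:

import org.apache.spark.sql.types._

// Supply the schema up front instead of inferring it; primitive fields are
// declared as StringType here, with any casting handled later as needed.
val fixedSchema = StructType(Seq(
  StructField("string", StringType),
  StructField("integer", StringType),
  StructField("nested", StructType(Seq(
    StructField("value", StringType)
  )))
))

val jsonDf = sqlContext.read.schema(fixedSchema).json("/path/to/sample.json")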

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
wrote:

> Hi all,
>
>
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we’re using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should use,
> for example if the json in that small batch only contains integer values
> for a String field, it’ll class the field as an Integer type on one
> Streaming batch, then a String on the next one.
>
>
>
> Instead, we’d rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
>
>
> I don’t think there’s currently any simple way to avoid this that I can
> see, but we could add the functionality in the JacksonParser.scala file,
> probably in convertField.
>
>
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
>
>
> Does anyone know an easier and cleaner way to do this?
>
>
>
> Thanks,
>
> Ewan
>