Re: Multi-Line JSON in SparkSQL
FWIW, CSV has the same problem that renders it immune to naive partitioning. Consider the following RFC 4180-compliant record:

1,2,"all,of,these,are,just,one,field",4,5

Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this nearer the replication level? XML, JSON, and CSV are so pervasive that it almost seems like it could be appropriate -if- enormous JSON files are considered enough of an issue that some basic ETL becomes a non-viable solution.

-Ewan
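A small illustration in the spirit of Ewan's record (Scala standard library only; the newline inside the quoted field is added here to make the splitting problem concrete): an RFC 4180 quoted field may contain commas and even line breaks, so a reader that starts at an arbitrary newline cannot tell whether it is inside a record.

    // Illustrative only: a single RFC 4180 record whose quoted third field
    // contains both commas and a newline.
    val record = "1,2,\"all,of,these,are,\njust,one,field\",4,5"

    // A naive partitioner that treats every newline as a record boundary
    // produces two fragments, neither of which is a well-formed record.
    val fragments = record.split("\n")
    fragments.foreach(println)
    // 1,2,"all,of,these,are,
    // just,one,field",4,5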
Re: Multi-Line JSON in SparkSQL
@reynold, I'll raise a JIRA today. @oliver, let's discuss on the ticket? I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads.

If possible, I think we should handle pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing out. The JSON grammar is simple enough that this feels tractable. (I wonder if there's research on "start anywhere" languages/parsers in general...)

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell
Re: Multi-Line JSON in SparkSQL
I've raised the JSON-related ticket at https://issues.apache.org/jira/browse/SPARK-7366.

@Ewan I think it would be great to support multi-line CSV records too. The motivation is very similar, but my instinct is that little or nothing of the implementation could be usefully shared, so it's better as a separate ticket?

Cheers,
Joe
Re: Multi-Line JSON in SparkSQL
You can check out the following library: https://github.com/alexholmes/json-mapreduce

--
Emre Sevinç
Re: Multi-Line JSON in SparkSQL
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string, rather than a real JSON object starting position.
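A small, purely illustrative Scala REPL sketch of the failure mode Reynold describes (the document text is made up): the first { found after an arbitrary split offset can sit inside a JSON string literal, so treating it as the start of a record is wrong.

    // A JSON document whose string value happens to contain a '{'.
    val doc = """{"name": "a value with a { in it", "id": 42}"""

    // Pretend a split starts at byte offset 10 and the reader seeks forward
    // to the first '{' to find "the next record".
    val splitStart = 10
    val firstBrace = doc.indexOf('{', splitStart)
    println(doc.substring(firstBrace))
    // { in it", "id": 42}   <- not a JSON object at all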
Re: Multi-Line JSON in SparkSQL
I think Reynold's argument shows the impossibility of the general case. But a "maximum object depth" hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I'd certainly be interested in an implementation along those lines.

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell
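A rough sketch of the kind of resynchronisation such a hint might enable (an illustration of the idea only, not the proposed implementation): if records are known to be objects sitting directly inside a top-level array, a reader starting mid-split can look for a closing brace followed by a comma and an opening brace, treat that as a candidate record boundary, and then validate by actually parsing the candidate record.

    // Illustrative heuristic only: find the next plausible record boundary
    // ("}" , optional whitespace, "," , optional whitespace, "{") after `from`.
    // It can still be fooled by string contents, which is why the thread
    // discusses depth hints and pathological cases.
    def candidateBoundary(buf: String, from: Int): Int = {
      val boundary = """\}\s*,\s*\{""".r
      boundary.findFirstMatchIn(buf.drop(from))
        .map(m => from + m.start + 1)   // position just after the closing '}'
        .getOrElse(-1)
    }

    val chunk = """{"a": 1}, {"a": 2}, {"a": 3}"""
    println(candidateBoundary(chunk, 3))  // 8: the offset right after the first record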
Re: Multi-Line JSON in SparkSQL
I was wondering if it's possible to use existing Hive SerDes for this?
Re: Multi-Line JSON in SparkSQL
It's not JSON per se, but data formats like Smile (http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide markers that can't be confused with content, while offering reasonably similar ergonomics.

--
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
Re: Multi-Line JSON in SparkSQL
Joe - I think that's a legit and useful thing to do. Do you want to give it a shot?
Re: Multi-Line JSON in SparkSQL
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines.

Matei
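For what it's worth, something close to this can already be approximated with the stock Hadoop text input format, which honours a custom record delimiter on Hadoop 2.x. A sketch for a Spark 1.x shell session, assuming sc and sqlContext are in scope and the path is hypothetical; the delimiter must never occur inside an object, and each object still has to fit in one task's memory.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // One JSON object per record, records separated by a blank line ("\n\n");
    // Matei's "two blank lines" would be "\n\n\n".
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "\n\n")

    val objects = sc.newAPIHadoopFile(
        "/path/to/multiline.json",            // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString.trim)
      .filter(_.nonEmpty)

    val df = sqlContext.jsonRDD(objects)      // Spark 1.x API, as used in this thread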
Re: Multi-Line JSON in SparkSQL
@joe, I'd be glad to help if you need.
Re: Multi-Line JSON in SparkSQL
How does the Pivotal format decide where to split the files? It seems to me the challenge is deciding that, and off the top of my head the only way to do it is to scan from the beginning and parse the JSON properly, which makes it impractical for large files (though it's doable when the input is a lot of small files). If there is a better way, we should do it.
Re: Multi-Line JSON in SparkSQL
I'll try to study that and get back to you.

Regards,
Olivier.
Multi-Line JSON in SparkSQL
Hi everyone,

Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat, but it's rather inaccessible considering the dependency is not available in any public Maven repo (if you know of one, I'd be glad to hear it).

Is there any plan to address this, or any public recommendation? (The documentation clearly states that sqlContext.jsonFile will not work for multi-line JSON.)

Regards,
Olivier.
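For context, the workaround commonly used at the time was to read each file whole and hand the resulting strings to the JSON reader. A sketch for a Spark 1.x shell, assuming one JSON document per file, files small enough to fit in memory, and a hypothetical path:

    // Each element of wholeTextFiles is (path, fileContents); keep the contents.
    val docs = sc.wholeTextFiles("/path/to/json-dir").map(_._2)

    // jsonRDD parses one complete JSON document per RDD element, so documents
    // spanning multiple lines are fine; jsonFile, by contrast, expects one
    // object per line.
    val df = sqlContext.jsonRDD(docs)
    df.printSchema()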