Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Ewan Higgs

FWIW, CSV has the same problem, which renders it immune to naive partitioning.

Consider the following RFC 4180 compliant record:

1,2,"
all,of,these,are,just,one,field
",4,5
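
Parsed per RFC 4180, those three physical lines come back as one logical record
with five fields. A minimal sketch of that check, assuming Apache Commons CSV on
the classpath (the object name is hypothetical):

    import java.io.StringReader
    import org.apache.commons.csv.CSVFormat
    import scala.collection.JavaConverters._

    object QuotedNewlineDemo {
      def main(args: Array[String]): Unit = {
        // The quoted third field spans two line breaks.
        val data = "1,2,\"\nall,of,these,are,just,one,field\n\",4,5"
        val records =
          CSVFormat.RFC4180.parse(new StringReader(data)).getRecords.asScala
        println(records.size)        // 1 -- one record, not three
        println(records.head.get(2)) // the embedded multi-line field
      }
    }

A line-oriented splitter that lands on the second physical line has no local way
of knowing it is inside a quoted field, which is what breaks naive partitioning.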

Now, it's probably a terrible idea to give a file system awareness of 
actual file types, but couldn't HDFS handle this nearer the replication 
level? XML, JSON, and CSV are so pervasive it almost seems like it could 
be appropriate -if- enormous JSON files are considered enough of an 
issue that some basic ETL becomes a non-viable solution.


-Ewan

Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Joe Halliwell
@reynold, I’ll raise a JIRA today. @oliver, let’s discuss on the ticket?

I suspect the algorithm is going to be a bit fiddly and would definitely benefit
from multiple heads. If possible, I think we should handle pathological cases
like {":":":",{"{":"}"}} correctly, rather than bailing out.

JSON grammar is simple enough that this feels tractable. (I wonder if there’s
research on “start anywhere” languages/parsers in general...)

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell
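
The “start anywhere” question is exactly the string-state problem Reynold raised
below: a brace-depth scanner is easy to get right from a known anchor such as
offset 0, but started at an arbitrary split point it has no way to initialise
its in-string flag. A minimal sketch of such a scanner (a hypothetical helper,
not an existing Spark or Hadoop API):

    /** Split a buffer of concatenated JSON objects into complete top-level
      * objects by tracking brace depth and, crucially, string state. Only
      * correct when scanning starts outside any string literal. */
    def splitTopLevel(input: String): Seq[String] = {
      val out = scala.collection.mutable.Buffer[String]()
      var depth = 0; var inString = false; var escaped = false; var start = -1
      for (i <- input.indices) {
        val c = input(i)
        if (inString) {
          if (escaped) escaped = false
          else if (c == '\\') escaped = true
          else if (c == '"') inString = false
        } else c match {
          case '"' => inString = true
          case '{' => if (depth == 0) start = i; depth += 1
          case '}' =>
            depth -= 1
            if (depth == 0) out += input.substring(start, i + 1)
          case _ =>
        }
      }
      out.toSeq
    }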


Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Joe Halliwell
I've raised the JSON-related ticket at
https://issues.apache.org/jira/browse/SPARK-7366.

@Ewan I think it would be great to support multiline CSV records too.
The motivation is very similar but my instinct is that little/nothing
of the implementation could be usefully shared, so it's better as a
separate ticket?

Cheers,
Joe



-- 
Best regards,
Joe




Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Emre Sevinc
You can check out the following library:

   https://github.com/alexholmes/json-mapreduce

--
Emre Sevinç




Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
I took a quick look at that implementation. I'm not sure if it actually
handles JSON correctly, because it attempts to find the first { starting
from a random point. However, that random point could be in the middle of a
string, and thus the first { might just be part of a string, rather than a
real JSON object starting position.
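
A two-line illustration of that failure mode, with hypothetical data:

    // A naive splitter seeks the first '{' after its split offset...
    val data = """{"note": "a stray { in text"}{"id": 1}"""
    val naiveStart = data.indexOf('{', 12) // 12: an arbitrary mid-buffer offset
    // ...but naiveStart lands on the '{' inside the string value of "note",
    // so parsing from there yields garbage, not a record boundary.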




Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Joe Halliwell
I think Reynold’s argument shows the impossibility of the general case.

But a “maximum object depth” hint could enable a new input format to do its job
both efficiently and correctly in the common case where the input is an array
of similarly structured objects! I’d certainly be interested in an
implementation along those lines.

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell
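
One plausible reading of how such a hint might be used (a heuristic sketch only,
nothing implemented in Spark): treat a '{' after a split point as a candidate
record start only if a balanced object closes from there without ever nesting
deeper than the hint. A '{' that actually sits inside a string will usually blow
the depth bound or never balance, so the scanner can reject it and try the next
candidate:

    /** Hypothetical helper: if a balanced top-level object starts at `from`
      * without exceeding `maxDepth`, return its end offset (exclusive).
      * String contents can still fool this in principle; hence a hint. */
    def boundedObjectEnd(buf: String, from: Int, maxDepth: Int): Option[Int] = {
      var depth = 0; var inString = false; var escaped = false; var i = from
      while (i < buf.length) {
        val c = buf(i)
        if (inString) {
          if (escaped) escaped = false
          else if (c == '\\') escaped = true
          else if (c == '"') inString = false
        } else c match {
          case '"' => inString = true
          case '{' => depth += 1; if (depth > maxDepth) return None
          case '}' =>
            depth -= 1
            if (depth == 0) return Some(i + 1)
            if (depth < 0) return None
          case _ =>
        }
        i += 1
      }
      None // ran off the buffer without closing: not a record start
    }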


Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
I was wondering if it's possible to use existing Hive SerDes for this?


Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Paul Brown
It's not JSON, per se, but data formats like smile (
http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide
support for markers that can't be confused with content and also provide
reasonably similar ergonomics.
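
A minimal sketch of that with Jackson's Smile backend (assuming the
jackson-dataformat-smile dependency; under the default 7-bit binary encoding,
the optional 0xFF end marker should not occur inside encoded content, which is
what would let a splitter resynchronise on it):

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.dataformat.smile.{SmileFactory, SmileGenerator}

    val factory = new SmileFactory()
    // Write an explicit 0xFF end-of-content marker after each value.
    factory.enable(SmileGenerator.Feature.WRITE_END_MARKER)
    val mapper = new ObjectMapper(factory)
    val bytes = mapper.writeValueAsBytes(
      java.util.Collections.singletonMap("id", Integer.valueOf(1)))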

—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/



Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
Joe - I think that's a legit and useful thing to do. Do you want to give it
a shot?



Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator 
for JSON objects, such as two blank lines.

Matei
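
A custom separator is already expressible with stock Hadoop APIs: in Hadoop 2.x,
TextInputFormat honours the textinputformat.record.delimiter setting, so each
delimited chunk arrives as one record. A sketch under those assumptions (the
path is hypothetical; sc and sqlContext as in a Spark 1.x shell):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // One blank line between records; the two-blank-line variant is "\n\n\n".
    conf.set("textinputformat.record.delimiter", "\n\n")

    val records = sc.newAPIHadoopFile(
        "hdfs:///data/objects.json", // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)

    val df = sqlContext.jsonRDD(records) // one multi-line object per record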




Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
@joe, I'd be glad to help if you need.




Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
How does the pivotal format decide where to split the files? It seems to
me the challenge is deciding that, and off the top of my head the only way
to do it is to scan from the beginning and parse the JSON properly, which
makes it impractical for large files (though doable when the input is a lot
of small files). If there is a better way, we should do it.





Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
I'll try to study that and get back to you.
Regards,

Olivier.





Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
Hi everyone,
Is there any way in Spark SQL to load multi-line JSON data efficiently? I
think there was a reference on the mailing list to
http://pivotal-field-engineering.github.io/pmr-common/ for its
JSONInputFormat.

But it's rather inaccessible considering the dependency is not available in
any public Maven repo (if you know of one, I'd be glad to hear it).

Is there any plan to address this, or any public recommendation?
(The documentation clearly states that sqlContext.jsonFile will not work
for multi-line JSON.)

Regards,

Olivier.
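
Until something better lands, one workaround with stock Spark 1.x APIs is to
read each file whole and hand the strings to jsonRDD. A sketch (hypothetical
path); note that each file must fit in memory on one executor, so this does
not help with single enormous files:

    // Each (path, content) pair carries a whole file, so multi-line objects
    // survive intact; jsonRDD then parses one record per file.
    val whole = sc.wholeTextFiles("hdfs:///data/multiline/*.json").map(_._2)
    val df = sqlContext.jsonRDD(whole)
    df.printSchema()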