FWIW, CSV has the same problem, which makes it immune to naive partitioning.

Consider the following RFC 4180-compliant record:

1,2,"
all,of,these,are,just,one,field
",4,5

Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this nearer the replication level? XML, JSON, and CSV are so pervasive that it almost seems appropriate -if- enormous JSON files are considered enough of an issue that basic ETL is no longer a viable solution.

-Ewan

On 05/05/15 09:37, Joe Halliwell wrote:
@reynold, I’ll raise a JIRA today. @oliver, let’s discuss on the ticket?

I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads. If possible, I think we should handle pathological cases like {“:”:”:”,{”{”:”}”}} correctly, rather than bailing out.

The JSON grammar is simple enough that this feels tractable. (I wonder if there’s research on “start anywhere” languages/parsers in general...)
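
For reference, the per-object scan itself is not the hard part. Here is a rough sketch of a quote-aware, depth-tracking scan; it assumes the reader is already positioned at the top level, and finding that safe starting position is exactly the fiddly alignment problem above.

// Rough sketch only: emits each complete top-level object from `input`,
// tracking string/escape state so braces inside strings are ignored.
def topLevelObjects(input: String): Seq[String] = {
  val out = Seq.newBuilder[String]
  var depth = 0
  var inString = false
  var escaped = false
  var start = -1
  for ((c, i) <- input.zipWithIndex) {
    if (inString) {
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"' => inString = true
      case '{' => if (depth == 0) start = i; depth += 1
      case '}' => depth -= 1; if (depth == 0) out += input.substring(start, i + 1)
      case _   => // commas, whitespace, array brackets between objects are skipped
    }
  }
  out.result()
}

// topLevelObjects("""[{"a":1}, {"b":"} not a boundary"}]""")
// => Seq({"a":1}, {"b":"} not a boundary"})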

Cheers,

Joe


http://www.joehalliwell.com

@joehalliwell

On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot
<o.girar...@lateral-thoughts.com> wrote:

@joe, I'd be glad to help if you need.
On Mon, May 4, 2015 at 20:06, Matei Zaharia <matei.zaha...@gmail.com> wrote:
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines.
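
For what it's worth, if the objects really were separated by a fixed byte sequence like a blank line, the existing Hadoop text machinery can already split on it. A rough spark-shell sketch, assuming "\n\n" between objects and a hypothetical path objects.json:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Break records on a blank line instead of "\n"; the reader aligns these
// records across block splits the same way it does for ordinary lines.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

val objects = sc.newAPIHadoopFile(
    "objects.json",                            // hypothetical path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }     // copy out: Hadoop reuses the Text object

// Each element should now be one (possibly multi-line) JSON object.
val df = sqlContext.jsonRDD(objects)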

Matei

On May 4, 2015, at 2:28 PM, Reynold Xin <r...@databricks.com> wrote:

Joe - I think that's a legit and useful thing to do. Do you want to give it a shot?

On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell <joe.halliw...@gmail.com>
wrote:

I think Reynold’s argument shows the impossibility of the general case.

But a “maximum object depth” hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I’d certainly be interested in an implementation along those lines.

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell


On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <r...@databricks.com>
wrote:
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string, rather than a real JSON object starting position.
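
A concrete illustration (the split offset is made up):

val json = """{"note": "braces like { and } may appear inside strings", "id": 1}"""
val splitOffset = 10                      // hypothetical split point, inside the string value
val firstBrace = json.indexOf('{', splitOffset)
// firstBrace lands on the '{' inside the string literal, so parsing from
// there yields garbage rather than a record boundary:
println(json.substring(firstBrace))       // { and } may appear inside strings", "id": 1}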


On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sev...@gmail.com>
wrote:

You can check out the following library:

https://github.com/alexholmes/json-mapreduce

--
Emre Sevinç


On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

Hi everyone,
Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat.

But it's rather inaccessible, considering the dependency is not available in any public Maven repo (if you know of one, I'd be glad to hear it).

Is there any plan to address this, or any public recommendation? (The documentation clearly states that sqlContext.jsonFile will not work for multi-line JSON.)
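
(For what it's worth, a common stop-gap is to read each file whole so multi-line documents stay intact; only workable when individual files fit comfortably in memory, and assuming one JSON document per file:)

// Workaround sketch (spark-shell): wholeTextFiles yields (path, content) pairs,
// so each multi-line JSON document arrives as a single string. The path glob
// and the one-document-per-file layout are assumptions.
val whole = sc.wholeTextFiles("hdfs:///data/multiline/*.json")
val df = sqlContext.jsonRDD(whole.map { case (_, content) => content })
df.printSchema()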

Regards,

Olivier.



