Re: Multi-Line JSON in SparkSQL

Joe Halliwell Tue, 05 May 2015 00:38:58 -0700

@reynold, I’ll raise a JIRA today. @olivier, let’s discuss on the ticket?





I suspect the algorithm is going to be a bit fiddly and would definitely benefit 
from multiple heads. If possible, I think we should handle pathological cases 
like {":":":",{"{":"}"}} correctly, rather than bailing out.
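To make the difficulty concrete, here is a minimal Python sketch (not a proposed implementation, and not what json-mapreduce does). It assumes we can scan from the very start of the input, which is exactly what a split-based input format cannot do. The names split_top_level_objects and sample are made up for illustration. Tracking string and escape state is what keeps braces inside strings from being miscounted; a naive search for the first { from an arbitrary offset has no such state.

```python
import json


def split_top_level_objects(text):
    """Yield each top-level JSON object in text as a substring.

    Assumption: scanning starts at the true beginning of the input,
    so we always know whether we are inside a string literal.
    """
    depth = 0          # current nesting depth of { ... }
    in_string = False  # True while inside a "..." literal
    escaped = False    # True right after a backslash inside a string
    start = None       # index where the current top-level object began
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]
                start = None


# Two objects spread over several lines, with braces hidden inside strings.
sample = '{"a": "{",\n "b": {"{": "}"}}\n{"c": ":"}'

records = [json.loads(obj) for obj in split_top_level_objects(sample)]

# The naive approach: take the first "{" found after an arbitrary offset.
# Starting at offset 1, it lands on the brace inside the string "{".
naive_start = sample.find("{", 1)
```

Starting mid-file, the scanner above would not know the initial value of in_string, which is the ambiguity Reynold pointed out.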




JSON grammar is simple enough that this feels tractable. (I wonder if there’s 
research on “start anywhere” languages/parsers in general...)




Cheers,

Joe


http://www.joehalliwell.com

@joehalliwell

On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot
<o.girar...@lateral-thoughts.com> wrote:

> @joe, I'd be glad to help if you need.
> On Mon, May 4, 2015 at 8:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> I don't know whether this is common, but we might also allow another
>> separator for JSON objects, such as two blank lines.
>>
>> Matei
>>
>> > On May 4, 2015, at 2:28 PM, Reynold Xin <r...@databricks.com> wrote:
>> >
>> > Joe - I think that's a legit and useful thing to do. Do you want to
>> > give it a shot?
>> >
>> > On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell <joe.halliw...@gmail.com>
>> > wrote:
>> >
>> >> I think Reynold’s argument shows the impossibility of the general case.
>> >>
>> >> But a “maximum object depth” hint could enable a new input format to do
>> >> its job both efficiently and correctly in the common case where the input
>> >> is an array of similarly structured objects! I’d certainly be interested in
>> >> an implementation along those lines.
>> >>
>> >> Cheers,
>> >> Joe
>> >>
>> >> http://www.joehalliwell.com
>> >> @joehalliwell
>> >>
>> >>
>> >> On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>
>> >>> I took a quick look at that implementation. I'm not sure if it actually
>> >>> handles JSON correctly, because it attempts to find the first { starting
>> >>> from a random point. However, that random point could be in the middle of
>> >>> a string, and thus the first { might just be part of a string, rather than
>> >>> a real JSON object starting position.
>> >>>
>> >>>
>> >>> On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sev...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> You can check out the following library:
>> >>>>
>> >>>> https://github.com/alexholmes/json-mapreduce
>> >>>>
>> >>>> --
>> >>>> Emre Sevinç
>> >>>>
>> >>>>
>> >>>> On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
>> >>>> o.girar...@lateral-thoughts.com> wrote:
>> >>>>
>> >>>>> Hi everyone,
>> >>>>> Is there any way in Spark SQL to load multi-line JSON data efficiently? I
>> >>>>> think there was in the mailing list a reference to
>> >>>>> http://pivotal-field-engineering.github.io/pmr-common/ for its
>> >>>>> JSONInputFormat
>> >>>>>
>> >>>>> But it's rather inaccessible considering the dependency is not available in
>> >>>>> any public maven repo (If you know of one, I'd be glad to hear it).
>> >>>>>
>> >>>>> Is there any plan to address this or any public recommendation ?
>> >>>>> (considering the documentation clearly states that sqlContext.jsonFile will
>> >>>>> not work for multi-line json(s))
>> >>>>>
>> >>>>> Regards,
>> >>>>>
>> >>>>> Olivier.
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Emre Sevinc
>> >>>>
>> >>>
>> >>
>> >>
>>
>>
