@reynold, I’ll raise a JIRA today. @olivier, let’s discuss on the ticket?
I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads. If possible, I think we should handle pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing out. JSON grammar is simple enough that this feels tractable. (I wonder if there’s research on “start anywhere” languages/parsers in general...)

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell

On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> @joe, I'd be glad to help if you need.
>
> On Mon, May 4, 2015 at 20:06, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> I don't know whether this is common, but we might also allow another
>> separator for JSON objects, such as two blank lines.
>>
>> Matei
>>
>> > On May 4, 2015, at 2:28 PM, Reynold Xin <r...@databricks.com> wrote:
>> >
>> > Joe - I think that's a legit and useful thing to do. Do you want to give
>> > it a shot?
>> >
>> > On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell <joe.halliw...@gmail.com> wrote:
>> >
>> >> I think Reynold’s argument shows the impossibility of the general case.
>> >>
>> >> But a “maximum object depth” hint could enable a new input format to do
>> >> its job both efficiently and correctly in the common case where the input
>> >> is an array of similarly structured objects! I’d certainly be interested in
>> >> an implementation along those lines.
>> >>
>> >> Cheers,
>> >> Joe
>> >>
>> >> http://www.joehalliwell.com
>> >> @joehalliwell
>> >>
>> >> On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>
>> >>> I took a quick look at that implementation. I'm not sure if it actually
>> >>> handles JSON correctly, because it attempts to find the first { starting
>> >>> from a random point. However, that random point could be in the middle of
>> >>> a string, and thus the first { might just be part of a string, rather than
>> >>> a real JSON object starting position.
>> >>>
>> >>> On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sev...@gmail.com> wrote:
>> >>>
>> >>>> You can check out the following library:
>> >>>>
>> >>>> https://github.com/alexholmes/json-mapreduce
>> >>>>
>> >>>> --
>> >>>> Emre Sevinç
>> >>>>
>> >>>> On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>> >>>>
>> >>>>> Hi everyone,
>> >>>>> Is there any way in Spark SQL to load multi-line JSON data efficiently, I
>> >>>>> think there was in the mailing list a reference to
>> >>>>> http://pivotal-field-engineering.github.io/pmr-common/ for its
>> >>>>> JSONInputFormat
>> >>>>>
>> >>>>> But it's rather inaccessible considering the dependency is not available in
>> >>>>> any public maven repo (If you know of one, I'd be glad to hear it).
>> >>>>>
>> >>>>> Is there any plan to address this or any public recommendation ?
>> >>>>> (considering the documentation clearly states that sqlContext.jsonFile will
>> >>>>> not work for multi-line json(s))
>> >>>>>
>> >>>>> Regards,
>> >>>>>
>> >>>>> Olivier.
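
To make the ambiguity discussed in this thread concrete, here is a minimal Python sketch. It is not taken from json-mapreduce or from Spark; the names `split_top_level_objects` and `naive_first_brace` and the sample data are made up for illustration. It assumes the input is a sequence of concatenated top-level JSON objects, and it contrasts a scan from the start of the input, which knows whether it is inside a string, with a "find the first {" search from an arbitrary offset, which does not.

```python
import json
from typing import Iterator


def split_top_level_objects(text: str) -> Iterator[str]:
    """Yield each top-level {...} span, scanning from the start of the text.

    Because the scan starts at offset 0, it always knows whether the current
    character is inside a JSON string, so braces in strings are ignored.
    """
    depth = 0          # current nesting depth of { }
    in_string = False  # True while between an opening and closing quote
    escaped = False    # True right after a backslash inside a string
    start = None       # index of the { that opened the current object
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0 and start is not None:
                yield text[start:i + 1]
                start = None


def naive_first_brace(text: str, offset: int) -> int:
    """Return the index of the first { at or after offset, ignoring context."""
    return text.find("{", offset)


# Two multi-line objects whose string values contain braces (hypothetical data).
sample = '{"a": "x{y",\n "b": {"c": 1}}\n{"a": "}",\n "b": {"c": 2}}\n'

records = [json.loads(r) for r in split_top_level_objects(sample)]

# Starting inside the string "x{y", the naive search lands on a brace that
# is string content, not the start of an object.
fooled_at = naive_first_brace(sample, sample.index("x"))
```

The full scan recovers both records, while the naive search started mid-string returns the position of a brace that belongs to a string value. A reader that begins at an arbitrary split boundary lacks the string and depth state the full scan carries, which is the gap that a depth hint or an explicit record separator would have to cover.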