I've raised the JSON-related ticket at https://issues.apache.org/jira/browse/SPARK-7366.
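
For context, here is a rough Python sketch of the problem the ticket is about, namely Reynold's point quoted further down: the first { found after an arbitrary split offset may sit inside a string. The sample data and the helper names (naive_object_start, true_object_starts) are made up for illustration and are not taken from Spark or json-mapreduce.

```python
import json

# Hypothetical input: an array of two objects; the second has a brace inside a string.
data = '[{"id": 1, "note": "plain"}, {"id": 2, "note": "a { inside a string"}]'


def naive_object_start(text, offset):
    """Naive strategy: the first '{' at or after an arbitrary offset."""
    return text.find("{", offset)


def true_object_starts(text):
    """Scan from the very beginning, tracking string state, to find real '{' positions."""
    starts = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif ch == "\\":
                escaped = True       # the next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True         # opening quote
        elif ch == "{":
            starts.append(i)
    return starts


# Pretend a partition boundary falls just inside the second "note" string.
split_offset = data.index('"a {') + 1
naive_start = naive_object_start(data, split_offset)
real_starts = true_object_starts(data)

# The naive scan lands on a brace that is not the start of any object.
lands_on_real_object = naive_start in real_starts

# Parsing from that position fails here; in general the string could even
# contain text that parses, which would silently produce a bogus record.
try:
    json.JSONDecoder().raw_decode(data, naive_start)
    parsed_from_naive_start = True
except json.JSONDecodeError:
    parsed_from_naive_start = False
```

The stateful scan only gets the right answer because it starts at byte zero, which is exactly what a splittable input format cannot afford, hence the interest in hints such as a maximum object depth.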

@Ewan I think it would be great to support multiline CSV records too. The motivation is very similar, but my instinct is that little/nothing of the implementation could be usefully shared, so it's better as a separate ticket?

Cheers,
Joe

On 5 May 2015 at 08:51, Ewan Higgs <ewan.hi...@ugent.be> wrote:
> FWIW, CSV has the same problem that renders it immune to naive partitioning.
>
> Consider the following RFC 4180 compliant record:
>
> 1,2,"
> all,of,these,are,just,one,field
> ",4,5
>
> Now, it's probably a terrible idea to give a file system awareness of actual
> file types, but couldn't HDFS handle this nearer the replication level? XML,
> JSON, and CSV are so pervasive it almost seems like it could be appropriate
> -if- enormous JSON files are considered enough of an issue that some basic
> ETL becomes a non viable solution.
>
> -Ewan
>
> On 05/05/15 09:37, Joe Halliwell wrote:
>>
>> @reynold, I’ll raise a JIRA today. @oliver, let’s discuss on the ticket?
>>
>> I suspect the algorithm is going to be bit fiddly and would definitely
>> benefit from multiple heads. If possible, I think we should handle
>> pathological cases like {“:”:”:”,{”{”:”}”}} correctly, rather than bailing
>> out.
>>
>> JSON grammar is simple enough that this feels tractable. (I wonder if
>> there’s research on “start anywhere” languages/parsers in general...)
>>
>> Cheers,
>>
>> Joe
>>
>> http://www.joehalliwell.com
>>
>> @joehalliwell
>>
>> On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot
>> <o.girar...@lateral-thoughts.com> wrote:
>>
>>> @joe, I'd be glad to help if you need.
>>> On Mon, 4 May 2015 at 20:06, Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>>>
>>>> I don't know whether this is common, but we might also allow another
>>>> separator for JSON objects, such as two blank lines.
>>>>
>>>> Matei
>>>>
>>>>> On May 4, 2015, at 2:28 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>> Joe - I think that's a legit and useful thing to do. Do you want to
>>>>> give it a shot?
>>>>>
>>>>> On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell
>>>>> <joe.halliw...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think Reynold’s argument shows the impossibility of the general
>>>>>> case.
>>>>>>
>>>>>> But a “maximum object depth” hint could enable a new input format to
>>>>>> do its job both efficiently and correctly in the common case where the
>>>>>> input is an array of similarly structured objects! I’d certainly be
>>>>>> interested in an implementation along those lines.
>>>>>>
>>>>>> Cheers,
>>>>>> Joe
>>>>>>
>>>>>> http://www.joehalliwell.com
>>>>>> @joehalliwell
>>>>>>
>>>>>>
>>>>>> On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I took a quick look at that implementation. I'm not sure if it
>>>>>>> actually handles JSON correctly, because it attempts to find the
>>>>>>> first { starting from a random point. However, that random point
>>>>>>> could be in the middle of a string, and thus the first { might just
>>>>>>> be part of a string, rather than a real JSON object starting
>>>>>>> position.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sev...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You can check out the following library:
>>>>>>>>
>>>>>>>> https://github.com/alexholmes/json-mapreduce
>>>>>>>>
>>>>>>>> --
>>>>>>>> Emre Sevinç
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
>>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>> Is there any way in Spark SQL to load multi-line JSON data
>>>>>>>>> efficiently, I think there was in the mailing list a reference to
>>>>>>>>> http://pivotal-field-engineering.github.io/pmr-common/ for its
>>>>>>>>> JSONInputFormat
>>>>>>>>>
>>>>>>>>> But it's rather inaccessible considering the dependency is not
>>>>>>>>> available in any public maven repo (If you know of one, I'd be
>>>>>>>>> glad to hear it).
>>>>>>>>>
>>>>>>>>> Is there any plan to address this or any public recommendation ?
>>>>>>>>> (considering the documentation clearly states that
>>>>>>>>> sqlContext.jsonFile will not work for multi-line json(s))
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Olivier.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Emre Sevinc
>>>>>>>>

--
Best regards,
Joe