How does the Pivotal format decide where to split the files? It seems to me that deciding the split points is the real challenge, and off the top of my head the only way to do it is to scan from the beginning and parse the JSON properly, which makes it impractical for large files (though it is doable with whole-file input when you have a lot of small files). If there is a better way, we should do it.
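For reference, here is a minimal sketch of the whole-file workaround mentioned above, using the Spark 1.x API current at the time of this thread (the path and app name are placeholders). wholeTextFiles yields one (path, content) record per file, so a pretty-printed JSON document is never split mid-record, and jsonRDD then parses each element as a standalone document. This only works when every individual file fits comfortably in an executor's memory:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object MultiLineJson {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("multiline-json"))
        val sqlContext = new SQLContext(sc)

        // One RDD element per *file* instead of per line, so multi-line
        // JSON documents stay intact.
        val docs = sc.wholeTextFiles("hdfs:///data/json/").values

        // jsonRDD parses each string element as one JSON document.
        val df = sqlContext.jsonRDD(docs)
        df.printSchema()
      }
    }

The catch is exactly the splitting problem: wholeTextFiles gives you no intra-file parallelism, so one huge file still ends up on a single executor.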
On Sun, May 3, 2015 at 1:04 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> Is there any way in Spark SQL to load multi-line JSON data efficiently?
> I think there was in the mailing list a reference to
> http://pivotal-field-engineering.github.io/pmr-common/ for its
> JSONInputFormat.
>
> But it's rather inaccessible, considering the dependency is not available
> in any public maven repo (if you know of one, I'd be glad to hear it).
>
> Is there any plan to address this, or any public recommendation?
> (Considering the documentation clearly states that sqlContext.jsonFile
> will not work for multi-line JSONs.)
>
> Regards,
>
> Olivier.