Re: Multi-Line JSON in SparkSQL

Joe Halliwell Tue, 05 May 2015 04:15:07 -0700

I've raised the JSON-related ticket at
https://issues.apache.org/jira/browse/SPARK-7366.
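
For anyone picking up the ticket, here is a minimal Python sketch of the failure mode Reynold describes in the quoted thread below. It is an illustration only, not the json-mapreduce code or a proposed fix. The sample records and the helper names (naive_object_start, inside_string) are made up for this example.

```python
import json

# Hypothetical sample data: a pretty-printed (multi-line) array of records.
records = [
    {"id": 1, "note": "braces {} inside a string"},
    {"id": 2, "note": "plain"},
]
data = json.dumps(records, indent=2)


def naive_object_start(text, offset):
    """Naive heuristic: the first '{' at or after offset starts an object."""
    return text.find("{", offset)


def inside_string(text, index):
    """Return True if text[index] lies inside a JSON string literal.

    This scans from the very beginning of the text, which is exactly what
    an input split starting at an arbitrary offset cannot do cheaply.
    """
    in_string = False
    escaped = False
    for ch in text[:index]:
        if escaped:
            escaped = False          # character after a backslash is literal
        elif ch == "\\":
            escaped = in_string      # backslashes only matter inside strings
        elif ch == '"':
            in_string = not in_string
    return in_string


# Pretend a split boundary lands in the middle of the first "note" value.
split_offset = data.index("braces")
candidate = naive_object_start(data, split_offset)

# The candidate parses without error, yet it is not a real record.
bogus, _ = json.JSONDecoder().raw_decode(data, candidate)
print(bogus)                                    # the "record" the naive reader sees
print(inside_string(data, candidate))           # candidate is inside a string
print(inside_string(data, data.find("{")))      # the real first object is not
```

The awkward part is that the bogus candidate parses cleanly, so a reader cannot rely on parse errors alone to reject a bad starting position.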


@Ewan I think it would be great to support multiline CSV records too.
The motivation is very similar but my instinct is that little/nothing
of the implementation could be usefully shared, so it's better as a
separate ticket?
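
To make the CSV case concrete, here is a small sketch using only Python's standard csv module and the record from Ewan's message quoted below. The variable names are mine, and this says nothing about how Spark or HDFS would actually split the file.

```python
import csv
import io

# The RFC 4180 record from Ewan's message: the third field spans three lines.
raw = '1,2,"\nall,of,these,are,just,one,field\n",4,5\n'

# Naive partitioning on newlines sees three separate "records".
naive_lines = raw.splitlines()

# A quote-aware parser sees one record with five fields.
rows = list(csv.reader(io.StringIO(raw, newline="")))

# Worse, the middle line on its own looks like a plausible 7-field record,
# so a reader starting there has no local way to tell it is mid-field.
middle_alone = list(csv.reader([naive_lines[1]]))

print(len(naive_lines))      # number of physical lines
print(len(rows), len(rows[0]))
print(len(middle_alone[0]))
```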

Cheers,
Joe

On 5 May 2015 at 08:51, Ewan Higgs <ewan.hi...@ugent.be> wrote:
> FWIW, CSV has the same problem, which makes it unsafe to partition naively.
>
> Consider the following RFC 4180 compliant record:
>
> 1,2,"
> all,of,these,are,just,one,field
> ",4,5
>
> Now, it's probably a terrible idea to give a file system awareness of actual
> file types, but couldn't HDFS handle this nearer the replication level? XML,
> JSON, and CSV are so pervasive it almost seems like it could be appropriate
> -if- enormous JSON files are considered enough of an issue that some basic
> ETL becomes a non-viable solution.
>
> -Ewan
>
> On 05/05/15 09:37, Joe Halliwell wrote:
>>
>> @reynold, I’ll raise a JIRA today. @olivier, let’s discuss on the ticket?
>>
>> I suspect the algorithm is going to be a bit fiddly and would definitely
>> benefit from multiple heads. If possible, I think we should handle
>> pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing
>> out.
>>
>> JSON grammar is simple enough that this feels tractable. (I wonder if
>> there’s research on “start anywhere” languages/parsers in general...)
>>
>> Cheers,
>>
>> Joe
>>
>>
>> http://www.joehalliwell.com
>>
>> @joehalliwell
>>
>> On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot
>> <o.girar...@lateral-thoughts.com> wrote:
>>
>>> @joe, I'd be glad to help if you need it.
>>> On Mon, 4 May 2015 at 20:06, Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>>>
>>>> I don't know whether this is common, but we might also allow another
>>>> separator for JSON objects, such as two blank lines.
>>>>
>>>> Matei
>>>>
>>>>> On May 4, 2015, at 2:28 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>> Joe - I think that's a legit and useful thing to do. Do you want to
>>>>> give it a shot?
>>>>>
>>>>> On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell
>>>>> <joe.halliw...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think Reynold’s argument shows the impossibility of the general
>>>>>> case.
>>>>>>
>>>>>> But a “maximum object depth” hint could enable a new input format to
>>>>>> do its job both efficiently and correctly in the common case where the
>>>>>> input is an array of similarly structured objects! I’d certainly be
>>>>>> interested in an implementation along those lines.
>>>>>>
>>>>>> Cheers,
>>>>>> Joe
>>>>>>
>>>>>> http://www.joehalliwell.com
>>>>>> @joehalliwell
>>>>>>
>>>>>>
>>>>>> On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I took a quick look at that implementation. I'm not sure if it
>>>>>>> actually handles JSON correctly, because it attempts to find the
>>>>>>> first { starting from a random point. However, that random point
>>>>>>> could be in the middle of a string, and thus the first { might just
>>>>>>> be part of a string, rather than a real JSON object starting
>>>>>>> position.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sev...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You can check out the following library:
>>>>>>>>
>>>>>>>> https://github.com/alexholmes/json-mapreduce
>>>>>>>>
>>>>>>>> --
>>>>>>>> Emre Sevinç
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
>>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>> Is there any way in Spark SQL to load multi-line JSON data
>>>>>>>>> efficiently? I think there was a reference on the mailing list to
>>>>>>>>> http://pivotal-field-engineering.github.io/pmr-common/ for its
>>>>>>>>> JSONInputFormat.
>>>>>>>>>
>>>>>>>>> But it's rather inaccessible, considering the dependency is not
>>>>>>>>> available in any public Maven repo (if you know of one, I'd be glad
>>>>>>>>> to hear it).
>>>>>>>>>
>>>>>>>>> Is there any plan to address this, or any public recommendation?
>>>>>>>>> (The documentation clearly states that sqlContext.jsonFile will not
>>>>>>>>> work for multi-line JSON.)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Olivier.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Emre Sevinc
>>>>>>>>
>>>>>>
>>>>
>



-- 
Best regards,
Joe

