Re: Multi-Line JSON in SparkSQL
FWIW, CSV has the same problem that renders it immune to naive partitioning. Consider the following RFC 4180-compliant record:

1,2,"all,of,these,are,just,one,field",4,5

Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this nearer the replication level? XML, JSON, and CSV are so pervasive that it almost seems like it could be appropriate -if- enormous JSON files are considered enough of an issue that some basic ETL becomes a non-viable solution.

-Ewan
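A small illustration in the spirit of Ewan's record (Scala standard library only; the newline inside the quoted field is added here to make the splitting problem concrete): an RFC 4180 quoted field may contain commas and even line breaks, so a reader that starts at an arbitrary newline cannot tell whether it is inside a record.

    // Illustrative only: a single RFC 4180 record whose quoted third field
    // contains both commas and a newline.
    val record = "1,2,\"all,of,these,are,\njust,one,field\",4,5"

    // A naive partitioner that treats every newline as a record boundary
    // produces two fragments, neither of which is a well-formed record.
    val fragments = record.split("\n")
    fragments.foreach(println)
    // 1,2,"all,of,these,are,
    // just,one,field",4,5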
Re: Multi-Line JSON in SparkSQL
@reynold, I'll raise a JIRA today. @oliver, let's discuss on the ticket? I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads.

If possible, I think we should handle pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing out. The JSON grammar is simple enough that this feels tractable. (I wonder if there's research on "start anywhere" languages/parsers in general...)

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell
Re: Multi-Line JSON in SparkSQL
I've raised the JSON-related ticket at https://issues.apache.org/jira/browse/SPARK-7366.

@Ewan I think it would be great to support multi-line CSV records too. The motivation is very similar, but my instinct is that little or nothing of the implementation could be usefully shared, so it's better as a separate ticket?

Cheers,
Joe
Re: Multi-Line JSON in SparkSQL
You can check out the following library: https://github.com/alexholmes/json-mapreduce

--
Emre Sevinç
Re: Multi-Line JSON in SparkSQL
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string, rather than a real JSON object starting position.
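A small, purely illustrative Scala REPL sketch of the failure mode Reynold describes (the document text is made up): the first { found after an arbitrary split offset can sit inside a JSON string literal, so treating it as the start of a record is wrong.

    // A JSON document whose string value happens to contain a '{'.
    val doc = """{"name": "a value with a { in it", "id": 42}"""

    // Pretend a split starts at byte offset 10 and the reader seeks forward
    // to the first '{' to find "the next record".
    val splitStart = 10
    val firstBrace = doc.indexOf('{', splitStart)
    println(doc.substring(firstBrace))
    // { in it", "id": 42}   <- not a JSON object at all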
Re: Multi-Line JSON in SparkSQL
I think Reynold's argument shows the impossibility of the general case. But a "maximum object depth" hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I'd certainly be interested in an implementation along those lines.

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell
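A rough sketch of the kind of resynchronisation such a hint might enable (an illustration of the idea only, not the proposed implementation): if records are known to be objects sitting directly inside a top-level array, a reader starting mid-split can look for a closing brace followed by a comma and an opening brace, treat that as a candidate record boundary, and then validate by actually parsing the candidate record.

    // Illustrative heuristic only: find the next plausible record boundary
    // ("}" , optional whitespace, "," , optional whitespace, "{") after `from`.
    // It can still be fooled by string contents, which is why the thread
    // discusses depth hints and pathological cases.
    def candidateBoundary(buf: String, from: Int): Int = {
      val boundary = """\}\s*,\s*\{""".r
      boundary.findFirstMatchIn(buf.drop(from))
        .map(m => from + m.start + 1)   // position just after the closing '}'
        .getOrElse(-1)
    }

    val chunk = """{"a": 1}, {"a": 2}, {"a": 3}"""
    println(candidateBoundary(chunk, 3))  // 8: the offset right after the first record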
Re: Multi-Line JSON in SparkSQL
I was wondering if it's possible to use existing Hive SerDes for this?
Re: Multi-Line JSON in SparkSQL
It's not JSON per se, but data formats like Smile (http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide markers that can't be confused with content, while offering reasonably similar ergonomics.

--
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
Re: Multi-Line JSON in SparkSQL
Joe - I think that's a legit and useful thing to do. Do you want to give it a shot?
Re: Multi-Line JSON in SparkSQL
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines.

Matei
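For what it's worth, something close to this can already be approximated with the stock Hadoop text input format, which honours a custom record delimiter on Hadoop 2.x. A sketch for a Spark 1.x shell session, assuming sc and sqlContext are in scope and the path is hypothetical; the delimiter must never occur inside an object, and each object still has to fit in one task's memory.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // One JSON object per record, records separated by a blank line ("\n\n");
    // Matei's "two blank lines" would be "\n\n\n".
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "\n\n")

    val objects = sc.newAPIHadoopFile(
        "/path/to/multiline.json",            // hypothetical path
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString.trim)
      .filter(_.nonEmpty)

    val df = sqlContext.jsonRDD(objects)      // Spark 1.x API, as used in this thread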
Re: Multi-Line JSON in SparkSQL
@joe, I'd be glad to help if you need.
Re: Multi-Line JSON in SparkSQL
How does the Pivotal format decide where to split the files? It seems to me the challenge is deciding that, and off the top of my head the only way to do it is to scan from the beginning and parse the JSON properly, which makes it impractical for large files (though it's doable when the input is a lot of small files). If there is a better way, we should do it.
Re: Multi-Line JSON in SparkSQL
I'll try to study that and get back to you.

Regards,
Olivier.
Multi-Line JSON in SparkSQL
Hi everyone,

Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat, but it's rather inaccessible considering the dependency is not available in any public Maven repo (if you know of one, I'd be glad to hear it).

Is there any plan to address this, or any public recommendation? (The documentation clearly states that sqlContext.jsonFile will not work for multi-line JSON.)

Regards,
Olivier.
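For context, the workaround commonly used at the time was to read each file whole and hand the resulting strings to the JSON reader. A sketch for a Spark 1.x shell, assuming one JSON document per file, files small enough to fit in memory, and a hypothetical path:

    // Each element of wholeTextFiles is (path, fileContents); keep the contents.
    val docs = sc.wholeTextFiles("/path/to/json-dir").map(_._2)

    // jsonRDD parses one complete JSON document per RDD element, so documents
    // spanning multiple lines are fine; jsonFile, by contrast, expects one
    // object per line.
    val df = sqlContext.jsonRDD(docs)
    df.printSchema()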