FWIW, CSV has the same problem, which makes it immune to naive partitioning.

Consider the following RFC 4180-compliant record:

1,2,"
all,of,these,are,just,one,field
",4,5

Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this nearer the replication level? XML, JSON, and CSV are so pervasive that it almost seems appropriate -if- enormous JSON files are considered enough of an issue that basic ETL is no longer a viable solution.

-Ewan

On 05/05/15 09:37, Joe Halliwell wrote:
@reynold, I’ll raise a JIRA today. @oliver, let’s discuss on the ticket?

I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads. If possible, I think we should handle pathological cases like {“:”:”:”,{”{”:”}”}} correctly, rather than bailing out.

The JSON grammar is simple enough that this feels tractable. (I wonder if there’s research on “start anywhere” languages/parsers in general...)
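
For reference, the per-object scan itself is not the hard part. Here is a rough sketch of a quote-aware, depth-tracking scan; it assumes the reader is already positioned at the top level, and finding that safe starting position is exactly the fiddly alignment problem above.

// Rough sketch only: emits each complete top-level object from `input`,
// tracking string/escape state so braces inside strings are ignored.
def topLevelObjects(input: String): Seq[String] = {
  val out = Seq.newBuilder[String]
  var depth = 0
  var inString = false
  var escaped = false
  var start = -1
  for ((c, i) <- input.zipWithIndex) {
    if (inString) {
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"' => inString = true
      case '{' => if (depth == 0) start = i; depth += 1
      case '}' => depth -= 1; if (depth == 0) out += input.substring(start, i + 1)
      case _   => // commas, whitespace, array brackets between objects are skipped
    }
  }
  out.result()
}

// topLevelObjects("""[{"a":1}, {"b":"} not a boundary"}]""")
// => Seq({"a":1}, {"b":"} not a boundary"})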

Cheers,

Joe


http://www.joehalliwell.com

@joehalliwell

On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot
<o.girar...@lateral-thoughts.com> wrote:

@joe, I'd be glad to help if you need.
On Mon, May 4, 2015 at 20:06, Matei Zaharia <matei.zaha...@gmail.com> wrote:
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines.
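
For what it's worth, if the objects really were separated by a fixed byte sequence like a blank line, the existing Hadoop text machinery can already split on it. A rough spark-shell sketch, assuming "\n\n" between objects and a hypothetical path objects.json:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Break records on a blank line instead of "\n"; the reader aligns these
// records across block splits the same way it does for ordinary lines.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n\n")

val objects = sc.newAPIHadoopFile(
    "objects.json",                            // hypothetical path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, text) => text.toString }     // copy out: Hadoop reuses the Text object

// Each element should now be one (possibly multi-line) JSON object.
val df = sqlContext.jsonRDD(objects)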

Matei

On May 4, 2015, at 2:28 PM, Reynold Xin <r...@databricks.com> wrote:

Joe - I think that's a legit and useful thing to do. Do you want to give it a shot?

On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell <joe.halliw...@gmail.com>
wrote:

I think Reynold’s argument shows the impossibility of the general case.

But a “maximum object depth” hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I’d certainly be interested in an implementation along those lines.

Cheers,
Joe

http://www.joehalliwell.com
@joehalliwell


On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <r...@databricks.com>
wrote:
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string, rather than a real JSON object starting position.
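
A concrete illustration (the split offset is made up):

val json = """{"note": "braces like { and } may appear inside strings", "id": 1}"""
val splitOffset = 10                      // hypothetical split point, inside the string value
val firstBrace = json.indexOf('{', splitOffset)
// firstBrace lands on the '{' inside the string literal, so parsing from
// there yields garbage rather than a record boundary:
println(json.substring(firstBrace))       // { and } may appear inside strings", "id": 1}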


On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sev...@gmail.com>
wrote:

You can check out the following library:

https://github.com/alexholmes/json-mapreduce

--
Emre Sevinç


On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

Hi everyone,
Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat.

But it's rather inaccessible, considering the dependency is not available in any public Maven repo (if you know of one, I'd be glad to hear it).

Is there any plan to address this, or any public recommendation? (The documentation clearly states that sqlContext.jsonFile will not work for multi-line JSON.)
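
(For what it's worth, a common stop-gap is to read each file whole so multi-line documents stay intact; only workable when individual files fit comfortably in memory, and assuming one JSON document per file:)

// Workaround sketch (spark-shell): wholeTextFiles yields (path, content) pairs,
// so each multi-line JSON document arrives as a single string. The path glob
// and the one-document-per-file layout are assumptions.
val whole = sc.wholeTextFiles("hdfs:///data/multiline/*.json")
val df = sqlContext.jsonRDD(whole.map { case (_, content) => content })
df.printSchema()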

Regards,

Olivier.



