Re: About Error while reading large JSON file in Spark
On 18 Oct 2016, at 10:58, Chetan Khatri wrote:

> Dear Xi Shen,
>
> Thank you for getting back to the question. The approach I am following is as below. I have MSSQL Server as the enterprise data lake.
>
> 1. Run Java jobs and generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a separate line, so I did
>
>     sed -e 's/}/}\n/g' -s old-file.json > new-file.json
>
> to get every JSON element on a separate line.
>
> 2. Uploaded them to an S3 bucket and read from there using the sqlContext.read.json() function, which is where I get the above error.
>
> Note: If I run on small files whose JSON elements have almost the same structure, I do not get this error.
>
> Current approach: splitting the large JSON (6 GB) into 1 GB pieces, then processing those.
>
> Note: Machine size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

I see what you are trying to do here: one JSON object per line, then splitting by line so that you can parallelise JSON processing, as well as holding many JSON objects in a single S3 file.

This is a devious little trick. It just doesn't work once a JSON document grows past 2^31 bytes, as the code that splits the input by line breaks down.

You could write your own input splitter which actually does basic JSON parsing, splitting the input by looking for the final } of each JSON clause. That is harder than you think: you need to remember how many {} clauses you have entered, and you must not count escaped "{" characters inside strings. A quick Google shows some projects that may be a good starting point:

https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce
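For illustration, here is a minimal, untested sketch of that brace-counting scan in Scala -- just the record-boundary logic, not a full Hadoop InputFormat (the JsonInputFormat linked above does the real splitter work). It assumes the input is a stream of concatenated top-level JSON objects:

import java.io.Reader

// Scan forward and return the next complete top-level JSON object,
// tracking {} depth and ignoring braces that appear inside strings
// (including braces and quotes behind a backslash escape).
def nextJsonObject(in: Reader): Option[String] = {
  val sb = new StringBuilder
  var depth = 0
  var inString = false
  var escaped = false
  var started = false
  var c = in.read()
  while (c != -1) {
    val ch = c.toChar
    if (started || ch == '{') {
      started = true
      sb.append(ch)
      if (escaped) escaped = false                   // this char was escaped; ignore it
      else if (inString) ch match {
        case '\\' => escaped = true                  // next char is escaped
        case '"'  => inString = false                // string literal closed
        case _    =>
      }
      else ch match {
        case '"' => inString = true                  // string literal opened
        case '{' => depth += 1                       // entered a {} clause
        case '}' =>
          depth -= 1                                 // left a {} clause
          if (depth == 0) return Some(sb.toString)   // top-level object complete
        case _   =>
      }
    }
    c = in.read()
  }
  None // stream ended before a complete object
}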
Re: About Error while reading large JSON file in Spark
On 18 Oct 2016, at 08:43, Chetan Khatri wrote:

> Hello Community members,
>
> I am getting an error while reading a large JSON file in Spark.
>
> Code:
>
> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> Error:
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for the same?
>
> Thanks in Advance!

The underlying read code can't handle more than 2^31 bytes in a single line:

  if (bytesConsumed > Integer.MAX_VALUE) {
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  }

That's because it's trying to split the work up by line, and of course, there aren't any lines. You need to move over to reading the JSON by other means, I'm afraid. At a guess, something involving SparkContext.binaryFiles() streaming the data straight into a JSON parser.

Unrelated, but use s3a:// if you can. It's better, you know.
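To sketch what that binaryFiles() route could look like (an untested illustration, not a drop-in fix: binaryFiles() still yields one record per file, so a 6 GB file is read by a single task -- it just never goes near the line splitter; Jackson is already on Spark's classpath):

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import scala.collection.JavaConverters._

// One (path, stream) pair per file; nothing attempts to split by line.
val jsonRecords = sc.binaryFiles("s3n://hist-ngdp/lvisitor/")
  .flatMap { case (_, portableStream) =>
    val mapper = new ObjectMapper()   // built inside the task, so nothing is serialised
    val parser = mapper.getFactory.createParser(portableStream.open())
    // Stream successive top-level objects out of the file,
    // re-emitting each one as a compact single-line JSON string.
    parser.readValuesAs(classOf[JsonNode]).asScala.map(_.toString)
  }

// The records are now one object per line, so the ordinary reader
// can infer a schema from them.
val landingVisitor = sqlContext.read.json(jsonRecords)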
Re: About Error while reading large JSON file in Spark
Dear Xi Shen,

Thank you for getting back to the question. The approach I am following is as below. I have MSSQL Server as the enterprise data lake.

1. Run Java jobs and generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a separate line, so I did

    sed -e 's/}/}\n/g' -s old-file.json > new-file.json

to get every JSON element on a separate line.

2. Uploaded them to an S3 bucket and read from there using the sqlContext.read.json() function, which is where I get the above error.

Note: If I run on small files whose JSON elements have almost the same structure, I do not get this error.

Current approach: splitting the large JSON (6 GB) into 1 GB pieces, then processing those.

Note: Machine size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

Thanks.

On Tue, Oct 18, 2016 at 2:50 PM, Xi Shen wrote:

> It is a plain Java IO error: your line is too long. You should alter your
> JSON schema so that each line is a small JSON object.
>
> Please do not concatenate all the objects into an array and then write the
> array on a single line. You would have difficulty handling such a super-large
> JSON array in Spark anyway: because one array is one object, it cannot be
> split into multiple partitions.
>
> On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri wrote:
>
>> Hello Community members,
>>
>> I am getting an error while reading a large JSON file in Spark.
>>
>> Code:
>>
>> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>>
>> Error:
>>
>> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
>> java.io.IOException: Too many bytes before newline: 2147483648
>> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
>> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>>
>> What would be the resolution for the same?
>>
>> Thanks in Advance!
>>
>> --
>> Yours Aye,
>> Chetan Khatri.
>
> --
> Thanks,
> David S.

--
Yours Aye,
Chetan Khatri.
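P.S. One caveat with the sed trick that I am not sure about: s/}/}\n/g inserts a newline after every "}", including nested ones, so an element like {"a":{"b":1}} comes out split across two lines. A quick, untested sanity check over the converted file (the S3 path here is hypothetical) would be:

import com.fasterxml.jackson.databind.ObjectMapper

// Count lines that are not a complete JSON document on their own.
// readTree throws on fragments such as {"a":{"b":1  or a bare }
val badLineCount = sc.textFile("s3n://hist-ngdp/lvisitor/new-file.json")
  .mapPartitions { lines =>
    val mapper = new ObjectMapper() // one mapper per partition, not per line
    lines.filter { line =>
      line.trim.nonEmpty && {
        try { mapper.readTree(line); false }
        catch { case _: Exception => true }
      }
    }
  }
  .count()

println(s"lines that do not parse as JSON on their own: $badLineCount")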
Re: About Error while reading large JSON file in Spark
It is a plain Java IO error: your line is too long. You should alter your JSON schema so that each line is a small JSON object.

Please do not concatenate all the objects into an array and then write the array on a single line. You would have difficulty handling such a super-large JSON array in Spark anyway: because one array is one object, it cannot be split into multiple partitions.

On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri wrote:

> Hello Community members,
>
> I am getting an error while reading a large JSON file in Spark.
>
> Code:
>
> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> Error:
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for the same?
>
> Thanks in Advance!
>
> --
> Yours Aye,
> Chetan Khatri.

--
Thanks,
David S.
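P.S. To make the expected layout concrete -- the field names here are invented for illustration -- this is what the line-based JSON reader wants, one complete object per line:

{"visitorId": 1, "page": "/home"}
{"visitorId": 2, "page": "/cart"}

and this is the shape that breaks once the single line passes 2^31 bytes, because the whole array is one object on one line and cannot be split across partitions:

[{"visitorId": 1, "page": "/home"}, {"visitorId": 2, "page": "/cart"}]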
About Error while reading large JSON file in Spark
Hello Community members,

I am getting an error while reading a large JSON file in Spark.

Code:

val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")

Error:

16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)

What would be the resolution for the same?

Thanks in Advance!

--
Yours Aye,
Chetan Khatri.