Dear Xi Shen,

Thank you for getting back on my question.

The approach I am following is as below.
I have an MSSQL Server as the enterprise data lake.

1. Run Java jobs that generate JSON files; each file is almost 6 GB.
*Spark needs every JSON object on a separate line*, so I ran

sed -e 's/}/}\n/g' old-file.json > new-file.json

to get every JSON element on a separate line.
2. Uploaded the files to an S3 bucket and read them from there using the
sqlContext.read.json() function, which is where I get the error above.
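As an aside, the sed one-liner in step 1 inserts a newline after *every* `}`, including the closing braces of nested objects, which can leave malformed records. A safer sketch, assuming jq is available and each input file is a top-level JSON array, converts it to newline-delimited JSON (one compact object per line, which is exactly what sqlContext.read.json expects):

```shell
# Convert a JSON array file into newline-delimited JSON (NDJSON):
# jq -c '.[]' prints each array element as one compact line.
jq -c '.[]' old-file.json > new-file.json
```

Caveat: plain jq loads the whole input into memory, so a 6 GB file may need jq's --stream mode or a machine with enough RAM.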

Note: If I run the same code on small files with almost the same JSON
structure, I do not get this error.
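That pattern fits the error text: "Too many bytes before newline: 2147483648" means Hadoop's LineReader hit a single line of 2 GB, i.e. the newline splitting did not fully take effect on the big file. A quick sanity check (a sketch, assuming a local copy of the file and jq installed) is:

```shell
# Report the longest line in the file; anything approaching
# 2147483648 bytes (2 GB) will trip Hadoop's LineRecordReader.
awk '{ if (length($0) > max) max = length($0) }
     END { print "longest line:", max, "bytes" }' new-file.json

# Confirm every line parses as standalone JSON; jq exits
# non-zero on the first malformed record.
jq '.' new-file.json > /dev/null && echo "every line is valid JSON"
```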

*Current approach:*

   - Split the large JSON file (6 GB) into 1 GB chunks, then process them.
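For that splitting step, one option (a sketch, assuming GNU split and a file that is already newline-delimited) is to let split cut only at line boundaries, so no JSON object is torn in half:

```shell
# -C 1G packs whole lines up to ~1 GB per output file, so no
# record is ever split mid-object; -d gives numeric suffixes
# (lvisitor-part-00, lvisitor-part-01, ...).
split -C 1G -d new-file.json lvisitor-part-
```

Spark can then read all the parts at once with a glob path such as sqlContext.read.json("s3n://.../lvisitor-part-*") (path hypothetical), and 1 GB pieces should partition comfortably on 26 GB nodes.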

Note: Cluster size is 1 master and 2 slaves, each with 4 vCores and 26 GB RAM.


Thanks.




On Tue, Oct 18, 2016 at 2:50 PM, Xi Shen <davidshe...@gmail.com> wrote:

> It is a plain Java IO error. Your line is too long. You should alter your
> JSON schema, so each line is a small JSON object.
>
> Please do not concatenate all the objects into an array and then write the
> array on one line. You will have difficulty handling such a super large
> JSON array in Spark anyway.
>
> Because one array is one object, it cannot be split into multiple
> partitions.
>
>
> On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri <ckhatriman...@gmail.com>
> wrote:
>
>> Hello Community members,
>>
>> I am getting error while reading large JSON file in spark,
>>
>> *Code:*
>>
>> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>>
>> *Error:*
>>
>> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
>> java.io.IOException: Too many bytes before newline: 2147483648
>> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
>> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>>
>> What would be the resolution for this?
>>
>> Thanks in Advance !
>>
>>
>> --
>> Yours Aye,
>> Chetan Khatri.
>>
>> --
>
>
> Thanks,
> David S.
>



-- 
Yours Aye,
Chetan Khatri.
M.+91 76666 80574
Data Science Researcher
INDIA

