Re: About Error while reading large JSON file in Spark
On 18 Oct 2016, at 10:58, Chetan Khatri wrote:

> Dear Xi Shen,
>
> Thank you for getting back to the question. The approach I am following is as below. I have MSSQL Server as the enterprise data lake.
>
> 1. Run Java jobs and generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a separate line, so I did
>
>     sed -e 's/}/}\n/g' -s old-file.json > new-file.json
>
> to get every JSON element on a separate line.
>
> 2. Uploaded them to an S3 bucket and read from there using the sqlContext.read.json() function, which is where I get the above error.
>
> Note: If I run on small files whose JSON elements have almost the same structure, I do not get this error.
>
> Current approach: splitting the large JSON (6 GB) into 1 GB pieces, then processing those.
>
> Note: Machine size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

I see what you are trying to do here: one JSON object per line, then splitting by line so that you can parallelise JSON processing, as well as holding many JSON objects in a single S3 file.

This is a devious little trick. It just doesn't work once a JSON document grows past 2^31 bytes, as the code that splits the input by line breaks down.

You could write your own input splitter which actually does basic JSON parsing, splitting the input by looking for the final } of each JSON clause. That is harder than you think: you need to remember how many {} clauses you have entered, and you must not count escaped "{" characters inside strings. A quick Google shows some projects that may be a good starting point:

https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce
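For illustration, here is a minimal, untested sketch of that brace-counting scan in Scala -- just the record-boundary logic, not a full Hadoop InputFormat (the JsonInputFormat linked above does the real splitter work). It assumes the input is a stream of concatenated top-level JSON objects:

import java.io.Reader

// Scan forward and return the next complete top-level JSON object,
// tracking {} depth and ignoring braces that appear inside strings
// (including braces and quotes behind a backslash escape).
def nextJsonObject(in: Reader): Option[String] = {
  val sb = new StringBuilder
  var depth = 0
  var inString = false
  var escaped = false
  var started = false
  var c = in.read()
  while (c != -1) {
    val ch = c.toChar
    if (started || ch == '{') {
      started = true
      sb.append(ch)
      if (escaped) escaped = false                   // this char was escaped; ignore it
      else if (inString) ch match {
        case '\\' => escaped = true                  // next char is escaped
        case '"'  => inString = false                // string literal closed
        case _    =>
      }
      else ch match {
        case '"' => inString = true                  // string literal opened
        case '{' => depth += 1                       // entered a {} clause
        case '}' =>
          depth -= 1                                 // left a {} clause
          if (depth == 0) return Some(sb.toString)   // top-level object complete
        case _   =>
      }
    }
    c = in.read()
  }
  None // stream ended before a complete object
}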
Re: About Error while reading large JSON file in Spark
On 18 Oct 2016, at 08:43, Chetan Khatri wrote:

> Hello Community members,
>
> I am getting an error while reading a large JSON file in Spark.
>
> Code:
>
> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> Error:
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for the same?
>
> Thanks in Advance!

The underlying read code can't handle more than 2^31 bytes in a single line:

  if (bytesConsumed > Integer.MAX_VALUE) {
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  }

That's because it's trying to split the work up by line, and of course, there aren't any lines. You need to move over to reading the JSON by other means, I'm afraid. At a guess, something involving SparkContext.binaryFiles() streaming the data straight into a JSON parser.

Unrelated, but use s3a:// if you can. It's better, you know.
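To sketch what that binaryFiles() route could look like (an untested illustration, not a drop-in fix: binaryFiles() still yields one record per file, so a 6 GB file is read by a single task -- it just never goes near the line splitter; Jackson is already on Spark's classpath):

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import scala.collection.JavaConverters._

// One (path, stream) pair per file; nothing attempts to split by line.
val jsonRecords = sc.binaryFiles("s3n://hist-ngdp/lvisitor/")
  .flatMap { case (_, portableStream) =>
    val mapper = new ObjectMapper()   // built inside the task, so nothing is serialised
    val parser = mapper.getFactory.createParser(portableStream.open())
    // Stream successive top-level objects out of the file,
    // re-emitting each one as a compact single-line JSON string.
    parser.readValuesAs(classOf[JsonNode]).asScala.map(_.toString)
  }

// The records are now one object per line, so the ordinary reader
// can infer a schema from them.
val landingVisitor = sqlContext.read.json(jsonRecords)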
Re: About Error while reading large JSON file in Spark
Dear Xi Shen,

Thank you for getting back to the question. The approach I am following is as below. I have MSSQL Server as the enterprise data lake.

1. Run Java jobs and generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a separate line, so I did

    sed -e 's/}/}\n/g' -s old-file.json > new-file.json

to get every JSON element on a separate line.

2. Uploaded them to an S3 bucket and read from there using the sqlContext.read.json() function, which is where I get the above error.

Note: If I run on small files whose JSON elements have almost the same structure, I do not get this error.

Current approach: splitting the large JSON (6 GB) into 1 GB pieces, then processing those.

Note: Machine size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

Thanks.

On Tue, Oct 18, 2016 at 2:50 PM, Xi Shen wrote:

> It is a plain Java IO error: your line is too long. You should alter your
> JSON schema so that each line is a small JSON object.
>
> Please do not concatenate all the objects into an array and then write the
> array on a single line. You would have difficulty handling such a super-large
> JSON array in Spark anyway: because one array is one object, it cannot be
> split into multiple partitions.
>
> On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri wrote:
>
>> Hello Community members,
>>
>> I am getting an error while reading a large JSON file in Spark.
>>
>> Code:
>>
>> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>>
>> Error:
>>
>> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
>> java.io.IOException: Too many bytes before newline: 2147483648
>> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
>> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>>
>> What would be the resolution for the same?
>>
>> Thanks in Advance!
>>
>> --
>> Yours Aye,
>> Chetan Khatri.
>
> --
> Thanks,
> David S.

--
Yours Aye,
Chetan Khatri.
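P.S. One caveat with the sed trick that I am not sure about: s/}/}\n/g inserts a newline after every "}", including nested ones, so an element like {"a":{"b":1}} comes out split across two lines. A quick, untested sanity check over the converted file (the S3 path here is hypothetical) would be:

import com.fasterxml.jackson.databind.ObjectMapper

// Count lines that are not a complete JSON document on their own.
// readTree throws on fragments such as {"a":{"b":1  or a bare }
val badLineCount = sc.textFile("s3n://hist-ngdp/lvisitor/new-file.json")
  .mapPartitions { lines =>
    val mapper = new ObjectMapper() // one mapper per partition, not per line
    lines.filter { line =>
      line.trim.nonEmpty && {
        try { mapper.readTree(line); false }
        catch { case _: Exception => true }
      }
    }
  }
  .count()

println(s"lines that do not parse as JSON on their own: $badLineCount")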
Re: About Error while reading large JSON file in Spark
It is a plain Java IO error: your line is too long. You should alter your JSON schema so that each line is a small JSON object.

Please do not concatenate all the objects into an array and then write the array on a single line. You would have difficulty handling such a super-large JSON array in Spark anyway: because one array is one object, it cannot be split into multiple partitions.

On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri wrote:

> Hello Community members,
>
> I am getting an error while reading a large JSON file in Spark.
>
> Code:
>
> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>
> Error:
>
> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
> java.io.IOException: Too many bytes before newline: 2147483648
> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>
> What would be the resolution for the same?
>
> Thanks in Advance!
>
> --
> Yours Aye,
> Chetan Khatri.

--
Thanks,
David S.
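P.S. To make the expected layout concrete -- the field names here are invented for illustration -- this is what the line-based JSON reader wants, one complete object per line:

{"visitorId": 1, "page": "/home"}
{"visitorId": 2, "page": "/cart"}

and this is the shape that breaks once the single line passes 2^31 bytes, because the whole array is one object on one line and cannot be split across partitions:

[{"visitorId": 1, "page": "/home"}, {"visitorId": 2, "page": "/cart"}]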
About Error while reading large JSON file in Spark
Hello Community members,

I am getting an error while reading a large JSON file in Spark.

Code:

val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")

Error:

16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)

What would be the resolution for the same?

Thanks in Advance!

--
Yours Aye,
Chetan Khatri.