[
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215027#comment-15215027
]
Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------
Yes, the error does say so. However, I have checked the file: it has `\r\n`
style line endings and otherwise looks perfectly fine.
I am able to process the file correctly using `cut`, but not with the CSV
format reader. The maximum line length in my file is 697 characters and the
minimum is 97.
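For reference, this is the kind of inspection I mean (a minimal sketch; the helper name and the sample bytes are illustrative, not taken from the actual file):

```python
def line_stats(data: bytes):
    """Return (crlf_count, bare_lf_count, max_len, min_len) for CRLF-terminated bytes."""
    lines = [l for l in data.split(b"\r\n") if l]
    # Count any LF that is not part of a CRLF pair.
    bare_lf = data.replace(b"\r\n", b"").count(b"\n")
    lengths = [len(l) for l in lines]
    return (data.count(b"\r\n"), bare_lf, max(lengths), min(lengths))

# Illustrative sample; the real file holds tab-delimited records ending in \r\n.
sample = b"id\ttitle\r\na longer record here\r\n"
print(line_stats(sample))  # → (2, 0, 20, 8)
```

On my file this reports only CRLF endings, no stray bare `\n`, and the 697/97 length figures quoted above.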
It looks like the line-ending characters are causing the issue: Spark is
normalizing the line ending to just "\n" instead of "\r\n".
I ran my command with the following settings and was able to confirm this
intuition:
{code}
In [2]: df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", rowSeparator="\r\n", inputBufferSize=500,
maxColumns=20, maxCharsPerColumn=1000)
16/03/28 17:17:39 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 5)
com.univocity.parsers.common.TextParsingException: Error processing input:
Length of parsed input (1001) exceeds the maximum number of characters defined
in your parser settings (1000).
Identified line separator characters in the parsed content. This may be the
cause of the error. The line separator in your parser settings is set to '\n'.
Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871[\n]
770CA612 Fixed in time and "time in motion": mobility of vision
through a SenseCam lens fixed in time and time in motion mobility of vision
through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861
international conference on human computer interaction interact
43331058 19370[\n]
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555[\n]
085BEC09 HOUDINI: Introducing Object Tracking and Pen
Recognition for LLP Tabletops houdini introducing object tracking and pen
recognition for llp tabletops 2014 2014/06/22
10.1007/978-3-319-07230-2_23 international c
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=36, char=9828. Content
parsed: [I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871
770CA612 Fixed in time and "time in motion": mobility of vision through
a SenseCam lens fixed in time and time in motion mobility of vision through a
sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international
conference on human computer interaction interact 43331058
19370
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555
085BEC09 HOUDINI: Introducing Object Tracking and Pen Recognition for
LLP Tabletops houdini introducing object tracking and pen recognition for
llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23
international c]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
16/03/28 17:17:39 ERROR TaskSetManager: Task 1 in stage 3.0 failed 1 times;
aborting job
16/03/28 17:17:39 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4)
com.univocity.parsers.common.TextParsingException: Error processing input:
org.apache.spark.TaskKilledException - null
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=706, char=197760.
Content parsed: [mexic]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.TaskKilledException
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at
org.apache.spark.sql.execution.datasources.csv.StringIteratorReader.refill(CSVParser.scala:167)
at
org.apache.spark.sql.execution.datasources.csv.StringIteratorReader.read(CSVParser.scala:195)
at
org.apache.spark.sql.execution.datasources.csv.StringIteratorReader.read(CSVParser.scala:215)
at
com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:81)
at
com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:118)
at
com.univocity.parsers.common.input.AbstractCharInputReader.nextChar(AbstractCharInputReader.java:180)
at com.univocity.parsers.csv.CsvParser.parseValue(CsvParser.java:94)
at com.univocity.parsers.csv.CsvParser.parseField(CsvParser.java:179)
at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:75)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:328)
... 18 more
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-cdfcc501837a> in <module>()
----> 1 df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", rowSeparator="\r\n", inputBufferSize=500,
maxColumns=20, maxCharsPerColumn=1000)
/spark/python/pyspark/sql/readwriter.pyc in load(self, path, format, schema,
**options)
133 if type(path) != list:
134 path = [path]
--> 135 return
self._df(self._jreader.load(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
136 else:
137 return self._df(self._jreader.load())
/spark/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py in __call__(self,
*args)
834 answer = self.gateway_client.send_command(command)
835 return_value = get_return_value(
--> 836 answer, self.gateway_client, self.target_id, self.name)
837
838 for temp_arg in temp_args:
/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/spark/python/lib/py4j-0.9.2-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
308 raise Py4JJavaError(
309 "An error occurred while calling {0}{1}{2}.\n".
--> 310 format(target_id, ".", name), value)
311 else:
312 raise Py4JError(
Py4JJavaError: An error occurred while calling o44.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 3.0 failed 1 times, most recent failure: Lost task 1.0 in stage 3.0 (TID
5, localhost): com.univocity.parsers.common.TextParsingException: Error
processing input: Length of parsed input (1001) exceeds the maximum number of
characters defined in your parser settings (1000).
Identified line separator characters in the parsed content. This may be the
cause of the error. The line separator in your parser settings is set to '\n'.
Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871[\n]
770CA612 Fixed in time and "time in motion": mobility of vision
through a SenseCam lens fixed in time and time in motion mobility of vision
through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861
international conference on human computer interaction interact
43331058 19370[\n]
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555[\n]
085BEC09 HOUDINI: Introducing Object Tracking and Pen
Recognition for LLP Tabletops houdini introducing object tracking and pen
recognition for llp tabletops 2014 2014/06/22
10.1007/978-3-319-07230-2_23 international c
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=36, char=9828. Content
parsed: [I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871
770CA612 Fixed in time and "time in motion": mobility of vision through
a SenseCam lens fixed in time and time in motion mobility of vision through a
sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international
conference on human computer interaction interact 43331058
19370
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555
085BEC09 HOUDINI: Introducing Object Tracking and Pen Recognition for
LLP Tabletops houdini introducing object tracking and pen recognition for
llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23
international c]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1445)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1444)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1444)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1666)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1765)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1828)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1060)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1053)
at
org.apache.spark.sql.execution.datasources.csv.CSVInferSchema$.infer(CSVInferSchema.scala:48)
at
org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:69)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:292)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:292)
at scala.Option.orElse(Option.scala:289)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:291)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:162)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.univocity.parsers.common.TextParsingException: Error processing
input: Length of parsed input (1001) exceeds the maximum number of characters
defined in your parser settings (1000).
Identified line separator characters in the parsed content. This may be the
cause of the error. The line separator in your parser settings is set to '\n'.
Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871[\n]
770CA612 Fixed in time and "time in motion": mobility of vision
through a SenseCam lens fixed in time and time in motion mobility of vision
through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861
international conference on human computer interaction interact
43331058 19370[\n]
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555[\n]
085BEC09 HOUDINI: Introducing Object Tracking and Pen
Recognition for LLP Tabletops houdini introducing object tracking and pen
recognition for llp tabletops 2014 2014/06/22
10.1007/978-3-319-07230-2_23 international c
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=36, char=9828. Content
parsed: [I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871
770CA612 Fixed in time and "time in motion": mobility of vision through
a SenseCam lens fixed in time and time in motion mobility of vision through a
sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international
conference on human computer interaction interact 43331058
19370
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555
085BEC09 HOUDINI: Introducing Object Tracking and Pen Recognition for
LLP Tabletops houdini introducing object tracking and pen recognition for
llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23
international c]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
{code}
Can we file this as a separate issue against the CSV reader?
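As a workaround sketch (under the assumption that the parser only honors "\n"), normalizing the line endings before loading avoids the separator mismatch; the helper below is hypothetical, not part of Spark:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Collapse Windows-style CRLF endings to the bare LF the parser expects."""
    return data.replace(b"\r\n", b"\n")

# Illustrative sample, not the real file contents.
sample = b"row one\r\nrow two\r\n"
print(normalize_newlines(sample))  # → b'row one\nrow two\n'
```

Applied to the whole file (read the bytes, write them back with LF endings) before calling `sqlContext.read.load(...)`, this should sidestep the mismatch; I have not verified it against the full input.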
> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master
> branch
> Reporter: Shubhanshu Mishra
> Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following
> command on a large tab-separated file, the contents of the file are
> written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false",
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:> (0 + 2)
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input:
> Length of parsed input (1000001) exceeds the maximum number of characters
> defined in your parser settings (1000000). Identified line separator
> characters in the parsed content. This may be the cause of the error. The
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in
> mobile location sharing applications privacy shake a haptic interface
> for managing privacy settings in mobile location sharing applications 2010
> 2010/09/07 international conference on human computer
> interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology
> Acceptance Studies on Social Network Services between the profiles
> another such bias technology acceptance studies on social network services
> 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international
> conference on human-computer interaction interact 43331058
> 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration
> interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
> aborting job
> ^M[Stage 1:> (0 + 1)
> / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error,
> but as soon as I go above 100,000 lines, the error appears.
> I don't think Spark should ever write the actual data to stderr, as it
> hurts readability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]