[
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215027#comment-15215027
]
Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------
Yes, the error does say so. However, I have checked the file: it has `\r\n`
style line endings and otherwise looks perfectly fine.
I am able to process the file correctly using `cut`, but not with the CSV
format reader. The maximum line length in my file is 697 characters and the
minimum is 97.
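For reference, this is the kind of inspection I mean (a minimal sketch; the helper name and the sample bytes are illustrative, not taken from the actual file):

```python
def line_stats(data: bytes):
    """Return (crlf_count, bare_lf_count, max_len, min_len) for CRLF-terminated bytes."""
    lines = [l for l in data.split(b"\r\n") if l]
    # Count any LF that is not part of a CRLF pair.
    bare_lf = data.replace(b"\r\n", b"").count(b"\n")
    lengths = [len(l) for l in lines]
    return (data.count(b"\r\n"), bare_lf, max(lengths), min(lengths))

# Illustrative sample; the real file holds tab-delimited records ending in \r\n.
sample = b"id\ttitle\r\na longer record here\r\n"
print(line_stats(sample))  # → (2, 0, 20, 8)
```

On my file this reports only CRLF endings, no stray bare `\n`, and the 697/97 length figures quoted above.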
It looks like the line-ending characters are causing the issue: Spark is
normalizing the line ending to just "\n" instead of "\r\n".
I ran my command with the following settings and was able to confirm this
intuition:
{code}
In [2]: df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", rowSeparator="\r\n", inputBufferSize=500,
maxColumns=20, maxCharsPerColumn=1000)
16/03/28 17:17:39 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 5)
com.univocity.parsers.common.TextParsingException: Error processing input:
Length of parsed input (1001) exceeds the maximum number of characters defined
in your parser settings (1000).
Identified line separator characters in the parsed content. This may be the
cause of the error. The line separator in your parser settings is set to '\n'.
Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871[\n]
770CA612 Fixed in time and "time in motion": mobility of vision
through a SenseCam lens fixed in time and time in motion mobility of vision
through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861
international conference on human computer interaction interact
43331058 19370[\n]
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555[\n]
085BEC09 HOUDINI: Introducing Object Tracking and Pen
Recognition for LLP Tabletops houdini introducing object tracking and pen
recognition for llp tabletops 2014 2014/06/22
10.1007/978-3-319-07230-2_23 international c
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=36, char=9828. Content
parsed: [I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871
770CA612 Fixed in time and "time in motion": mobility of vision through
a SenseCam lens fixed in time and time in motion mobility of vision through a
sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international
conference on human computer interaction interact 43331058
19370
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555
085BEC09 HOUDINI: Introducing Object Tracking and Pen Recognition for
LLP Tabletops houdini introducing object tracking and pen recognition for
llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23
international c]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
16/03/28 17:17:39 ERROR TaskSetManager: Task 1 in stage 3.0 failed 1 times;
aborting job
16/03/28 17:17:39 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4)
com.univocity.parsers.common.TextParsingException: Error processing input:
org.apache.spark.TaskKilledException - null
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=706, char=197760.
Content parsed: [mexic]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.TaskKilledException
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at
org.apache.spark.sql.execution.datasources.csv.StringIteratorReader.refill(CSVParser.scala:167)
at
org.apache.spark.sql.execution.datasources.csv.StringIteratorReader.read(CSVParser.scala:195)
at
org.apache.spark.sql.execution.datasources.csv.StringIteratorReader.read(CSVParser.scala:215)
at
com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:81)
at
com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:118)
at
com.univocity.parsers.common.input.AbstractCharInputReader.nextChar(AbstractCharInputReader.java:180)
at com.univocity.parsers.csv.CsvParser.parseValue(CsvParser.java:94)
at com.univocity.parsers.csv.CsvParser.parseField(CsvParser.java:179)
at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:75)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:328)
... 18 more
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-cdfcc501837a> in <module>()
----> 1 df = sqlContext.read.load("temp.txt", format="csv", header="false",
inferSchema="true", delimiter="\t", rowSeparator="\r\n", inputBufferSize=500,
maxColumns=20, maxCharsPerColumn=1000)
/spark/python/pyspark/sql/readwriter.pyc in load(self, path, format, schema,
**options)
133 if type(path) != list:
134 path = [path]
--> 135 return
self._df(self._jreader.load(self._sqlContext._sc._jvm.PythonUtils.toSeq(path)))
136 else:
137 return self._df(self._jreader.load())
/spark/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py in __call__(self,
*args)
834 answer = self.gateway_client.send_command(command)
835 return_value = get_return_value(
--> 836 answer, self.gateway_client, self.target_id, self.name)
837
838 for temp_arg in temp_args:
/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/spark/python/lib/py4j-0.9.2-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
308 raise Py4JJavaError(
309 "An error occurred while calling {0}{1}{2}.\n".
--> 310 format(target_id, ".", name), value)
311 else:
312 raise Py4JError(
Py4JJavaError: An error occurred while calling o44.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 3.0 failed 1 times, most recent failure: Lost task 1.0 in stage 3.0 (TID
5, localhost): com.univocity.parsers.common.TextParsingException: Error
processing input: Length of parsed input (1001) exceeds the maximum number of
characters defined in your parser settings (1000).
Identified line separator characters in the parsed content. This may be the
cause of the error. The line separator in your parser settings is set to '\n'.
Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871[\n]
770CA612 Fixed in time and "time in motion": mobility of vision
through a SenseCam lens fixed in time and time in motion mobility of vision
through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861
international conference on human computer interaction interact
43331058 19370[\n]
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555[\n]
085BEC09 HOUDINI: Introducing Object Tracking and Pen
Recognition for LLP Tabletops houdini introducing object tracking and pen
recognition for llp tabletops 2014 2014/06/22
10.1007/978-3-319-07230-2_23 international c
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=36, char=9828. Content
parsed: [I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871
770CA612 Fixed in time and "time in motion": mobility of vision through
a SenseCam lens fixed in time and time in motion mobility of vision through a
sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international
conference on human computer interaction interact 43331058
19370
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555
085BEC09 HOUDINI: Introducing Object Tracking and Pen Recognition for
LLP Tabletops houdini introducing object tracking and pen recognition for
llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23
international c]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1445)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1444)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1444)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1666)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1765)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1828)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1060)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1053)
at
org.apache.spark.sql.execution.datasources.csv.CSVInferSchema$.infer(CSVInferSchema.scala:48)
at
org.apache.spark.sql.execution.datasources.csv.DefaultSource.inferSchema(DefaultSource.scala:69)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:292)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:292)
at scala.Option.orElse(Option.scala:289)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:291)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:162)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:290)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.univocity.parsers.common.TextParsingException: Error processing
input: Length of parsed input (1001) exceeds the maximum number of characters
defined in your parser settings (1000).
Identified line separator characters in the parsed content. This may be the
cause of the error. The line separator in your parser settings is set to '\n'.
Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871[\n]
770CA612 Fixed in time and "time in motion": mobility of vision
through a SenseCam lens fixed in time and time in motion mobility of vision
through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861
international conference on human computer interaction interact
43331058 19370[\n]
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555[\n]
085BEC09 HOUDINI: Introducing Object Tracking and Pen
Recognition for LLP Tabletops houdini introducing object tracking and pen
recognition for llp tabletops 2014 2014/06/22
10.1007/978-3-319-07230-2_23 international c
Parser Configuration: CsvParserSettings:
Column reordering enabled=true
Empty value=null
Header extraction enabled=false
Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Line separator detection enabled=false
Maximum number of characters per column=1000
Maximum number of columns=20
Null value=
Number of records to read=all
Parse unescaped quotes=true
Row processor=none
Selected fields=none
Skip empty lines=trueFormat configuration:
CsvFormat:
Comment character=\0
Field delimiter=\t
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=quote escape
Quote escape escape character=\0, line=36, char=9828. Content
parsed: [I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn
by turn pedestrian navigation 2010 2010/09/07
10.1145/1851600.1851660 international conference on human computer interaction
interact 43331058 18871
770CA612 Fixed in time and "time in motion": mobility of vision through
a SenseCam lens fixed in time and time in motion mobility of vision through a
sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international
conference on human computer interaction interact 43331058
19370
7B5DE5DE Assistive Wearable Technology for Visually Impaired
assistive wearable technology for visually impaired 2015 2015/08/24
international conference on human computer interaction interact
43331058 19555
085BEC09 HOUDINI: Introducing Object Tracking and Pen Recognition for
LLP Tabletops houdini introducing object tracking and pen recognition for
llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23
international c]
at
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
at
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
at
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
at
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
{code}
Can we file this as a separate issue against the CSV reader?
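As a workaround sketch (under the assumption that the parser only honors "\n"), normalizing the line endings before loading avoids the separator mismatch; the helper below is hypothetical, not part of Spark:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Collapse Windows-style CRLF endings to the bare LF the parser expects."""
    return data.replace(b"\r\n", b"\n")

# Illustrative sample, not the real file contents.
sample = b"row one\r\nrow two\r\n"
print(normalize_newlines(sample))  # → b'row one\nrow two\n'
```

Applied to the whole file (read the bytes, write them back with LF endings) before calling `sqlContext.read.load(...)`, this should sidestep the mismatch; I have not verified it against the full input.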
> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master
> branch
> Reporter: Shubhanshu Mishra
> Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following
> command on a large tab-separated file, the contents of the file are
> written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false",
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:> (0 + 2)
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input:
> Length of parsed input (1000001) exceeds the maximum number of characters
> defined in your parser settings (1000000). Identified line separator
> characters in the parsed content. This may be the cause of the error. The
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in
> mobile location sharing applications privacy shake a haptic interface
> for managing privacy settings in mobile location sharing applications 2010
> 2010/09/07 international conference on human computer
> interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology
> Acceptance Studies on Social Network Services between the profiles
> another such bias technology acceptance studies on social network services
> 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international
> conference on human-computer interaction interact 43331058
> 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration
> interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
> aborting job
> ^M[Stage 1:> (0 + 1)
> / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error,
> but as soon as I go above 100,000 lines, the error appears.
> I don't think Spark should ever write the actual data to stderr, as it
> hurts readability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]