[jira] [Created] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython

Shubhanshu Mishra (JIRA) Wed, 23 Mar 2016 13:24:12 -0700

Shubhanshu Mishra created SPARK-14103:
-----------------------------------------


             Summary: Python DataFrame CSV load on large file is writing to console in Ipython
                 Key: SPARK-14103
                 URL: https://issues.apache.org/jira/browse/SPARK-14103
             Project: Spark
          Issue Type: Bug
          Components: PySpark
         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master branch
            Reporter: Shubhanshu Mishra


I am using Spark built from the master branch. When I run the following 
command on a large tab-separated file, the contents of the file are written 
to stderr:

{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t")
{code}

Here is a sample of output:

{code}
^M[Stage 1:>                                                          (0 + 2) / 
2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
com.univocity.parsers.common.TextParsingException: Error processing input: 
Length of parsed input (1000001) exceeds the maximum number of characters 
defined in your parser settings (1000000). Identified line separator characters 
in the parsed content. This may be the cause of the error. The line separator 
in your parser settings is set to '\n'. Parsed content:
        Privacy-shake",: a haptic interface for managing privacy settings in 
mobile location sharing applications       privacy shake a haptic interface for 
managing privacy settings in mobile location sharing applications  2010    
2010/09/07              international conference on human computer interaction  
interact                43331058        19371[\n]        3D4F6CA1        
Between the Profiles: Another such Bias. Technology Acceptance Studies on 
Social Network Services       between the profiles another such bias technology 
acceptance studies on social network services 2015    2015/08/02      
10.1007/978-3-319-21383-5_12    international conference on human-computer 
interaction  interact                43331058        19502[\n]

.......

.........

web snippets    2008    2008/05/04      10.1007/978-3-642-01344-7_13    
international conference on web information systems and technologies    webist  
        44F29802        19489
06FA3FFA        Interactive 3D User Interfaces for Neuroanatomy Exploration     
interactive 3d user interfaces for neuroanatomy exploration     2009            
        internationa]
        at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
        at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
        at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
        at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
        at scala.collection.Iterator$class.foreach(Iterator.scala:742)
        at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
        at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
        at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
        at 
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
        at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
        at 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
        at 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
        at 
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
        at 
org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
        at org.apache.spark.scheduler.Task.run(Task.scala:82)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; 
aborting job
^M[Stage 1:>                                                          (0 + 1) / 
2]


{code}


For a small sample of the data (<10,000 lines), I do not get any error. But 
once the sample grows beyond 100,000 lines, I start getting the error. 

I don't think Spark should ever write the actual data to stderr, as it makes 
the log output much harder to read. 
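
One possible way to narrow down which rows trigger the error is sketched below. It rests on an assumption that is not confirmed anywhere in this report: that a stray double quote (such as the one in `Privacy-shake",` in the parsed content above) makes the parser treat the following lines as a single quoted field until the 1,000,000-character limit is hit. The function name `find_unbalanced_quote_lines` and the two sample rows (including their IDs) are hypothetical and only for illustration. The sketch uses plain Python and does not touch Spark.

```python
def find_unbalanced_quote_lines(lines, quote='"'):
    """Return (line_number, text) pairs for lines with an odd number of quotes.

    A line with an odd count of quote characters may open a quoted field
    that never closes on that line. This is a heuristic, not a CSV parser.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        if line.count(quote) % 2 == 1:
            suspects.append((number, line.rstrip("\n")))
    return suspects


# Hypothetical tab-separated rows; the first contains a stray double quote.
sample_lines = [
    'A1\tPrivacy-shake",: a haptic interface\t2010\n',
    'B2\tBetween the Profiles\t2015\n',
]

for number, text in find_unbalanced_quote_lines(sample_lines):
    print("line %d: %s" % (number, text))

# To scan a real file (file name taken from the command above), pass the
# open file object, which yields one line at a time:
#     with open("temp.txt") as handle:
#         suspects = find_unbalanced_quote_lines(handle)
```

If the scan reports nothing for the full file, the assumption about unbalanced quotes probably does not hold and the cause lies elsewhere.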


