Shubhanshu Mishra created SPARK-14103:
-----------------------------------------
Summary: Python DataFrame CSV load on large file is writing to console in IPython
Key: SPARK-14103
URL: https://issues.apache.org/jira/browse/SPARK-14103
Project: Spark
Issue Type: Bug
Components: PySpark
Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from master branch
Reporter: Shubhanshu Mishra
I am using Spark from the master branch. When I run the following command on a
large tab-separated file, the contents of the file are written to stderr:
{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t")
{code}
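The error below notes that line separator characters were identified inside the parsed content, which suggests one column is absorbing multiple physical lines. A minimal plain-Python sketch of that failure mode (using the standard csv module, not Spark, purely for illustration): a quoted field containing newlines is parsed as a single column, so one column can grow without bound even when every physical line is short.

```python
import csv
import io

# A quoted field with embedded newlines: the parser absorbs several
# physical lines into a single column, so one column can exceed a
# per-column character cap even though each physical line is short.
data = 'a\t"start of field\nline 2\nline 3"\tend\n'
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

print(len(rows))               # one logical row
print(rows[0][1].count("\n"))  # two embedded newlines in one field
```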
Here is a sample of output:
{code}
^M[Stage 1:> (0 + 2) / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
com.univocity.parsers.common.TextParsingException: Error processing input:
Length of parsed input (1000001) exceeds the maximum number of characters
defined in your parser settings (1000000). Identified line separator characters
in the parsed content. This may be the cause of the error. The line separator
in your parser settings is set to '\n'. Parsed content:
Privacy-shake",: a haptic interface for managing privacy settings in
mobile location sharing applications privacy shake a haptic interface for
managing privacy settings in mobile location sharing applications 2010
2010/09/07 international conference on human computer interaction
interact 43331058 19371[\n] 3D4F6CA1
Between the Profiles: Another such Bias. Technology Acceptance Studies on
Social Network Services between the profiles another such bias technology
acceptance studies on social network services 2015 2015/08/02
10.1007/978-3-319-21383-5_12 international conference on human-computer
interaction interact 43331058 19502[\n]
.......
.........
web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13
international conference on web information systems and technologies webist
44F29802 19489
06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration
interactive 3d user interfaces for neuroanatomy exploration 2009
internationa]
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
	at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
	at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
	at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
	at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
	at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
	at org.apache.spark.scheduler.Task.run(Task.scala:82)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException
16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
aborting job
^M[Stage 1:> (0 + 1) / 2]
{code}
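The exception above is univocity's per-column character cap (1,000,000 characters, as quoted in the message) being exceeded. A rough plain-Python sketch of that check, just to make the mechanism concrete; `parse_line` is a hypothetical stand-in, not a Spark or univocity API:

```python
# Sketch of the limit named in the error message: the parser rejects
# any column longer than a fixed character cap (1,000,000 by default).
# parse_line is a hypothetical helper, not a Spark or univocity API.
MAX_CHARS_PER_COLUMN = 1000000

def parse_line(line, delimiter="\t", limit=MAX_CHARS_PER_COLUMN):
    fields = line.split(delimiter)
    for field in fields:
        if len(field) > limit:
            raise ValueError(
                "Length of parsed input (%d) exceeds the maximum number "
                "of characters defined in your parser settings (%d)"
                % (len(field), limit)
            )
    return fields

print(parse_line("a\tb\tc"))  # a short row parses fine

try:
    # A row whose separators were never detected becomes one huge field.
    parse_line("x" * (MAX_CHARS_PER_COLUMN + 1))
except ValueError as exc:
    print("rejected: %s" % exc)
```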
With a small sample of the data (<10,000 lines) I do not get any error, but
once the input grows past about 100,000 lines the error appears.
I don't think Spark should ever write the actual data to stderr, as it makes
the logs unreadable.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]