[
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-14103.
-------------------------------
Resolution: Not A Problem
I think you just have a line separator problem or similar. The parser is complaining
that it can't parse the input because it is treating the whole thing as one big line.
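One quick way to check for a line separator problem is to count the line terminators in the raw bytes of the file. The following is only a minimal sketch and is not part of Spark: the function name `count_line_endings` and its `sample_bytes` parameter are hypothetical names chosen for illustration.

```python
def count_line_endings(path, sample_bytes=1000000):
    """Count CRLF, bare LF, and bare CR terminators in the start of a file.

    Hypothetical helper for illustration; only the first `sample_bytes`
    bytes are inspected, so the counts describe a sample, not the whole file.
    """
    with open(path, "rb") as fh:
        data = fh.read(sample_bytes)
        # Avoid miscounting a CRLF pair that straddles the sample boundary.
        if data.endswith(b"\r"):
            data += fh.read(1)
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes that are not part of CRLF
    cr = data.count(b"\r") - crlf   # CR bytes that are not part of CRLF
    return {"crlf": crlf, "lf": lf, "cr": cr}
```

Calling `count_line_endings("temp.txt")` on the file from the report returns a small dictionary of counts. A file whose sample shows only `cr` terminators would look like one very long line to a parser whose line separator is set to '\n', as it is in the error message quoted below.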
> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master
> branch
> Reporter: Shubhanshu Mishra
> Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark built from the master branch. When I run the following
> command on a large tab-separated file, the contents of the file are
> written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false",
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of the output:
> {code}
> ^M[Stage 1:> (0 + 2) / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> com.univocity.parsers.common.TextParsingException: Error processing input:
> Length of parsed input (1000001) exceeds the maximum number of characters
> defined in your parser settings (1000000). Identified line separator
> characters in the parsed content. This may be the cause of the error. The
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in
> mobile location sharing applications privacy shake a haptic interface
> for managing privacy settings in mobile location sharing applications 2010
> 2010/09/07 international conference on human computer
> interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology
> Acceptance Studies on Social Network Services between the profiles
> another such bias technology acceptance studies on social network services
> 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international
> conference on human-computer interaction interact 43331058
> 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration
> interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
> ^M[Stage 1:> (0 + 1) / 2]
> {code}
> For a small sample (<10,000 lines) of the data, I do not get any error.
> But as soon as I go above 100,000 lines, I start getting the
> error.
> I don't think Spark should ever write the actual data to stderr,
> as it makes the output much harder to read.
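The size dependence described in the report would be consistent with a single malformed record somewhere beyond the first 10,000 lines. One possibility, which is a guess and not something confirmed in this thread, is an unbalanced double quote in a field: the parsed content in the log begins with `Privacy-shake",`, and a parser that believes it is inside a quoted field keeps reading across newlines until it hits its character limit. A minimal sketch for locating such lines follows; the function name `lines_with_unbalanced_quotes` and its parameters are hypothetical, and the odd-count check is only a heuristic.

```python
def lines_with_unbalanced_quotes(path, quote='"', encoding="utf-8"):
    """Yield (line_number, line) for lines with an odd number of quote characters.

    Hypothetical helper for illustration. An odd count is only a hint that a
    quoted field was never closed; it does not prove the line is malformed.
    """
    with open(path, encoding=encoding, errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if line.count(quote) % 2 == 1:
                yield number, line.rstrip("\n")
```

For example, `list(lines_with_unbalanced_quotes("temp.txt"))` would list the suspicious lines with their line numbers. If any turn up, the CSV reader's `quote` option may be relevant; check the documentation for the Spark version in use before relying on it.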