[
https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222792#comment-15222792
]
Hyukjin Kwon edited comment on SPARK-14103 at 4/2/16 1:51 PM:
--------------------------------------------------------------
[[email protected]] Right, it looks like an issue in Univocity Parser.
I could reproduce this error with the data below:
{code}
"a"b ccc ddd
{code}
and code below:
{code}
val path = "temp.tsv"
sqlContext.read
.format("csv")
.option("maxCharsPerColumn", "4")
.option("delimiter", "\t")
.load(path)
{code}
It looks like the Univocity parser gets confused when it encounters a {{quote}} character
while parsing a value and the value does not end with that character. When this happens,
it treats all the rows and values that follow as part of a single quoted value.
So it looks like your data contains such rows, for example:
{code}
7C0E15CD "I did it my way": moving away from the tyranny of turn-by-turn
pedestrian navigation i did it my way moving away from the tyranny of turn by
turn pedestrian navigation 2010 2010/09/07 10.1145/1851600.1851660
international conference on human computer interaction interact
43331058 18871
{code}
All the data after {{"I did it my way}} was being treated as a quoted value.
[~sowen] Actually, the use of the Univocity parser has been a bit questionable to me. It
does seem generally true that the library itself is faster than the Apache Commons CSV
parser, but it has brought code complexity, the additional logic needed to use Univocity
is pretty messy at the moment, and issues like this have become quite difficult to track
down.
I am thinking about switching from Univocity to the Apache Commons CSV parser after
running performance tests. Do you think this makes sense?
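As a rough idea of the comparison I have in mind, something along these lines could be a starting point. This is only a sketch: the commons-csv dependency, the object name, the file path, and the naive wall-clock timing are assumptions for illustration, not an actual benchmark or measurement.
{code}
import java.io.FileReader
import scala.collection.JavaConverters._
import org.apache.commons.csv.CSVFormat
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Hypothetical rough timing of both parsers over the same TSV file.
object CsvParserComparison extends App {
  val path = "temp.tsv"  // placeholder path

  // Very naive wall-clock timing; a real test would use proper benchmarking.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-12s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  val univocityRows = time("univocity") {
    val settings = new CsvParserSettings()
    settings.getFormat.setDelimiter('\t')
    new CsvParser(settings).parseAll(new FileReader(path)).size
  }

  val commonsRows = time("commons-csv") {
    CSVFormat.DEFAULT.withDelimiter('\t').parse(new FileReader(path)).asScala.size
  }

  println(s"rows parsed: univocity=$univocityRows, commons-csv=$commonsRows")
}
{code}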
> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
> Key: SPARK-14103
> URL: https://issues.apache.org/jira/browse/SPARK-14103
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master
> branch
> Reporter: Shubhanshu Mishra
> Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following
> command on a large tab-separated file, the contents of the file are written
> to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false",
> inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of output:
> {code}
> ^M[Stage 1:> (0 + 2)
> / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID
> 2)
> com.univocity.parsers.common.TextParsingException: Error processing input:
> Length of parsed input (1000001) exceeds the maximum number of characters
> defined in your parser settings (1000000). Identified line separator
> characters in the parsed content. This may be the cause of the error. The
> line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in
> mobile location sharing applications privacy shake a haptic interface
> for managing privacy settings in mobile location sharing applications 2010
> 2010/09/07 international conference on human computer
> interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology
> Acceptance Studies on Social Network Services between the profiles
> another such bias technology acceptance studies on social network services
> 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international
> conference on human-computer interaction interact 43331058
> 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13
> international conference on web information systems and technologies
> webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration
> interactive 3d user interfaces for neuroanatomy exploration 2009
> internationa]
> at
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> at
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> at
> org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at
> org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> at org.apache.spark.scheduler.Task.run(Task.scala:82)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
> aborting job
> ^M[Stage 1:> (0 + 1)
> / 2]
> {code}
> For a small sample of the data (<10,000 lines), I am not getting any error.
> But as soon as I go above 100,000 samples, I start getting the error.
> I don't think the Spark platform should ever output the actual data to
> stderr, as it decreases readability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]