Frank Dai created SPARK-11544:
---------------------------------
Summary: sqlContext doesn't use PathFilter
Key: SPARK-11544
URL: https://issues.apache.org/jira/browse/SPARK-11544
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Environment: AWS EMR 4.1.0, Spark 1.5.0
Reporter: Frank Dai
When {{sqlContext}} reads JSON files, it does not use the {{PathFilter}} configured on the underlying {{SparkContext}}:
{code:java}
val sc = new SparkContext(conf)
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[TmpFileFilter], classOf[PathFilter])
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
{code}
The definition of {{TmpFileFilter}} is:
{code:title=TmpFileFilter.scala|borderStyle=solid}
import org.apache.hadoop.fs.{Path, PathFilter}
class TmpFileFilter extends PathFilter {
  // Skip temporary files that are still being written
  override def accept(path: Path): Boolean = !path.getName.endsWith(".tmp")
}
{code}
When using {{sqlContext}} to read JSON files, e.g.,
{{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark throws an
exception:
{quote}
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp
{quote}
It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} cannot, which
causes the above exception.
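A possible workaround (an untested sketch, reusing the {{TmpFileFilter}}, {{s3Path}}, and {{mySchema}} names from above) is to load the files through {{sc.textFile}}, which goes through {{TextInputFormat}} and therefore honors the configured {{PathFilter}}, and then pass the resulting {{RDD[String]}} to {{DataFrameReader.json}}:
{code:java}
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf())
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
  classOf[TmpFileFilter], classOf[PathFilter])
val sqlContext = new SQLContext(sc)

// textFile applies the PathFilter during input-split listing,
// so .tmp files are dropped before any JSON parsing happens
val jsonLines = sc.textFile(s3Path)
val df = sqlContext.read.schema(mySchema).json(jsonLines)
{code}
This avoids the code path where {{sqlContext}} lists the input files itself, but at the cost of an extra RDD pass compared to {{json(s3Path)}}.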
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)