Maxim Gekk created SPARK-24068:
----------------------------------

             Summary: CSV schema inferring doesn't work for compressed files
                 Key: SPARK-24068
                 URL: https://issues.apache.org/jira/browse/SPARK-24068
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Maxim Gekk

Here is a simple CSV file compressed with lzop:
{code}
$ cat ./test.csv
col1,col2
a,1
$ lzop ./test.csv
$ ls
test.csv  test.csv.lzo
{code}

Reading test.csv.lzo with the LZO codec (see https://github.com/twitter/hadoop-lzo, for example) infers a wrong schema:
{code:scala}
scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
ds: org.apache.spark.sql.DataFrame = [�LZO?: string]

scala> ds.printSchema
root
 |-- �LZO: string (nullable = true)

scala> ds.show
+-----+
|�LZO|
+-----+
|    a|
+-----+
{code}

However, the file is read correctly if the schema is specified explicitly:
{code}
scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)
scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
scala> ds.show
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
{code}

For comparison, schema inference works for the original uncompressed file:
{code:scala}
scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
{code}
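
One possible reading of the `�LZO` column name, offered as an assumption and not confirmed in this report: lzop files begin with a fixed 9-byte magic number, 0x89 'L' 'Z' 'O' 0x00 0x0d 0x0a 0x1a 0x0a. The byte 0x89 is not valid UTF-8 on its own, so a text reader would show it as `�`, followed by `LZO`, a NUL byte, and then a CR LF that ends the "first line". That matches the inferred header and suggests the inference path reads the raw compressed bytes as text. The sketch below fabricates those bytes with `printf` so it does not need lzop installed; the file name `lzop_magic.bin` is a hypothetical stand-in for the start of test.csv.lzo.

```shell
# Write the 9-byte lzop magic number using octal escapes:
# 0x89 'L' 'Z' 'O' 0x00 0x0d 0x0a 0x1a 0x0a
printf '\211LZO\000\r\n\032\n' > lzop_magic.bin

# Dump the first 5 bytes (everything before the CR LF) as hex,
# stripping od's padding so the result is one contiguous string.
magic_hex=$(od -An -tx1 -N5 lzop_magic.bin | tr -d ' \n')
echo "$magic_hex"
```

Running the same `od` command against a real test.csv.lzo should show the same leading bytes if the file was produced by lzop.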