[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971893#comment-15971893 ] Stephane Maarek commented on SPARK-20287:

[~c...@koeninger.org] It makes sense. I didn't realize that in direct streams the driver is in charge of assigning metadata to the executors so they can pull data. So yes, you're right: it's "incompatible" with the Kafka way of being "master-free", where each consumer doesn't know and shouldn't care how many other consumers there are. I think this ticket can now be closed (just re-open it if you disagree). It may be worth opening a KIP on Kafka to add APIs that would let Spark be a bit more "optimized", but it all seems okay for now. Cheers!

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka consumer is created for each topic partition in the source Kafka topics, and the consumers are cached.
> cf. https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of Kafka consumers, which is that Kafka automatically assigns and balances partitions amongst all the consumers that share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD architecture, but from a Kafka standpoint, I believe creating one consumer per partition is a big overhead, and a risk, as the user may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity parameter.
> Happy to discuss to understand the rationale

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
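The driver-side model discussed above (the driver hands each executor the topic-partition metadata it should pull) can be sketched as a plain round-robin planner. This is an illustrative sketch only, not Spark code; all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a driver distributing topic-partitions to executors.
// In this model the executors have no group coordination: they consume exactly
// the partitions the driver assigns them.
public class DriverAssignmentSketch {

    // Spread topic-partitions round-robin across executor ids.
    static Map<String, List<String>> assign(List<String> topicPartitions,
                                            List<String> executors) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String e : executors) {
            plan.put(e, new ArrayList<>());
        }
        for (int i = 0; i < topicPartitions.size(); i++) {
            // i-th partition goes to the (i mod N)-th executor
            plan.get(executors.get(i % executors.size())).add(topicPartitions.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> tps = Arrays.asList("t-0", "t-1", "t-2", "t-3", "t-4");
        System.out.println(assign(tps, Arrays.asList("exec-1", "exec-2")));
    }
}
```

With 5 partitions and 2 executors, exec-1 receives partitions 0, 2, 4 and exec-2 receives 1, 3; the point is that the split is computed centrally, not negotiated by the consumers themselves.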
[jira] [Closed] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek closed SPARK-20287.
Resolution: Not A Problem
[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966973#comment-15966973 ] Stephane Maarek commented on SPARK-20287:

[~c...@koeninger.org] How about using the subscribe pattern? https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

```
public void subscribe(Collection<String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. Topic subscriptions are not incremental. This list will replace the current assignment (if there is one). It is not possible to combine topic subscription with group management with manual partition assignment through assign(Collection). If the given list of topics is empty, it is treated the same as unsubscribe().
```

Then you let Kafka handle the partition assignments? As all the consumers share the same group.id, the data would effectively be distributed between every Spark instance. But I guess you may have already explored that option and it goes against the Spark DirectStream API? (I'm not a Spark expert, just trying to understand the limitations. I believe you when you say you did it the most straightforward way.)
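For contrast, the group-management behaviour that subscribe() delegates to Kafka can be approximated by range-style assignment: the partitions of a topic are split contiguously across the sorted members of the group. Below is a simplified single-topic sketch with illustrative names; it is not the Kafka coordinator implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Kafka-style "range" assignment: what the group
// coordinator computes when every consumer simply subscribes with the same
// group.id, instead of a driver pinning partitions to workers.
public class RangeAssignorSketch {

    // Assign numPartitions partitions of one topic contiguously across
    // consumer ids, sorted for determinism.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted);

        int per = numPartitions / sorted.size();    // base share per consumer
        int extra = numPartitions % sorted.size();  // first `extra` members get one more

        Map<String, List<Integer>> out = new LinkedHashMap<>();
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            int take = per + (i < extra ? 1 : 0);
            List<Integer> parts = new ArrayList<>();
            for (int p = 0; p < take; p++) {
                parts.add(next++);
            }
            out.put(sorted.get(i), parts);
        }
        return out;
    }
}
```

When a consumer joins or leaves, the coordinator reruns this computation over the new membership, which is the rebalance behaviour debated elsewhere in this thread.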
[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963939#comment-15963939 ] Stephane Maarek commented on SPARK-20287:

The other issue I can see is the coordinator work needed to re-coordinate XX Kafka consumers should one go down. That's more expensive with 100 consumers than with a few. But as you said, this should be driven by actual performance limitations; right now it would be speculation.
[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963938#comment-15963938 ] Stephane Maarek commented on SPARK-20287:

[~srowen] Those are good points. In the case of 100 separate machines running 100 tasks, I agree you have 100 Kafka consumers no matter what. As you said, my optimisation would apply when tasks on the same machine could share a Kafka consumer. My concern, as you said, is the number of connections opened to Kafka, which might be high even when not needed. I understand that one Kafka consumer distributing to multiple tasks may bind them together on the receive side, and as I'm not a Spark expert I can't measure the performance implications of that. My concern then is with the spark.streaming.kafka.consumer.cache.maxCapacity parameter. Is it truly needed? Say one executor consumes from 100 partitions: do we really need a maxCapacity parameter? The executor should just spin up as many consumers as needed. Likewise, in a distributed context, can't the individual executors figure out how many Kafka consumers they need? Thanks for the discussion, I appreciate it.
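The maxCapacity question above concerns a bounded cache of consumers keyed by topic-partition. A minimal sketch of such a cache, assuming an LRU eviction policy (class and key names are illustrative; the real CachedKafkaConsumer may differ in detail):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a bounded consumer cache: entries are keyed by a
// "topic-partition" string, and once size exceeds maxCapacity (standing in for
// spark.streaming.kafka.consumer.cache.maxCapacity) the least-recently-used
// entry is evicted.
public class ConsumerCache<V> extends LinkedHashMap<String, V> {
    private final int maxCapacity;

    public ConsumerCache(int maxCapacity) {
        // accessOrder=true makes iteration order least-recently-accessed first
        super(16, 0.75f, true);
        this.maxCapacity = maxCapacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // In real code the evicted consumer would also be closed here.
        return size() > maxCapacity;
    }
}
```

The capacity bound exists because each cached entry holds an open connection; the trade-off debated above is whether the executor could instead size this pool itself from the partitions it is asked to read.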
[jira] [Created] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
Stephane Maarek created SPARK-20287:
Summary: Kafka Consumer should be able to subscribe to more than one topic partition
Key: SPARK-20287
URL: https://issues.apache.org/jira/browse/SPARK-20287
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Stephane Maarek
[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604903#comment-15604903 ] Stephane Maarek commented on SPARK-18068:

Thanks for the links, guys! Really helpful.

> Spark SQL doesn't parse some ISO 8601 formatted dates
> ---
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Stephane Maarek
> Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
> at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source)
> at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source)
> at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
> at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source)
> at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
> at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
> at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
> at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
> at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604443#comment-15604443 ] Stephane Maarek commented on SPARK-18068:

Would be awesome to expose it in the online docs!
[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604400#comment-15604400 ] Stephane Maarek commented on SPARK-18068:

[~hyukjin.kwon] Thanks! I didn't see this option; where is it documented? It's a nice workaround for the time being. It would be good if a default implementation supported any ISO 8601-compliant timestamp :)
[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600732#comment-15600732 ] Stephane Maarek commented on SPARK-18068:

I see TimestampType is a wrapper for java.sql.Timestamp. It seems it can't parse a string without seconds.

{code}
scala> import java.sql.Timestamp
import java.sql.Timestamp

scala> Timestamp.valueOf("2016-10-07T11:15Z")
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
  at java.sql.Timestamp.valueOf(Timestamp.java:204)
  ... 32 elided
{code}

A workaround would be to first convert to a date using the proper Java 8 API and then pass it to the java.sql.Timestamp class.
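The workaround suggested in the comment above (parse with the Java 8 java.time API first, then convert) can be sketched as follows. The class name is illustrative; the key fact is that java.time's ISO-8601 offset parser treats seconds as optional, unlike java.sql.Timestamp.valueOf.

```java
import java.sql.Timestamp;
import java.time.OffsetDateTime;

// Sketch of the Java 8 workaround: java.sql.Timestamp.valueOf rejects
// "2016-10-07T11:15Z", but OffsetDateTime.parse (ISO_OFFSET_DATE_TIME, where
// the seconds field is optional) accepts it; converting via Instant then
// yields a java.sql.Timestamp.
public class Iso8601Workaround {
    static Timestamp parse(String iso) {
        return Timestamp.from(OffsetDateTime.parse(iso).toInstant());
    }
}
```

Usage: `Iso8601Workaround.parse("2016-10-07T11:15Z")` produces the timestamp for 2016-10-07T11:15:00Z, whereas passing the same string to `Timestamp.valueOf` throws the IllegalArgumentException quoted earlier in the thread.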
[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-18068: Priority: Major (was: Critical) > Spark SQL doesn't parse some ISO 8601 formatted dates > - > > Key: SPARK-18068 > URL: https://issues.apache.org/jira/browse/SPARK-18068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Stephane Maarek > > The following fail, but shouldn't according to the ISO 8601 standard (seconds > can be omitted). Not sure where the issue lies (probably an external library?) > {code} > scala> sc.parallelize(Seq("2016-10-07T11:15Z")) > res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at > parallelize at :25 > scala> res1.toDF > res2: org.apache.spark.sql.DataFrame = [value: string] > scala> res2.select("value").show() > +-+ > |value| > +-+ > |2016-10-07T11:15Z| > +-+ > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> res2.select(col("value").cast(TimestampType)).show() > +-+ > |value| > +-+ > | null| > +-+ > {code} > And the schema usage errors out right away: > {code} > scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}""")) > jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at > parallelize at :33 > scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil) > schema: org.apache.spark.sql.types.StructType = > StructType(StructField(tst,TimestampType,true)) > scala> val df = spark.read.schema(schema).json(jsonRDD) > df: org.apache.spark.sql.DataFrame = [tst: timestamp] > scala> df.show() > 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) > java.lang.IllegalArgumentException: 2016-10-07T11:15Z > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown > Source) > at > 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown > Source) > at > org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown > Source) > at > javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) > at > javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) > at > javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285) > at > org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java
[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-18068: Description: The following fail, but shouldn't according to the ISO 8601 standard (seconds can be omitted). Not sure where the issue lies (probably an external library?) {code} scala> sc.parallelize(Seq("2016-10-07T11:15Z")) res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :25 scala> res1.toDF res2: org.apache.spark.sql.DataFrame = [value: string] scala> res2.select("value").show() +-+ |value| +-+ |2016-10-07T11:15Z| +-+ scala> import org.apache.spark.sql.types._ import org.apache.spark.sql.types._ scala> res2.select(col("value").cast(TimestampType)).show() +-+ |value| +-+ | null| +-+ {code} And the schema usage errors out right away: {code} scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}""")) jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at :33 scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil) schema: org.apache.spark.sql.types.StructType = StructType(StructField(tst,TimestampType,true)) scala> val df = spark.read.schema(schema).json(jsonRDD) df: org.apache.spark.sql.DataFrame = [tst: timestamp] scala> df.show() 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) java.lang.IllegalArgumentException: 2016-10-07T11:15Z at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source) at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source) at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at 
org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, localhost): java.lang.IllegalArgumentException: 2016-10-07T11:15Z at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGrego
[jira] [Updated] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
[ https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-18068: Description: The following fail, but shouldn't according to the ISO 8601 standard (seconds can be omitted). Not sure where the issue lies (probably an external library?) {code:scala} scala> sc.parallelize(Seq("2016-10-07T11:15Z")) res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :25 scala> res1.toDF res2: org.apache.spark.sql.DataFrame = [value: string] scala> res2.select("value").show() +-+ |value| +-+ |2016-10-07T11:15Z| +-+ scala> import org.apache.spark.sql.types._ import org.apache.spark.sql.types._ scala> res2.select(col("value").cast(TimestampType)).show() +-+ |value| +-+ | null| +-+ {code} And the schema usage errors out right away: {code:scala} scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}""")) jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at :33 scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil) schema: org.apache.spark.sql.types.StructType = StructType(StructField(tst,TimestampType,true)) scala> val df = spark.read.schema(schema).json(jsonRDD) df: org.apache.spark.sql.DataFrame = [tst: timestamp] scala> df.show() 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) java.lang.IllegalArgumentException: 2016-10-07T11:15Z at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source) at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source) at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at 
org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, localhost): java.lang.IllegalArgumentException: 2016-10-07T11:15Z at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source) at org.apache.xerces.jaxp.datat
[jira] [Created] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates
Stephane Maarek created SPARK-18068: --- Summary: Spark SQL doesn't parse some ISO 8601 formatted dates Key: SPARK-18068 URL: https://issues.apache.org/jira/browse/SPARK-18068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Stephane Maarek Priority: Critical The following fail, but shouldn't according to the ISO 8601 standard (seconds can be omitted). Not sure where the issue lies (probably an external library?) ``` scala> sc.parallelize(Seq("2016-10-07T11:15Z")) res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :25 scala> res1.toDF res2: org.apache.spark.sql.DataFrame = [value: string] scala> res2.select("value").show() +-+ |value| +-+ |2016-10-07T11:15Z| +-+ scala> import org.apache.spark.sql.types._ import org.apache.spark.sql.types._ scala> res2.select(col("value").cast(TimestampType)).show() +-+ |value| +-+ | null| +-+ ``` And the schema errors out right away: ``` scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}""")) jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at :33 scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil) schema: org.apache.spark.sql.types.StructType = StructType(StructField(tst,TimestampType,true)) scala> val df = spark.read.schema(schema).json(jsonRDD) df: org.apache.spark.sql.DataFrame = [tst: timestamp] scala> df.show() 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23) java.lang.IllegalArgumentException: 2016-10-07T11:15Z at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source) at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source) at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182) at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285) at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/10/24 13:06:27 WARN TaskSetManager: Lost task 6.0 in stage 5.0 (TID 23, localhost): ja
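The stack trace points at `javax.xml.bind.DatatypeConverter.parseDateTime` (called from `DateTimeUtils.stringToTime`), which enforces the XML Schema `dateTime` lexical form where seconds are mandatory, while ISO 8601 itself allows them to be omitted. As a hedged illustration of the parsing-level discrepancy (plain `java.time`, not the actual Spark code path):

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class Iso8601Demo {
    public static void main(String[] args) {
        // ISO 8601 permits omitting the seconds field; java.time accepts it.
        OffsetDateTime t = OffsetDateTime.parse("2016-10-07T11:15Z");
        System.out.println(t.getHour() + ":" + t.getMinute() + " at " + t.getOffset());

        // By contrast, XML Schema dateTime (what DatatypeConverter.parseDateTime
        // validates against) requires hh:mm:ss, so the same string is rejected:
        //   javax.xml.bind.DatatypeConverter.parseDateTime("2016-10-07T11:15Z")
        //     -> java.lang.IllegalArgumentException  (matching the trace above)
        assert t.getSecond() == 0;
        assert t.getOffset().equals(ZoneOffset.UTC);
    }
}
```

This suggests the null/exception behaviour stems from the choice of parser rather than from the timestamp being malformed.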
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420491#comment-15420491 ] Stephane Maarek commented on SPARK-11374: - Hi, Thanks for the PR. Can you also test for the footer option? Might as well solve both issues Thanks Stéphane > skip.header.line.count is ignored in HiveContext > > > Key: SPARK-11374 > URL: https://issues.apache.org/jira/browse/SPARK-11374 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Daniel Haviv > > csv table in Hive which is configured to skip the header row using > TBLPROPERTIES("skip.header.line.count"="1"). > When querying from Hive the header row is not included in the data, but when > running the same query via HiveContext I get the header row. > "show create table " via the HiveContext confirms that it is aware of the > setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244986#comment-15244986 ] Stephane Maarek commented on SPARK-14586: - Hi [~tsuresh], thanks for your reply. It makes sense! I'm using Hive 1.2.1. My only concern is that looking at the code, I understand why the number wouldn't be parsed correctly in Spark and Hive, but I don't understand why Hive 1.2.1 CLI would parse the number correctly (as seen in my troubleshooting)? Isn't Spark using the exact same logic as Hive? > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't > have a similar parsing behavior for decimals. I wouldn't say it is a bug per > se, but it looks like a necessary improvement for the two engines to > converge. 
Hive version is 1.5.1 > Not sure if relevant, but Scala does parse numbers with leading space > correctly > {code} > scala> "2.0".toDouble > res21: Double = 2.0 > scala> " 2.0".toDouble > res22: Double = 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
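The divergence described above is consistent with how the JVM itself parses the two forms: `Double.parseDouble` ignores surrounding whitespace, while `BigDecimal`'s string constructor (the natural route for a `decimal(4,2)` value) rejects it. A small Java sketch of that contrast, not the actual Hive/Spark SerDe code:

```java
import java.math.BigDecimal;

public class DecimalWhitespaceDemo {
    public static void main(String[] args) {
        // Double parsing tolerates the leading space, as the Scala snippet shows...
        double d = Double.parseDouble(" 2.0");
        assert d == 2.0;

        // ...but BigDecimal's String constructor does not allow whitespace:
        boolean rejected = false;
        try {
            new BigDecimal(" 2.0");
        } catch (NumberFormatException e) {
            rejected = true;
        }
        assert rejected;

        // Trimming first recovers the value the Hive CLI reports.
        System.out.println(new BigDecimal(" 2.0".trim()));
    }
}
```

So an engine that routes the field through a whitespace-tolerant parser sees `2`, and one that uses the strict `BigDecimal` constructor sees `null` — matching the Hive-vs-Spark behaviour in the report.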
[jira] [Updated] (SPARK-14583) SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14583: Summary: SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions (was: SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions) > SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when > Hive Table has partitions > --- > > Key: SPARK-14583 > URL: https://issues.apache.org/jira/browse/SPARK-14583 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > it seems that Spark forgets or fails to read the metadata tblproperties after > a MSCK REPAIR is issued from within HIVE > Here are the steps to reproduce: > create test_data.csv with the following content: > {code:none} > a,2 > ,3 > {code} > move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/ > run the following hive statements: > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv; > CREATE EXTERNAL TABLE `spark_testing.test_csv`( > column_1 varchar(10), > column_2 int) > PARTITIONED BY ( > `part_a` string, > `part_b` string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing' > TBLPROPERTIES('serialization.null.format'=''); > MSCK REPAIR TABLE spark_testing.test_csv; > select * from spark_testing.test_csv; > OK > a 2 a b > NULL3 a b > {code} > (you can see the NULL) > now onto Spark: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv").show() > +++--+--+ > |column_1|column_2|part_a|part_b| > +++--+--+ > | a| 2| a| b| > || 3| a| b| > +++--+--+ > {code} > As you can see, SPARK can't detect the null. > I don't know if it affects future versions of SPARK and I can't test it in my > company's environment. 
Steps are easy to reproduce though so can be tested in > other environments. My Hive version is 1.2.1. > Let me know if you have any questions. To me that's a big issue because data > isn't read correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
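For context, `serialization.null.format` declares which literal string the SerDe should read back as SQL NULL — here the empty string. The expected mapping can be sketched as follows (the `applyNullFormat` helper is hypothetical, for illustration only, not Hive code):

```java
import java.util.Arrays;

public class NullFormatDemo {
    // Replace any field equal to the declared null format with Java null,
    // mirroring what TBLPROPERTIES('serialization.null.format'='') requests.
    static String[] applyNullFormat(String[] fields, String nullFormat) {
        String[] out = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            out[i] = fields[i].equals(nullFormat) ? null : fields[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // The row ",3" splits into ["", "3"]; the empty field should become NULL.
        String[] row = ",3".split(",", -1);
        System.out.println(Arrays.toString(applyNullFormat(row, "")));
    }
}
```

The bug report is that this substitution is applied by Hive but silently skipped by Spark's HiveContext once the table is partitioned, so the empty string comes back verbatim.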
[jira] [Updated] (SPARK-14583) SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14583: Summary: SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions (was: SparkSQL doesn't read hive table properly after MSCK REPAIR) > SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive > Table has partitions > -- > > Key: SPARK-14583 > URL: https://issues.apache.org/jira/browse/SPARK-14583 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > it seems that Spark forgets or fails to read the metadata tblproperties after > a MSCK REPAIR is issued from within HIVE > Here are the steps to reproduce: > create test_data.csv with the following content: > {code:none} > a,2 > ,3 > {code} > move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/ > run the following hive statements: > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv; > CREATE EXTERNAL TABLE `spark_testing.test_csv`( > column_1 varchar(10), > column_2 int) > PARTITIONED BY ( > `part_a` string, > `part_b` string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing' > TBLPROPERTIES('serialization.null.format'=''); > MSCK REPAIR TABLE spark_testing.test_csv; > select * from spark_testing.test_csv; > OK > a 2 a b > NULL3 a b > {code} > (you can see the NULL) > now onto Spark: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv").show() > +++--+--+ > |column_1|column_2|part_a|part_b| > +++--+--+ > | a| 2| a| b| > || 3| a| b| > +++--+--+ > {code} > As you can see, SPARK can't detect the null. > I don't know if it affects future versions of SPARK and I can't test it in my > company's environment. 
Steps are easy to reproduce though so can be tested in > other environments. My Hive version is 1.2.1. > Let me know if you have any questions. To me that's a big issue because data > isn't read correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242132#comment-15242132 ] Stephane Maarek commented on SPARK-12741: - Hi Sean, What do you mean by the behavior on master? Do you want me to run the query on something different than spark-shell or spark-shell --master yarn --deploy-mode client ? Sorry I'm just starting with these kind of bugs reports and I don't have the expertise to dive down in the Spark code. Thanks for working with me through that Regards, Stephane > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240264#comment-15240264 ] Stephane Maarek commented on SPARK-12741: - Hi Sean, Not sure what you mean... In my SO: when I run against just a spark shell, it works. When I run against YARN with spark-shell --master yarn --deploy-mode client, the outcome of (firstCount, secondCount) is never the same twice. Although, the more I run the count under the same session, eventually it converges to the right result. When I run just a spark-shell, I get the right result right away. I can't share any data I'm sorry... the test case would literally be firstCount = 2395 It also seems that behavior happens on smaller datasets... but I can't tell for sure. I'm sorry I can't help more... If you have detailed steps you'd like me to run or tests you'd like me to run, let me know Regards Stephane > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. 
> {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238723#comment-15238723 ] Stephane Maarek commented on SPARK-12741: - can we please re-open the issue? > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238689#comment-15238689 ]

Stephane Maarek commented on SPARK-11374:
-----------------------------------------

any updates on this? Just some log:

{code}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='', "skip.header.line.count"="1");

hive> select * from spark_testing.test_csv_2;
OK
NULL    3
{code}

spark:

{code}
scala> sqlContext.sql("select * from spark_testing.test_csv_2").show()
+--------+--------+
|column_1|column_2|
+--------+--------+
|       a|    null|
|    null|    3.00|
+--------+--------+
{code}

That's a big problem

> skip.header.line.count is ignored in HiveContext
> ------------------------------------------------
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Daniel Haviv
>
> A csv table in Hive is configured to skip the header row using
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive, the header row is not included in the data, but when
> running the same query via HiveContext I get the header row.
> "show create table" via the HiveContext confirms that it is aware of the
> setting.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
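Since HiveContext ignored the skip.header.line.count table property at the time, the practical workaround was to drop the header row yourself before parsing. A minimal non-Spark sketch of that pattern, using Python's csv module on hypothetical file contents:

```python
# Minimal non-Spark sketch of the usual workaround when a reader ignores
# skip.header.line.count: consume the first (header) line yourself, then
# parse only the remaining data rows.
import csv
import io

raw = "column_1,column_2\na,2.00\nb,3.00\n"  # hypothetical file contents
reader = csv.reader(io.StringIO(raw))
header = next(reader)   # skip the header row explicitly
rows = list(reader)     # only the data rows remain

assert header == ["column_1", "column_2"]
assert rows == [["a", "2.00"], ["b", "3.00"]]
```

In Spark terms the same idea was typically done by filtering out the header line (or the first line of each file) after reading, rather than relying on the table property.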
[jira] [Issue Comment Deleted] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-11374: Comment: was deleted (was: I may add that more metadata isn't processed, namely TBLPROPERTIES ('serialization.null.format'='') Also, another issue (may still be related to Spark not reading Hive Metadata or not properly using Hive), but if you create a csv with the following (spaces intended) 1, 2,3 4, 5,6 use Hive as this: CREATE EXTERNAL TABLE `my_table`( `c1` DECIMAL, `c2` DECIMAL, `c3` DECIMAL) ... etc select * from my_table will return in Hive 1,2,3 4,5,6 But using a hive context, in Spark 1,null,3 4,null,6) > skip.header.line.count is ignored in HiveContext > > > Key: SPARK-11374 > URL: https://issues.apache.org/jira/browse/SPARK-11374 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Daniel Haviv > > csv table in Hive which is configured to skip the header row using > TBLPROPERTIES("skip.header.line.count"="1"). > When querying from Hive the header row is not included in the data, but when > running the same query via HiveContext I get the header row. > "show create table " via the HiveContext confirms that it is aware of the > setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14586: Description: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 Not sure if relevant, but Scala does parse numbers with leading space correctly {code} scala> "2.0".toDouble res21: Double = 2.0 scala> " 2.0".toDouble res22: Double = 2.0 {code} was: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 Not sure if relevant, but Scala does parse numbers with leading space correctly {code} scala> "2.0".toDouble res21: Double = 2.0 scala> " 2.0".toDouble res22: Double = 2.0 {code} > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't > have a similar parsing behavior for decimals. I wouldn't say it is a bug per > se, but it looks like a necessary improvement for the two engines to > converge. Hive version is 1.5.1 > Not sure if relevant, but Scala does parse numbers with leading space > correctly > {code} > scala> "2.0".toDouble > res21: Double = 2.0 > scala> " 2.0".toDouble > res22: Double = 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
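The divergence described above comes down to whether a decimal parser trims surrounding whitespace before parsing, as Hive does and Spark (at the reported version) did not. Like the Scala `" 2.0".toDouble` shown in the report, Python's numeric parsers also accept leading whitespace, which may make the contrast easier to see:

```python
# Like Scala's `" 2.0".toDouble` from the report, Python's float() and
# Decimal() both strip leading/trailing whitespace before parsing.
from decimal import Decimal

assert float(" 2.0") == 2.0
assert Decimal(" 2.0") == Decimal("2.0")

# A defensive fix that works regardless of the engine's behavior is to
# trim fields explicitly before they reach the decimal parser:
assert " 2.0".strip() == "2.0"
```

So a reader that null-ifies " 2.0" is the outlier among these parsers, which supports the report's point that the two engines should converge.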
[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14586: Description: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 Not sure if relevant, but Scala does parse numbers with leading space correctly scala> "2.0".toDouble res21: Double = 2.0 scala> " 2.0".toDouble res22: Double = 2.0 was: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a > similar parsing behavior for decimals. I wouldn't say it is a bug per se, but > it looks like a necessary improvement for the two engines to converge. Hive > version is 1.5.1 > Not sure if relevant, but Scala does parse numbers with leading space > correctly > scala> "2.0".toDouble > res21: Double = 2.0 > scala> " 2.0".toDouble > res22: Double = 2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14586: Description: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 Not sure if relevant, but Scala does parse numbers with leading space correctly {code} scala> "2.0".toDouble res21: Double = 2.0 scala> " 2.0".toDouble res22: Double = 2.0 {code} was: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 Not sure if relevant, but Scala does parse numbers with leading space correctly scala> "2.0".toDouble res21: Double = 2.0 scala> " 2.0".toDouble res22: Double = 2.0 > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a > similar parsing behavior for decimals. I wouldn't say it is a bug per se, but > it looks like a necessary improvement for the two engines to converge. Hive > version is 1.5.1 > Not sure if relevant, but Scala does parse numbers with leading space > correctly > {code} > scala> "2.0".toDouble > res21: Double = 2.0 > scala> " 2.0".toDouble > res22: Double = 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14586: Description: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1 was: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. 
I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a > similar parsing behavior for decimals. I wouldn't say it is a bug per se, but > it looks like a necessary improvement for the two engines to converge. Hive > version is 1.5.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14583) SparkSQL doesn't read hive table properly after MSCK REPAIR
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephane Maarek updated SPARK-14583:
------------------------------------
Summary: SparkSQL doesn't read hive table properly after MSCK REPAIR (was: Spark doesn't read hive table properly after MSCK REPAIR)

> SparkSQL doesn't read hive table properly after MSCK REPAIR
> -----------------------------------------------------------
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.5.1
> Reporter: Stephane Maarek
>
> It seems that Spark forgets or fails to read the tblproperties metadata after
> an MSCK REPAIR is issued from within Hive.
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a       2       a       b
> NULL    3       a       b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +--------+--------+------+------+
> |column_1|column_2|part_a|part_b|
> +--------+--------+------+------+
> |       a|       2|     a|     b|
> |        |       3|     a|     b|
> +--------+--------+------+------+
> {code}
> As you can see, Spark can't detect the null.
> I don't know if it affects future versions of Spark and I can't test it in my
> company's environment. The steps are easy to reproduce, though, so this can be
> tested in other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data
> isn't read correctly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
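The property at the center of this report, serialization.null.format='', simply tells a reader to treat empty fields as NULL when deserializing delimited text. A minimal sketch of that mapping outside Spark, on the same two-row input as the report:

```python
# Sketch of what serialization.null.format='' asks a reader to do:
# map empty fields to NULL (None) while parsing delimited text.
import csv
import io

raw = "a,2\n,3\n"  # second row has an empty first field, as in the report
reader = csv.reader(io.StringIO(raw))
rows = [[None if field == "" else field for field in row] for row in reader]

assert rows == [["a", "2"], [None, "3"]]
```

The bug as reported is that after MSCK REPAIR, Spark's HiveContext stopped applying this mapping and surfaced the empty string instead of NULL, while Hive itself kept honoring it.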
[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14586: Description: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. 
I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge was: create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. 
I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL3 > {code} > As you can see, the value " 2" gets parsed correctly to 2 > Now onto Spark-shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +++ > |column_1|column_2| > +++ > | a|null| > |null|3.00| > +++ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a > similar parsing behavior for decimals. I wouldn't say it is a bug per se, but > it looks like a necessary improvement for the two engines to converge -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
Stephane Maarek created SPARK-14586: --- Summary: SparkSQL doesn't parse decimal like Hive Key: SPARK-14586 URL: https://issues.apache.org/jira/browse/SPARK-14586 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Stephane Maarek create a test_data.csv with the following {code:none} a, 2.0 ,3.0 {code} (the space is intended before the 2) copy the test_data.csv to hdfs:///spark_testing_2 go in hive, run the following statements CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv_2; CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( column_1 varchar(10), column_2 decimal(4,2)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing_2' TBLPROPERTIES('serialization.null.format'=''); select * from spark_testing.test_csv_2; OK a 2 NULL3 {code} As you can see, the value " 2" gets parsed correctly to 2 Now onto Spark-shell: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv_2").show() +++ |column_1|column_2| +++ | a|null| |null|3.00| +++ {code} As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a similar parsing behavior for decimals. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14583: Description: it seems that Spark forgets or fails to read the metadata tblproperties after a MSCK REPAIR is issued from within HIVE Here are the steps to reproduce: create test_data.csv with the following content: {code:none} a,2 ,3 {code} move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/ run the following hive statements: {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv; CREATE EXTERNAL TABLE `spark_testing.test_csv`( column_1 varchar(10), column_2 int) PARTITIONED BY ( `part_a` string, `part_b` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing' TBLPROPERTIES('serialization.null.format'=''); MSCK REPAIR TABLE spark_testing.test_csv; select * from spark_testing.test_csv; OK a 2 a b NULL3 a b {code} (you can see the NULL) now onto Spark: {code:java} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv").show() +++--+--+ |column_1|column_2|part_a|part_b| +++--+--+ | a| 2| a| b| || 3| a| b| +++--+--+ {code} As you can see, SPARK can't detect the null. I don't know if it affects future versions of SPARK and I can't test it in my company's environment. Steps are easy to reproduce though so can be tested in other environments. My hive version is 1.2.1 Let me know if you have any questions. To me that's a big issue because data isn't read correctly. 
was: it seems that Spark forgets or fails to read the metadata tblproperties after a MSCK REPAIR is issued from within HIVE Here are the steps to reproduce: create test_data.csv with the following content: {code:csv} a,2 ,3 {code} move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/ run the following hive statements: {code:sql} CREATE SCHEMA IF NOT EXISTS spark_testing; DROP TABLE IF EXISTS spark_testing.test_csv; CREATE EXTERNAL TABLE `spark_testing.test_csv`( column_1 varchar(10), column_2 int) PARTITIONED BY ( `part_a` string, `part_b` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/spark_testing' TBLPROPERTIES('serialization.null.format'=''); MSCK REPAIR TABLE spark_testing.test_csv; select * from spark_testing.test_csv; OK a 2 a b NULL3 a b {code} (you can see the NULL) now onto Spark: {code:scala} val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("select * from spark_testing.test_csv").show() +++--+--+ |column_1|column_2|part_a|part_b| +++--+--+ | a| 2| a| b| || 3| a| b| +++--+--+ {code} As you can see, SPARK can't detect the null. I don't know if it affects future versions of SPARK and I can't test it in my company's environment. Steps are easy to reproduce though so can be tested in other environments. My hive version is 1.2.1 Let me know if you have any questions. To me that's a big issue because data isn't read correctly. 
> Spark doesn't read hive table properly after MSCK REPAIR > > > Key: SPARK-14583 > URL: https://issues.apache.org/jira/browse/SPARK-14583 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > it seems that Spark forgets or fails to read the metadata tblproperties after > a MSCK REPAIR is issued from within HIVE > Here are the steps to reproduce: > create test_data.csv with the following content: > {code:none} > a,2 > ,3 > {code} > move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/ > run the following hive statements: > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv; > CREATE EXTERNAL TABLE `spark_testing.test_csv`( > column_1 varchar(10), > column_2 int) > PARTITIONED BY ( > `part_a` string, > `part_b` string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing' > TBLPROPERTIES('serialization.null.format'=''); > MSCK REPAIR TABLE spark_testing.test_csv; > select * from spark_testing.test_csv; > OK > a 2 a b > NULL3 a b > {code} > (you can see the NULL) > now onto Spark: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv").show() > +++--+--+ > |column_1|column_2|part_a|part_b| > ++-
[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephane Maarek updated SPARK-14583:
------------------------------------
[jira] [Commented] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238352#comment-15238352 ] Stephane Maarek commented on SPARK-14583:
-----------------------------------------
The behavior is pretty much the same if, instead of MSCK REPAIR, we run:
{code:sql}
ALTER TABLE spark_testing.test_csv ADD PARTITION (part_a="a", part_b="b");
{code}
This makes me believe it's the partitioning that makes Spark fail.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR
Stephane Maarek created SPARK-14583:
---------------------------------------

             Summary: Spark doesn't read hive table properly after MSCK REPAIR
                 Key: SPARK-14583
                 URL: https://issues.apache.org/jira/browse/SPARK-14583
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 1.5.1
            Reporter: Stephane Maarek
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229401#comment-15229401 ] Stephane Maarek commented on SPARK-11374:
-----------------------------------------
I may add that more metadata isn't processed, namely TBLPROPERTIES ('serialization.null.format'='').

Also, another issue (which may still be related to Spark not reading, or not properly using, the Hive metadata): create a csv with the following (spaces intended)
{code:csv}
1, 2,3
4, 5,6
{code}
and declare it in Hive like this:
{code:sql}
CREATE EXTERNAL TABLE `my_table`(
  `c1` DECIMAL,
  `c2` DECIMAL,
  `c3` DECIMAL)
... etc
{code}
Then "select * from my_table" will return in Hive:
{code:none}
1,2,3
4,5,6
{code}
But using a HiveContext in Spark:
{code:none}
1,null,3
4,null,6
{code}

> skip.header.line.count is ignored in HiveContext
> -------------------------------------------------
>
>                 Key: SPARK-11374
>                 URL: https://issues.apache.org/jira/browse/SPARK-11374
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Daniel Haviv
>
> csv table in Hive which is configured to skip the header row using
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the
> setting.
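A plausible reading of the DECIMAL difference above is that Hive trims the space-padded fields before parsing while Spark's reader did not. This plain-Python sketch (`to_decimal` is a hypothetical helper, not a Spark or Hive API) reproduces both outcomes under that assumption:

```python
from decimal import Decimal, InvalidOperation

def to_decimal(field, trim=True):
    """Hypothetical illustration: parse one delimited field as DECIMAL.

    trim=True mirrors Hive's lenient handling of the padded fields above;
    trim=False mimics a stricter reader that rejects untrimmed input,
    which would explain the nulls Spark returned."""
    if not trim and field != field.strip():
        return None  # strict mode: a space-padded field fails to parse
    try:
        return Decimal(field.strip())
    except InvalidOperation:
        return None

row = "1, 2,3".split(",")                           # fields: '1', ' 2', '3'
lenient = [to_decimal(f) for f in row]              # like Hive: 1, 2, 3
strict = [to_decimal(f, trim=False) for f in row]   # like Spark: 1, null, 3
```

Under this assumption the middle column is exactly the one that comes back null, matching the reported output.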
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227688#comment-15227688 ] Stephane Maarek commented on SPARK-12741:
-----------------------------------------
Hi,
This may be related to: http://stackoverflow.com/questions/36438898/spark-dataframe-count-doesnt-return-the-same-results-when-run-twice
I don't have code to generate the input file; it's just a simple hive table though.
Cheers,
Stephane

> DataFrame count method returns wrong size.
> -------------------------------------------
>
>                 Key: SPARK-12741
>                 URL: https://issues.apache.org/jira/browse/SPARK-12741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (used to be 1.5.0). I have a DataFrame and I
> have 2 methods, one for collecting data and the other for counting.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have a few scenarios with different results:
> 1) No data exists on my NoSQLDatabase: count 0 and collect 0
> 2) 3 rows exist: count 0 and collect 3
> 3) 5 rows exist: count 2 and collect 5
> I tried to change the count code to the below code, but got the same results
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324349#comment-14324349 ] Stephane Maarek commented on SPARK-5480:
----------------------------------------
Hi Sean,
We have included the following code before and after the graph gets created:
{code:scala}
println(s"Vertices count ${vertices.count}")
println(s"Edges count ${edges.count}")

// create the graph, making sure we default to a defaultArticle when we have
// a missing relation (prevents nulls)
val defaultArticle = ("Missing", None, List.empty, None)
val graph = Graph(vertices, edges, defaultArticle).cache

println(s"After graph: Vertices count ${graph.vertices.count}")
println(s"After graph: Edges count ${graph.edges.count}")
{code}
What we see on multiple runs with the exact same configuration is that the vertex and edge counts before the graph is created are always the same:
Vertices count: 192190
Edges count: 4582582

(trial one - generated the error)
After graph: Vertices count: 2450854
After graph: Edges count: 4188635

(trial two - terminated correctly)
After graph: Vertices count: 2450854
After graph: Edges count: 4582582

(trial three - generated the error)
After graph: Vertices count: 2450854
After graph: Edges count: 4000218

As we can replicate the issue, please let us know if we should add any code to help you debug. Our code is deterministic, so before creating the graph we always see the same output. What's odd is that after creating the graph, the vertices count is constant, but the edges count varies.

> GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
> -----------------------------------------------------------
>
>                 Key: SPARK-5480
>                 URL: https://issues.apache.org/jira/browse/SPARK-5480
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.2.0
>        Environment: Yarn client
>            Reporter: Stephane Maarek
>
> Running the following code:
> val subgraph = graph.subgraph (
>   vpred = (id, article) => // working predicate
> ).cache()
> println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")
> val prGraph = subgraph.staticPageRank(5).cache
> val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
>   (v, title, rank) => (rank.getOrElse(0.0), title)
> }
> titleAndPrGraph.vertices.top(13) {
>   Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
> }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))
> Returns a graph with 5000 nodes and 4000 edges.
> Then it crashes during the PageRank with the following:
> 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
> 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
> 	at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
> 	at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
> 	at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
> 	at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 	at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
> 	at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
> 	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> 	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.s
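The pattern in the comment above — a deterministic input, yet an edge count that differs between runs while the vertex count stays fixed — is what one would expect if some partitions were nondeterministically recomputed rather than served from cache. This toy Python sketch (not Spark code; `LazyDataset` and the dropped-record counts are made up for illustration) shows how lazy recomputation plus any nondeterminism yields varying counts, while a materialized cache stays stable:

```python
import random

class LazyDataset:
    """Toy stand-in for an uncached RDD (illustrative only, not Spark code).

    Every action recomputes the data from its compute function; cache()
    materializes it once. If recomputation is not fully deterministic
    (e.g. partitions rebuilt after an executor loss), repeated counts on an
    uncached dataset can disagree, while a cached one stays constant."""

    def __init__(self, compute):
        self._compute = compute
        self._materialized = None

    def cache(self):
        self._materialized = list(self._compute())
        return self

    def count(self):
        if self._materialized is not None:
            return len(self._materialized)
        return len(list(self._compute()))  # recomputed on every action

def flaky_compute():
    # Simulated nondeterministic recomputation: a random number of records
    # is lost each time the data is rebuilt (numbers here are made up).
    return range(100_000 - random.randint(0, 10_000))

edges = LazyDataset(flaky_compute)                 # uncached: count may vary
cached_edges = LazyDataset(flaky_compute).cache()  # count is stable
```

This is only a hypothesis about the mechanism, not a claim about where the GraphX bug actually lives.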
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307628#comment-14307628 ] Stephane Maarek commented on SPARK-5480:
----------------------------------------
It happened once after one of my servers failed, but the graph vertices and edges counts did work. It doesn't happen systematically; I'm having issues reproducing it.
{code:scala}
val subgraph = graph.subgraph (
  vpred = (id, article) =>
    article._1.toLowerCase.contains(stringToSearchFor) ||
    article._3.exists(keyword => keyword.contains(stringToSearchFor)) ||
    (article._2 match {
      case None => false
      case Some(articleAbstract) => articleAbstract.toLowerCase.contains(stringToSearchFor)
    })
).cache()
{code}
[jira] [Created] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
Stephane Maarek created SPARK-5480:
--------------------------------------

             Summary: GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
                 Key: SPARK-5480
                 URL: https://issues.apache.org/jira/browse/SPARK-5480
             Project: Spark
          Issue Type: Bug
          Components: GraphX
    Affects Versions: 1.2.0
         Environment: Yarn client
            Reporter: Stephane Maarek

Running the following code:
{code:scala}
val subgraph = graph.subgraph (
  vpred = (id, article) => // working predicate
).cache()

println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")

val prGraph = subgraph.staticPageRank(5).cache

val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
  (v, title, rank) => (rank.getOrElse(0.0), title)
}

titleAndPrGraph.vertices.top(13) {
  Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
}.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1))
{code}
Returns a graph with 5000 nodes and 4000 edges.
Then it crashes during the PageRank with the following:
{code:none}
15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
	at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
	at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
	at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
	at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
	at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{code}