[jira] [Updated] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pin_zhang updated SPARK-17932:
Description:
The SQL statement "show table extended like table_name" works in Spark 1.5.2 but fails in Spark 2.0.0 with:
{noformat}
org.apache.spark.sql.catalyst.parser.ParseException:
missing 'FUNCTIONS' at 'extended'(line 1, pos 11)

== SQL ==
show table extended like test
---^^^ (state=,code=0)
{noformat}
was:
SQL "show table extended like table_name" works in Spark 1.5.2 but doesn't work in Spark 2.0.0.

> Failed to run SQL "show table extended like table_name" in Spark 2.0.0
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: pin_zhang
>
> SQL "show table extended like table_name" works in Spark 1.5.2 but fails in Spark 2.0.0 with:
> org.apache.spark.sql.catalyst.parser.ParseException:
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended like test
> ---^^^ (state=,code=0)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces
[ https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17925.
Resolution: Fixed
Fix Version/s: 2.1.0
Issue resolved by pull request 15473 [https://github.com/apache/spark/pull/15473]

> Break fileSourceInterfaces.scala into multiple pieces
> Key: SPARK-17925
> URL: https://issues.apache.org/jira/browse/SPARK-17925
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 2.1.0
>
> fileSourceInterfaces.scala has grown fairly large, making it difficult to
> understand. To facilitate code review, I'd submit a separate patch to break
> this into multiple pieces first.
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574339#comment-15574339 ] Chirag Vaya commented on SPARK-10063:
[~mkim] Can you please tell us in what environment (standalone Spark on a single node, multiple nodes, or AWS EMR) you were using the direct output committer? According to [~rxin], any environment that can have a network partition (e.g. AWS EMR) would lead to inconsistencies. Please correct me if I am wrong on this.

> Remove DirectParquetOutputCommitter
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yin Huai
> Assignee: Reynold Xin
> Priority: Critical
> Fix For: 2.0.0
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled,
> there is a chance that we can lose data.
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
>     if (attemptNumber > 0) {
>       Thread.sleep(15000)
>       throw new Exception("new exception")
>     } else {
>       Thread.sleep(1)
>     }
>   }
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output
> file to S3, then the speculative task somehow fails. Because we have to call
> the output stream's close method, which uploads data to S3, we actually upload
> the partial result generated by the failed speculative task to S3, and this
> file overwrites the correct file generated by the original task.
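[Editor's note] A minimal, self-contained Python sketch (not Spark code; the file name is hypothetical) of the failure mode described in SPARK-10063: a direct output committer writes straight to the final path, so closing the stream of a failed speculative attempt can still flush a partial file over the original task's complete output.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "part-00000")

# Original attempt finishes first and "commits" its complete output.
with open(out_path, "w") as f:
    f.write("\n".join(str(i) for i in range(1, 6)) + "\n")

# Speculative attempt dies after producing 4 of 5 rows, but close() still
# flushes (in the S3 case: uploads) whatever it has -- last writer wins.
with open(out_path, "w") as f:
    f.write("\n".join(str(i) for i in range(1, 5)) + "\n")

final_rows = open(out_path).read().splitlines()
# Row "5" is silently missing, mirroring the count of 99 instead of 100.
```

A committer that writes attempts to a staging location and promotes only the successful attempt avoids this race.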
[jira] [Created] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
pin_zhang created SPARK-17932:
Summary: Failed to run SQL "show table extended like table_name" in Spark 2.0.0
Key: SPARK-17932
URL: https://issues.apache.org/jira/browse/SPARK-17932
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: pin_zhang

SQL "show table extended like table_name" works in Spark 1.5.2 but doesn't work in Spark 2.0.0.
[jira] [Commented] (SPARK-17884) In the cast expression, casting from empty string to interval type throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-17884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574305#comment-15574305 ] Apache Spark commented on SPARK-17884:
User 'priyankagargnitk' has created a pull request for this issue: https://github.com/apache/spark/pull/15479

> In the cast expression, casting from empty string to interval type throws NullPointerException
> Key: SPARK-17884
> URL: https://issues.apache.org/jira/browse/SPARK-17884
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Priyanka Garg
> Assignee: Priyanka Garg
> Fix For: 2.0.2, 2.1.0
>
> When the cast expression is applied to the empty string "" to cast it to interval
> type, it throws a NullPointerException.
> I get the same exception when reproducing it through the test case:
> checkEvaluation(Cast(Literal(""), CalendarIntervalType), null)
> The exception is:
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:254)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvalutionWithUnsafeProjection(ExpressionEvalHelper.scala:181)
> at org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvalutionWithUnsafeProjection(CastSuite.scala:33)
> at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper$class.checkEvaluation(ExpressionEvalHelper.scala:64)
> at org.apache.spark.sql.catalyst.expressions.CastSuite.checkEvaluation(CastSuite.scala:33)
> at org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply$mcV$sp(CastSuite.scala:770)
> at
org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767) > at > org.apache.spark.sql.catalyst.expressions.CastSuite$$anonfun$22.apply(CastSuite.scala:767) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > 
at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeA
[jira] [Updated] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data
[ https://issues.apache.org/jira/browse/SPARK-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-17254:
Attachment: stop-after-physical-plan.pdf

> Filter operator should have “stop if false” semantics for sorted data
> Key: SPARK-17254
> URL: https://issues.apache.org/jira/browse/SPARK-17254
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Tejas Patil
> Priority: Minor
> Attachments: stop-after-physical-plan.pdf
>
> From https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
> Filter on sorted data
> If the data is sorted by a key, filters on the key could stop as soon as the
> data is out of range. For example, WHERE ticker_id < “F” should stop as soon
> as the first row starting with “F” is seen. This can be done by adding a Filter
> operator that has “stop if false” semantics. This is generally useful.
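[Editor's note] A hypothetical Python sketch (not Spark code) of the "stop if false" semantics proposed above: on data sorted by the filter key, a range predicate can terminate the scan at the first failing row instead of evaluating every row.

```python
from itertools import takewhile

sorted_tickers = ["AAPL", "AMZN", "CSCO", "FB", "GOOG", "MSFT"]

scanned = []  # records how many rows the scan actually touched

def before_f(ticker):
    scanned.append(ticker)
    return ticker < "F"   # the predicate from WHERE ticker_id < 'F'

# takewhile stops at the first row failing the predicate, unlike filter(),
# which would evaluate all 6 rows.
result = list(takewhile(before_f, sorted_tickers))
# result keeps 3 rows; the scan stops at "FB" and never touches the last 2.
```

The physical-plan version in the attached PDF generalizes this to a Filter operator that signals "no more matches" upstream.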
[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables
[ https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574251#comment-15574251 ] Dongjoon Hyun commented on SPARK-16632:
This was backported at the following commit. https://github.com/apache/spark/commit/f9367d6

> Vectorized parquet reader fails to read certain fields from Hive tables
> Key: SPARK-16632
> URL: https://issues.apache.org/jira/browse/SPARK-16632
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: Hive 1.1 (CDH)
> Reporter: Marcelo Vanzin
> Assignee: Marcelo Vanzin
> Fix For: 2.0.1, 2.1.0
>
> The vectorized parquet reader fails to read certain tables created by Hive.
> When the tables have type "tinyint" or "smallint", Catalyst converts those to
> "ByteType" and "ShortType" respectively. But when Hive writes those tables in
> parquet format, the parquet schema in the files contains "int32" fields.
> To reproduce, run these commands in the hive shell (or beeline):
> {code}
> create table abyte (value tinyint) stored as parquet;
> create table ashort (value smallint) stored as parquet;
> insert into abyte values (1);
> insert into ashort values (1);
> {code}
> Then query them with Spark 2.0:
> {code}
> spark.sql("select * from abyte").show();
> spark.sql("select * from ashort").show();
> {code}
> You'll see this exception (for the byte case):
> {noformat}
> 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {noformat} > This works when you point Spark directly at the files (instead of using the > metastore data), or when you disable the vectorized parquet reader. > The root cause seems to be that Hive creates these tables with a > not-so-complete schema: > {noformat} > $ parquet-tools schema /tmp/byte.parquet > message hive_schema { > optional int32 value; > } > {noformat} > There's no indication that the field is a 32-bit field used to store 8-bit > values. When the ParquetReadSupport code tries to consolidate both schemas, > it just chooses whatever is in the parquet file for primitive types (see > ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst > schema, which comes from the Hive metastore, and says it's a byte field, so > when it tries to read the data, the byte data stored in "OnHeapColumnVector" > is null. > I have tested a small chang
[jira] [Commented] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables
[ https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574253#comment-15574253 ] Dongjoon Hyun commented on SPARK-16632:
{code}
spark-2.0:branch-2.0$ git log --oneline | grep SPARK-16632
933d76a [SPARK-16632][SQL] Revert PR #14272: Respect Hive schema when merging parquet schema
f9367d6 [SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization
c2b5b3c [SPARK-16632][SQL] Respect Hive schema when merging parquet schema.
{code}
[jira] [Updated] (SPARK-16632) Vectorized parquet reader fails to read certain fields from Hive tables
[ https://issues.apache.org/jira/browse/SPARK-16632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16632: -- Fix Version/s: 2.0.1 > Vectorized parquet reader fails to read certain fields from Hive tables > --- > > Key: SPARK-16632 > URL: https://issues.apache.org/jira/browse/SPARK-16632 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Hive 1.1 (CDH) >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.0.1, 2.1.0 > > > The vectorized parquet reader fails to read certain tables created by Hive. > When the tables have type "tinyint" or "smallint", Catalyst converts those to > "ByteType" and "ShortType" respectively. But when Hive writes those tables in > parquet format, the parquet schema in the files contains "int32" fields. > To reproduce, run these commands in the hive shell (or beeline): > {code} > create table abyte (value tinyint) stored as parquet; > create table ashort (value smallint) stored as parquet; > insert into abyte values (1); > insert into ashort values (1); > {code} > Then query them with Spark 2.0: > {code} > spark.sql("select * from abyte").show(); > spark.sql("select * from ashort").show(); > {code} > You'll see this exception (for the byte case): > {noformat} > 16/07/13 12:24:23 ERROR datasources.InsertIntoHadoopFsRelationCommand: > Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 3, scm-centos71-iqalat-2.gce.cloudera.com): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getByte(OnHeapColumnVector.java:159) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {noformat} > This works when you point Spark directly at the files (instead of using the > metastore data), or when you disable the vectorized parquet reader. > The root cause seems to be that Hive creates these tables with a > not-so-complete schema: > {noformat} > $ parquet-tools schema /tmp/byte.parquet > message hive_schema { > optional int32 value; > } > {noformat} > There's no indication that the field is a 32-bit field used to store 8-bit > values. When the ParquetReadSupport code tries to consolidate both schemas, > it just chooses whatever is in the parquet file for primitive types (see > ParquetReadSupport.clipParquetType); the vectorized reader uses the catalyst > schema, which comes from the Hive metastore, and says it's a byte field, so > when it tries to read the data, the byte data stored in "OnHeapColumnVector" > is null. > I have tested a small change to {{ParquetReadSupport.clipParquetType}} that > fixes this particular issue, but I haven't run any other tests, so I'll do
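[Editor's note] A rough, self-contained illustration in plain Python (no Parquet involved) of the schema mismatch in SPARK-16632: the file physically stores int32 values while the catalyst schema promises 1-byte values. A reader must decode according to the physical type first and then narrow; treating the buffer naively as 1-byte values scrambles the layout.

```python
import struct

# Hive wrote a tinyint column as parquet int32: each value occupies 4 bytes.
int32_buffer = b"".join(struct.pack("<i", v) for v in [1, 2, 3])

# Correct: decode as int32 (the physical type), then narrow to small ints.
narrowed = [v for (v,) in struct.iter_unpack("<i", int32_buffer)]

# Wrong: read the same buffer as if it physically held 1-byte values --
# every real value is followed by three spurious zero "values".
misread = list(struct.unpack("<12b", int32_buffer))
```

This mirrors why the vectorized reader, guided only by the catalyst ByteType schema, sees garbage (nulls) where the int32-encoded data lives.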
[jira] [Resolved] (SPARK-17927) Remove dead code in WriterContainer
[ https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17927.
Resolution: Fixed
Fix Version/s: 2.1.0
Issue resolved by pull request 15477 [https://github.com/apache/spark/pull/15477]

> Remove dead code in WriterContainer
> Key: SPARK-17927
> URL: https://issues.apache.org/jira/browse/SPARK-17927
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 2.1.0
>
> speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code.
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574137#comment-15574137 ] Felix Cheung commented on SPARK-17781:
Hmm.. I'm not quite sure what it is just yet - not seeing that in just R.

> datetime is serialized as double inside dapply()
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.0.0
> Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}
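[Editor's note] A rough Python analogue (not SparkR code, an assumed illustration) of the behavior reported above: an R Date is stored as a double counting days since 1970-01-01 plus a "Date" class attribute. If serialization preserves only the numeric payload, the worker sees a plain double, and the date semantics must be re-attached explicitly on deserialization.

```python
from datetime import date, timedelta

epoch = date(1970, 1, 1)
d = date(2016, 10, 14)

# What the worker would receive: only the numeric payload.
as_double = float((d - epoch).days)

# The fix direction: restore the date type from the epoch offset.
restored = epoch + timedelta(days=as_double)
```

The round trip only works if the deserializer knows which columns carried date/datetime types, which is what the worker side is apparently missing here.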
[jira] [Created] (SPARK-17931) taskScheduler has some unneeded serialization
Guoqiang Li created SPARK-17931:
Summary: taskScheduler has some unneeded serialization
Key: SPARK-17931
URL: https://issues.apache.org/jira/browse/SPARK-17931
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Guoqiang Li

When taskScheduler instantiates TaskDescription, it calls `Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)`, which serializes the task and its dependencies. But after SPARK-2521 was merged into master, the ResultTask and ShuffleMapTask classes no longer contain rdd and closure objects. The TaskDescription class can be changed as below:
{noformat}
class TaskDescription[T](
    val taskId: Long,
    val attemptNumber: Int,
    val executorId: String,
    val name: String,
    val index: Int,
    val task: Task[T]) extends Serializable
{noformat}
[jira] [Created] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused
Guoqiang Li created SPARK-17930:
Summary: The SerializerInstance instance used when deserializing a TaskResult is not reused
Key: SPARK-17930
URL: https://issues.apache.org/jira/browse/SPARK-17930
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.0.1, 1.6.1
Reporter: Guoqiang Li

The following code is called when the DirectTaskResult instance is deserialized:
{noformat}
def value(): T = {
  if (valueObjectDeserialized) {
    valueObject
  } else {
    // Each deserialization creates a new instance of SerializerInstance, which is very time-consuming
    val resultSer = SparkEnv.get.serializer.newInstance()
    valueObject = resultSer.deserialize(valueBytes)
    valueObjectDeserialized = true
    valueObject
  }
}
{noformat}
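[Editor's note] A hypothetical Python sketch of the improvement suggested above (class and field names are illustrative, not Spark's): memoize the deserialized value, as the Scala code already does, AND share one serializer instance across results instead of constructing a new one per call.

```python
import pickle

class DirectTaskResult:
    # Created once and reused; stands in for a cached SerializerInstance
    # rather than calling SparkEnv.serializer.newInstance() on every value().
    _shared_serializer = pickle

    def __init__(self, value_bytes):
        self._value_bytes = value_bytes
        self._value = None
        self._deserialized = False

    def value(self):
        if not self._deserialized:
            # No per-call instance construction: reuse the shared serializer.
            self._value = self._shared_serializer.loads(self._value_bytes)
            self._deserialized = True
        return self._value
```

Sharing an instance this way assumes the serializer is safe to reuse from the calling thread, which is the condition the real fix would need to verify.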
[jira] [Updated] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset
[ https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weizhong updated SPARK-17929:
Summary: Deadlock when AM restart and send RemoveExecutor on reset (was: Deadlock when AM restart send RemoveExecutor)

> Deadlock when AM restart and send RemoveExecutor on reset
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Weizhong
> Priority: Minor
>
> We fixed SPARK-10582 and added reset in CoarseGrainedSchedulerBackend.scala:
> {code}
> protected def reset(): Unit = synchronized {
>   numPendingExecutors = 0
>   executorsPendingToRemove.clear()
>   // Remove all the lingering executors that should be removed but not yet. The reason might be
>   // because (1) disconnected event is not yet received; (2) executors die silently.
>   executorDataMap.toMap.foreach { case (eid, _) =>
>     driverEndpoint.askWithRetry[Boolean](
>       RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
>   }
> }
> {code}
> But removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized",
> which can cause a deadlock: the RPC send fails, and reset fails.
> {code}
> private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
>   logDebug(s"Asked to remove executor $executorId with reason $reason")
>   executorDataMap.get(executorId) match {
>     case Some(executorInfo) =>
>       // This must be synchronized because variables mutated
>       // in this block are read when requesting executors
>       val killed = CoarseGrainedSchedulerBackend.this.synchronized {
>         addressToExecutorId -= executorInfo.executorAddress
>         executorDataMap -= executorId
>         executorsPendingLossReason -= executorId
>         executorsPendingToRemove.remove(executorId).getOrElse(false)
>       }
>       ...
> {code}
[jira] [Created] (SPARK-17929) Deadlock when AM restart send RemoveExecutor
Weizhong created SPARK-17929:
Summary: Deadlock when AM restart send RemoveExecutor
Key: SPARK-17929
URL: https://issues.apache.org/jira/browse/SPARK-17929
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0
Reporter: Weizhong
Priority: Minor

We fixed SPARK-10582 and added reset in CoarseGrainedSchedulerBackend.scala:
{code}
protected def reset(): Unit = synchronized {
  numPendingExecutors = 0
  executorsPendingToRemove.clear()
  // Remove all the lingering executors that should be removed but not yet. The reason might be
  // because (1) disconnected event is not yet received; (2) executors die silently.
  executorDataMap.toMap.foreach { case (eid, _) =>
    driverEndpoint.askWithRetry[Boolean](
      RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
  }
}
{code}
But removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized", which can cause a deadlock: the RPC send fails, and reset fails.
{code}
private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
  logDebug(s"Asked to remove executor $executorId with reason $reason")
  executorDataMap.get(executorId) match {
    case Some(executorInfo) =>
      // This must be synchronized because variables mutated
      // in this block are read when requesting executors
      val killed = CoarseGrainedSchedulerBackend.this.synchronized {
        addressToExecutorId -= executorInfo.executorAddress
        executorDataMap -= executorId
        executorsPendingLossReason -= executorId
        executorsPendingToRemove.remove(executorId).getOrElse(false)
      }
      ...
{code}
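[Editor's note] A simplified, non-Spark illustration of the hazard in SPARK-17929: code holding a monitor makes a blocking call whose completion requires taking that same lock again. Python's `threading.Lock` is non-reentrant, so a second acquire while the first is held can never succeed.

```python
import threading

lock = threading.Lock()

lock.acquire()                          # reset() enters its synchronized block
second = lock.acquire(blocking=False)   # removeExecutor needs the same lock
# second is False: a blocking acquire here would wait forever. In the real
# issue the second acquire happens on the RPC handler thread while reset()
# blocks in askWithRetry, so neither side can make progress.
lock.release()
```

Scala's `synchronized` is reentrant within one thread, which is why the deadlock only appears once the lock is needed on a different (RPC handler) thread.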
[jira] [Commented] (SPARK-17899) add a debug mode to keep raw table properties in HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-17899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573961#comment-15573961 ] Apache Spark commented on SPARK-17899: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15478 > add a debug mode to keep raw table properties in HiveExternalCatalog > > > Key: SPARK-17899 > URL: https://issues.apache.org/jira/browse/SPARK-17899 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-12664: --- Assignee: Yanbo Liang > Expose raw prediction scores in MultilayerPerceptronClassificationModel > --- > > Key: SPARK-12664 > URL: https://issues.apache.org/jira/browse/SPARK-12664 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Robert Dodier >Assignee: Yanbo Liang > > In > org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, > there isn't any way to get raw prediction scores; only an integer output > (from 0 to #classes - 1) is available via the `predict` method. > `mlpModel.predict` is called within the class to get the raw score, but > `mlpModel` is private so that isn't available to outside callers. > The raw score is useful when the user wants to interpret the classifier > output as a probability.
[jira] [Created] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode
Drew Robb created SPARK-17928: - Summary: No driver.memoryOverhead setting for mesos cluster mode Key: SPARK-17928 URL: https://issues.apache.org/jira/browse/SPARK-17928 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.0.1 Reporter: Drew Robb Mesos cluster mode does not have a configuration setting for the driver's memory overhead. This makes scheduling long-running drivers on Mesos using the dispatcher very unreliable. There is an equivalent setting for YARN: spark.yarn.driver.memoryOverhead.
[jira] [Comment Edited] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573929#comment-15573929 ] lichenglin edited comment on SPARK-17898 at 10/14/16 2:41 AM: -- I know it. But how do I build these dependencies into my jar? Could you give me an example with Gradle? Really, thanks a lot. I have tried making a jar with mysql-jdbc.jar, and I have changed the manifest's classpath. It failed. was (Author: licl): I know it. But how to build these dependencies into my jar. Could you give me an example with gradle?? Really thanks a lot. > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories need a username and password to access. > I can't find a way to declare the username and password when submitting a > Spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username > and password
[jira] [Commented] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573929#comment-15573929 ] lichenglin commented on SPARK-17898: I know it. But how do I build these dependencies into my jar? Could you give me an example with Gradle? Really, thanks a lot. > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories need a username and password to access. > I can't find a way to declare the username and password when submitting a > Spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username > and password
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573843#comment-15573843 ] Cody Koeninger commented on SPARK-17812: So I think this is what we've agreed on. Mutually exclusive subscription options (only assign is new to this ticket):
{noformat}
.option("subscribe", "topicFoo,topicBar")
.option("subscribePattern", "topic.*")
.option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
{noformat}
where assign can only be specified that way, with no inline offsets. A single starting-position option with three mutually exclusive types of value:
{noformat}
.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1": -2}, "topicBar": {"0": -1}}""")
{noformat}
startingOffsets with JSON fails if any topic-partition in the assignments doesn't have an offset. Sound right? I'll go ahead and start on it. I'm assuming I should try to reuse some of the existing catalyst Jackson stuff and keep in mind a format that's potentially usable by the checkpoints as well? I don't think earliest / latest is too unclear as long as there's a way to get to the other knobs that auto.offset.reset (should have) provided. Punting the tunability of new partitions to another ticket sounds good. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cody Koeninger > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions
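Spelled out as a DataStreamReader call, the agreement above would look roughly like the sketch below. This is a sketch of a then-unreleased API: the option names come from the comment, the bootstrap-servers value is illustrative, and `spark` is assumed to be an existing SparkSession.

```scala
// Sketch only: option names mirror the agreement in the comment above.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // illustrative
  // Exactly one of the three mutually exclusive subscription options:
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // .option("subscribe", "topicFoo,topicBar")
  // .option("subscribePattern", "topic.*")
  // One starting position; per-partition -2 means earliest, -1 means latest:
  .option("startingOffsets",
    """{"topicFoo": {"0": 1234, "1": -2}, "topicBar": {"0": -1}}""")
  .load()
```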
[jira] [Assigned] (SPARK-17927) Remove dead code in WriterContainer
[ https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17927: Assignee: Apache Spark (was: Reynold Xin) > Remove dead code in WriterContainer > --- > > Key: SPARK-17927 > URL: https://issues.apache.org/jira/browse/SPARK-17927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17927) Remove dead code in WriterContainer
[ https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573838#comment-15573838 ] Apache Spark commented on SPARK-17927: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15477 > Remove dead code in WriterContainer > --- > > Key: SPARK-17927 > URL: https://issues.apache.org/jira/browse/SPARK-17927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17927) Remove dead code in WriterContainer
[ https://issues.apache.org/jira/browse/SPARK-17927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17927: Assignee: Reynold Xin (was: Apache Spark) > Remove dead code in WriterContainer > --- > > Key: SPARK-17927 > URL: https://issues.apache.org/jira/browse/SPARK-17927 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17927) Remove dead code in WriterContainer
Reynold Xin created SPARK-17927: --- Summary: Remove dead code in WriterContainer Key: SPARK-17927 URL: https://issues.apache.org/jira/browse/SPARK-17927 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin speculationEnabled and DATASOURCE_OUTPUTPATH seem like just dead code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17813) Maximum data per trigger
[ https://issues.apache.org/jira/browse/SPARK-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573806#comment-15573806 ] Cody Koeninger commented on SPARK-17813: So, issues to be worked out here (assuming we're still ignoring compacted topics): maxOffsetsPerTrigger: how are these maximums distributed among partitions? What about skewed topics/partitions? maxOffsetsPerTopicPartitionPerTrigger (this isn't just hypothetical, e.g. SPARK-17510): if we do this, how is this configuration communicated?
{noformat}
option("maxOffsetsPerTopicPartitionPerTrigger", """{"topicFoo":{"0":600}, "topicBar":{"0":300, "1": 600}}""")
{noformat}
{noformat}
option("maxOffsetsPerTopicPerTrigger", """{"topicFoo": 600, "topicBar": 300}""")
{noformat}
> Maximum data per trigger > > > Key: SPARK-17813 > URL: https://issues.apache.org/jira/browse/SPARK-17813 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > At any given point in a streaming query execution, we process all available > data. This maximizes throughput at the cost of latency. We should add > something similar to the {{maxFilesPerTrigger}} option available for files.
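As a usage sketch, the two proposed shapes from the comment above might surface as below. Both option names are proposals under discussion, not shipped configuration, and `spark` is an assumed SparkSession.

```scala
// Sketch only: neither option below existed at the time of this discussion.
val limited = spark.readStream
  .format("kafka")
  .option("subscribe", "topicFoo,topicBar")
  // Global cap; how it is distributed across partitions is the open question:
  .option("maxOffsetsPerTrigger", "900")
  // Or explicit per-topic-partition caps, as JSON:
  // .option("maxOffsetsPerTopicPartitionPerTrigger",
  //   """{"topicFoo": {"0": 600}, "topicBar": {"0": 300, "1": 600}}""")
  .load()
```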
[jira] [Assigned] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-17812: Assignee: Cody Koeninger > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Cody Koeninger > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions
[jira] [Commented] (SPARK-17926) Add methods to convert StreamingQueryStatus to json
[ https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573788#comment-15573788 ] Apache Spark commented on SPARK-17926: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/15476 > Add methods to convert StreamingQueryStatus to json > --- > > Key: SPARK-17926 > URL: https://issues.apache.org/jira/browse/SPARK-17926 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Useful for recording StreamingQueryStatuses when exposed through > StreamingQueryListener -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17926) Add methods to convert StreamingQueryStatus to json
[ https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17926: Assignee: Apache Spark (was: Tathagata Das) > Add methods to convert StreamingQueryStatus to json > --- > > Key: SPARK-17926 > URL: https://issues.apache.org/jira/browse/SPARK-17926 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > Useful for recording StreamingQueryStatuses when exposed through > StreamingQueryListener -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17926) Add methods to convert StreamingQueryStatus to json
[ https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17926: Assignee: Tathagata Das (was: Apache Spark) > Add methods to convert StreamingQueryStatus to json > --- > > Key: SPARK-17926 > URL: https://issues.apache.org/jira/browse/SPARK-17926 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Useful for recording StreamingQueryStatuses when exposed through > StreamingQueryListener -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573786#comment-15573786 ] Michael Armbrust commented on SPARK-17812: -- Please do work on it. It might be good to update the description with a summary of this discussion so we can all be sure we are on the same page. I actually do think it's fair to have one configuration for what to do in the case of data loss. This happens when you fall behind, or when you come back and new partitions are there that have already aged out. Let's add this in another ticket. I know you are super deep in Kafka, and others should chime in if I'm way off-base, but I think that {{startingOffsets=earliest}} and {{startingOffsets=latest}} make it pretty clear what is happening. I would not change {{earliest}} and {{latest}} just to be different from Kafka. We could make it "query start" if this is still confusing, but let's do that soon if that is the case. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions
[jira] [Updated] (SPARK-17926) Add methods to convert StreamingQueryStatus to json
[ https://issues.apache.org/jira/browse/SPARK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-17926: -- Description: Useful for recording StreamingQueryStatuses when exposed through StreamingQueryListener > Add methods to convert StreamingQueryStatus to json > --- > > Key: SPARK-17926 > URL: https://issues.apache.org/jira/browse/SPARK-17926 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Useful for recording StreamingQueryStatuses when exposed through > StreamingQueryListener -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17926) Add methods to convert StreamingQueryStatus to json
Tathagata Das created SPARK-17926: - Summary: Add methods to convert StreamingQueryStatus to json Key: SPARK-17926 URL: https://issues.apache.org/jira/browse/SPARK-17926 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573766#comment-15573766 ] Cody Koeninger commented on SPARK-17812: OK, failing on start is clear (it's really annoying in the case of subscribePattern), but at least it's clear. I think that's enough to get started on this ticket; is anyone currently working on it, or can I do it? Ryan seemed worried that it wouldn't get done in time for the next release. It sounds like your current plan is to ignore whatever comes out of KAFKA-3370, which is fine as long as whatever you do is both clear and equally tunable. Clarity of semantics can't be the only criterion of an API: "You can only start at the latest offset, period" is clear, but a crap API. {quote} the only case where we lack sufficient tunability is "Where do I go when the current offsets are invalid due to retention?". {quote} No, you lack sufficient tunability as to where newly discovered partitions start. Keep in mind that those partitions may have been discovered after a significant job downtime. If the point of an API is to provide clear semantics to the user, it is not at all clear to me as a user how I can start those partitions at latest, which I know is possible in the underlying data model. The reason I'm belaboring this point now is that you have chosen names (earliest, latest) for the API currently under discussion that are confusingly similar to the existing auto offset reset functionality, and you have provided knobs for some, but not all, of the things auto offset reset currently affects. This is going to confuse people; it already confuses me. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions
[jira] [Resolved] (SPARK-17368) Scala value classes create encoder problems and break at runtime
[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-17368. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15284 [https://github.com/apache/spark/pull/15284] > Scala value classes create encoder problems and break at runtime > > > Key: SPARK-17368 > URL: https://issues.apache.org/jira/browse/SPARK-17368 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.2, 2.0.0 > Environment: JDK 8 on MacOS > Scala 2.11.8 > Spark 2.0.0 >Reporter: Aris Vlasakakis > Fix For: 2.1.0 > > > Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 > and 1.6.X. > This simple Spark 2 application demonstrates that the code will compile, but > will break at runtime with the error below. The value class is of course > *FeatureId*, as it extends AnyVal. > {noformat} > Exception in thread "main" java.lang.RuntimeException: Error while encoding: > java.lang.RuntimeException: Couldn't find v on int > assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0 > +- assertnotnull(input[0, int, true], top level non-flat input object).v >+- assertnotnull(input[0, int, true], top level non-flat input object) > +- input[0, int, true]
> at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279) > at > org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421) > at > org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421) > {noformat} > Test code for Spark 2.0.0: > {noformat} > import org.apache.spark.sql.{Dataset, SparkSession} > object BreakSpark { > case class FeatureId(v: Int) extends AnyVal > def main(args: Array[String]): Unit = { > val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3)) > val spark = SparkSession.builder.getOrCreate() > import spark.implicits._ > spark.sparkContext.setLogLevel("warn") > val ds: Dataset[FeatureId] = spark.createDataset(seq) > println(s"BREAK HERE: ${ds.count}") > } > } > {noformat}
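For code that cannot wait for the 2.1.0 fix, one common workaround is to avoid a value class at the Dataset's top level. The sketch below adapts the reporter's test program by simply dropping `extends AnyVal`; this is an illustrative workaround under that assumption, not something taken from the ticket.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object WorkedAround {
  // A regular case class (no `extends AnyVal`) encodes fine on 1.6.x/2.0.x,
  // at the cost of losing the value-class allocation optimization.
  case class FeatureId(v: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("warn")
    val ds: Dataset[FeatureId] =
      spark.createDataset(Seq(FeatureId(1), FeatureId(2), FeatureId(3)))
    println(s"count = ${ds.count}")
    spark.stop()
  }
}
```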
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573692#comment-15573692 ] Hossein Falaki commented on SPARK-17916: Thanks for linking it. Yes, they are very much the same issue. However, I slightly disagree with the proposed solution. I will comment on the PR. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When a user configures {{nullValue}} in the CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows.
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573677#comment-15573677 ] Michael Armbrust commented on SPARK-17812: -- bq. with your proposed interface, what, as a user, do you expect to happen when you specify startingOffsets for some but not all partitions? I would probably opt to fail to start the query, with advice on how to fix it (i.e. "specify {{-1}} for these partitions if you don't care"). We could also have a default, but I tend to err on the side of explicit behavior. bq. Yes, auto.offset.reset is a mess. Have you read https://issues.apache.org/jira/browse/KAFKA-3370 What are you going to do when that ticket is resolved? It should allow users to answer the questions you raised in very specific ways that your interface does not. There is clearly a lot of confusing baggage with this configuration option, specifically because it conflates too many unrelated concerns. Furthermore, IMHO {{auto.offset.reset}} is a pretty confusing name that does not imply anything about where in the stream this query should start. "reset" implies you were set somewhere to begin with. In contrast, {{startingOffsets}} handles one case clearly: it picks the offsets that are used as a starting point for the append-only table abstraction that Spark is providing. As far as I understand the discussion on the ticket you referenced, the only case where we lack sufficient tunability is "Where do I go when the current offsets are invalid due to retention?". In this case, where data has been lost and {{failOnDataLoss=false}}, we currently try to minimize the amount of data we lose by starting at the earliest offsets available. We should certainly consider making this behavior configurable as well, but that seems like a different concern than what is being discussed in this JIRA. Personally, it seems like if you are falling so far behind that you have to skip all the way ahead, something is going very wrong. However, if users request this feature, we should certainly add it. I would not, however, shoe-horn it into anything having to do with query start behavior. It seems like they have reached a similar conclusion, as they are considering adding a new configuration, {{auto.reset.offset.existing}}. bq. Is the purpose of your interface to do what you think users should be able to do, or what they need to be able to do? The purpose of an interface is to provide clear semantics to the user. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573668#comment-15573668 ] Hyukjin Kwon commented on SPARK-17916: -- Hi [~falaki], this JIRA rings a bell to me. Do you mind taking a look at https://github.com/apache/spark/pull/12904 and SPARK-15125? Could you confirm whether it is a duplicate? > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When a user configures {{nullValue}} in the CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows.
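A minimal reproduction of the reported behavior, assuming the two-row file from the description has been saved at /tmp/data.csv (the path and the `spark` session are assumptions, not part of the ticket):

```scala
// Assumes /tmp/data.csv holds the description's data:
//   col1,col2
//   1,"-"
//   2,""
val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")   // only "-" should be treated as null
  .csv("/tmp/data.csv")
df.show()
// Expected: col2 is null only in row 1.
// Reported on 2.0.1: col2 is null in both rows, because empty strings
// are converted to null regardless of the nullValue setting.
```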
[jira] [Assigned] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces
[ https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17925: Assignee: Apache Spark (was: Reynold Xin) > Break fileSourceInterfaces.scala into multiple pieces > - > > Key: SPARK-17925 > URL: https://issues.apache.org/jira/browse/SPARK-17925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > fileSourceInterfaces.scala has grown fairly large, making it difficult to > understand. To facilitate code review, I'd submit a separate patch to break > this into multiple pieces first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces
[ https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573640#comment-15573640 ] Apache Spark commented on SPARK-17925: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15473 > Break fileSourceInterfaces.scala into multiple pieces > - > > Key: SPARK-17925 > URL: https://issues.apache.org/jira/browse/SPARK-17925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > fileSourceInterfaces.scala has grown fairly large, making it difficult to > understand. To facilitate code review, I'd submit a separate patch to break > this into multiple pieces first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces
[ https://issues.apache.org/jira/browse/SPARK-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17925: Assignee: Reynold Xin (was: Apache Spark) > Break fileSourceInterfaces.scala into multiple pieces > - > > Key: SPARK-17925 > URL: https://issues.apache.org/jira/browse/SPARK-17925 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > fileSourceInterfaces.scala has grown fairly large, making it difficult to > understand. To facilitate code review, I'd submit a separate patch to break > this into multiple pieces first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17925) Break fileSourceInterfaces.scala into multiple pieces
Reynold Xin created SPARK-17925: --- Summary: Break fileSourceInterfaces.scala into multiple pieces Key: SPARK-17925 URL: https://issues.apache.org/jira/browse/SPARK-17925 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin fileSourceInterfaces.scala has grown fairly large, making it difficult to understand. To facilitate code review, I'd submit a separate patch to break this into multiple pieces first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17924) Consolidate streaming and batch write path
Reynold Xin created SPARK-17924: --- Summary: Consolidate streaming and batch write path Key: SPARK-17924 URL: https://issues.apache.org/jira/browse/SPARK-17924 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Structured streaming and normal SQL operation currently have two separate write paths, leading to a lot of duplicated functions (that look similar) and if branches. The purpose of this ticket is to consolidate the two as much as possible to make the write path clearer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17678) Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port"
[ https://issues.apache.org/jira/browse/SPARK-17678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17678: - Affects Version/s: (was: 1.6.3) 1.6.2 > Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" > > > Key: SPARK-17678 > URL: https://issues.apache.org/jira/browse/SPARK-17678 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.2 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 1.6.3 > > > Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" > configuration, so user cannot set a fixed port number through > "spark.replClassServer.port". > There's no issue in Spark2.0+, since this class is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17678) Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port"
[ https://issues.apache.org/jira/browse/SPARK-17678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17678. -- Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 1.6.3 > Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" > > > Key: SPARK-17678 > URL: https://issues.apache.org/jira/browse/SPARK-17678 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.3 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 1.6.3 > > > Spark 1.6 Scala-2.11 repl doesn't honor "spark.replClassServer.port" > configuration, so user cannot set a fixed port number through > "spark.replClassServer.port". > There's no issue in Spark2.0+, since this class is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573563#comment-15573563 ] Cody Koeninger commented on SPARK-17812: So a short term question - with your proposed interface, what, as a user, do you expect to happen when you specify startingOffsets for some but not all partitions? A couple of medium term questions: - Yes, auto.offset.reset is a mess. Have you read https://issues.apache.org/jira/browse/KAFKA-3370 - What are you going to do when that ticket is resolved? It should allow users to answer the questions you raised in very specific ways, that your interface does not. And a really leading long term question: - Is the purpose of your interface to do what you think users should be able to do, or what they need to be able to do? > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573526#comment-15573526 ] Michael Armbrust commented on SPARK-17812: -- As far as I understand it, {{auto.offset.reset}} is conflating a few things that make it hard for me to reason about exactly-once semantics in my query. It is answering all of the following: - Where do I start when I'm creating this {{group.id}} for the first time? - What do I do when a new partition is added to a topic I'm watching? - What do I do when the current offset is invalid because of retention? The model of structured streaming is an append only table, where we are computing the same answer incrementally as if you were running a batch query over all of the data in the table. The whole goal is to make it easy to reason about correctness and push the hard work of incremental processing and late data management into the optimizer / query planner. As a result, I think we are trying to answer a different set of questions than a distributed set of consumers that share a {{group.id}}: - Should this append only table contain all of the historical data available, or do I begin at this moment and start appending? This is what {{startingOffsets}} answers. I think we should handle {{"earliest"}} (all data), {{"latest"}} (only data that arrives after now), and a very specific point in time across partitions (probably when some other query stopped running). - When I get into a situation where data has been deleted by the retention mechanism without me seeing it, what should I do? Fail the query? Or issue a warning and compute best effort on the data available. This is what {{failOnDataLoss}} answers. In particular, I think the kafka method of configuration makes it confusing to do something like, "starting now, compute some aggregation exactly once". The documentation even points out some of the pit falls: bq. ... 
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest. Really what I want here is, "begin the query at largest", but "start new partitions at smallest (and in fact, tell me if I'm so late joining a new partition that I have already lost some data)". > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
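The JSON shape Michael sketches for {{startingOffsets}} can be made concrete. A minimal Python sketch: the reader setup is left as comments because it assumes a running SparkSession with the Kafka source on the classpath, and the option names follow this discussion rather than a settled API.

```python
import json

# Per-partition starting offsets in the JSON shape discussed above:
# topic -> partition -> offset. The numbers are illustrative.
starting = {"topicFoo": {"0": 1234, "1": 4567}}
starting_json = json.dumps(starting)

# Hypothetical reader setup using the option names proposed in this
# thread (needs a live SparkSession plus the Kafka source, so it is
# shown unexecuted):
# df = (spark.readStream
#         .format("kafka")
#         .option("kafka.bootstrap.servers", "host:9092")
#         .option("startingOffsets", starting_json)
#         .option("failOnDataLoss", "true")
#         .load())

# The option value round-trips cleanly through JSON.
assert json.loads(starting_json) == starting
```

Note how {{failOnDataLoss}} stays a separate knob: it answers the retention question independently of where the query starts.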
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573512#comment-15573512 ] Ashish Shrowty commented on SPARK-17709: [~smilegator] I compiled with the added debug information and here is the output - {code:java} scala> val d1 = spark.sql("select * from testext2") d1: org.apache.spark.sql.DataFrame = [productid: int, price: float ... 2 more fields] scala> val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price")) df1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 more field] scala> val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count")) df2: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 more field] scala> df1.join(df2, Seq("companyid", "productid")).show org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] can not be resolved given input columns: [companyid, productid, price, count] ;; 'Join UsingJoin(Inner,List('companyid, 'productid)) :- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, sum(cast(price#123 as double)) AS price#166] : +- Project [productid#122, price#123, count#124, companyid#121] : +- SubqueryAlias testext2 :+- Relation[productid#122,price#123,count#124,companyid#121] parquet +- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, sum(cast(count#124 as bigint)) AS count#177L] +- Project [productid#122, price#123, count#124, companyid#121] +- SubqueryAlias testext2 +- Relation[productid#122,price#123,count#124,companyid#121] parquet at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589) at org.apache.spark.sql.Dataset.join(Dataset.scala:641) at org.apache.spark.sql.Dataset.join(Dataset.scala:614) ... 48 elided {code} > spark 2.0 join - column resolution error > > > Key: SPARK-17709 > URL: https://issues.apache.org/jira/browse/SPARK-17709 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Ashish Shrowty > Labels: easyfix > > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > // reading from Hive .. the data is stored in Parquet format in Amazon S3 > val d1 = spark.sql("select * from ") > val df1 = d1.groupBy("key1","key2") > .agg(avg("totalprice").as("avgtotalprice")) > val df2 = d1.groupBy("key1","key2") > .agg(avg("itemcount").as("avgqty")) > df1.join(df2, Seq("key1","key2")) gives error - > org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can > not be resolved given input columns: [key1, key2, avgtotalprice, avgqty]; > If the same Dataframe is initialized via spark.read.parquet(), the above code > works. This same code above worked with Spark 1.6.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
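Until the resolution bug itself is addressed, one commonly used workaround (a sketch, not the fix that eventually landed) is to rename the non-key columns on one side so the two aggregates derived from the same parent no longer carry clashing attribute names. The helper below is hypothetical:

```python
# Hypothetical helper: build (old, new) rename pairs for every non-key
# column, so one side of a self-join gets unambiguous names.
def rename_plan(columns, keys, suffix):
    return [(c, "%s_%s" % (c, suffix)) for c in columns if c not in keys]

# PySpark usage sketch (commented out; requires a SparkSession):
# for old, new in rename_plan(df2.columns, ["companyid", "productid"], "r"):
#     df2 = df2.withColumnRenamed(old, new)
# df1.join(df2, ["companyid", "productid"]).show()

assert rename_plan(["companyid", "productid", "count"],
                   ["companyid", "productid"], "r") == [("count", "count_r")]
```

Re-reading the data with {{spark.read.parquet()}}, as the reporter notes, also sidesteps the shared attribute ids.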
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573479#comment-15573479 ] Cody Koeninger commented on SPARK-17812: While some decision is better than none, can you help me understand why you don't believe me that auto.offset.reset is orthogonal to specifying specific starting positions? Or do you just not believe it's important? The reasons you guys used a different name from auto.offset.reset are that the Kafka project semantics of it are inadequate. But they will fix it, and when they do, the fact that you have conflated two unrelated things into one configuration in your api is going to cause problems. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17919) Make timeout to RBackend configurable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573469#comment-15573469 ] Felix Cheung commented on SPARK-17919: -- Earlier bug: https://issues.apache.org/jira/browse/SPARK-12609 > Make timeout to RBackend configurable in SparkR > --- > > Key: SPARK-17919 > URL: https://issues.apache.org/jira/browse/SPARK-17919 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > I am working on a project where {{gapply()}} is being used with a large > dataset that happens to be extremely skewed. On that skewed partition, the > user function takes more than 2 hours to return and that turns out to be > larger than the timeout that we hardcode in SparkR for backend connection. > {code} > connectBackend <- function(hostname, port, timeout = 6000) > {code} > Ideally user should be able to reconfigure Spark and increase the timeout. It > should be a small fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
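The hardcoded default is straightforward to lift into configuration. The idea is sketched in Python for brevity; the environment variable name is made up for illustration, and the actual SparkR fix would live on the R side:

```python
import os

def backend_timeout(env="SPARKR_BACKEND_TIMEOUT", default=6000):
    """Return the backend connection timeout in seconds, letting a
    (hypothetical) environment variable override the hardcoded default."""
    raw = os.environ.get(env)
    if raw is None:
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError("timeout must be positive, got %d" % value)
    return value

os.environ["SPARKR_BACKEND_TIMEOUT"] = "14400"  # 4 hours, for the skewed task
assert backend_timeout() == 14400
del os.environ["SPARKR_BACKEND_TIMEOUT"]
assert backend_timeout() == 6000
```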
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573459#comment-15573459 ] Michael Armbrust edited comment on SPARK-17812 at 10/13/16 10:53 PM: - +1 to the suggested ways of subscribing, and for using "assign" as a familiar name. I would probably leave it with a single option like this: {code} .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1", 4567}}""") {code} Were you can give -1 or -2 (again following kafka) for specific partitions. {{startingTime}} could be added when we support time indexes. was (Author: marmbrus): +1 to the suggested was of subscribing, and for using "assign" as a familiar name. I would probably leave it with a single option like this: {code} .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1", 4567}}""") {code} Were you can give -1 or -2 (again following kafka) for specific partitions. {{startingTime}} could be added when we support time indexes. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573459#comment-15573459 ] Michael Armbrust edited comment on SPARK-17812 at 10/13/16 10:53 PM: - +1 to the suggested was of subscribing, and for using "assign" as a familiar name. I would probably leave it with a single option like this: {code} .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1", 4567}}""") {code} Were you can give -1 or -2 (again following kafka) for specific partitions. {{startingTime}} could be added when we support time indexes. was (Author: marmbrus): +1 to the suggested was of subscribing, and for using "assign" as a familiar name. I would probably leave it with a single option like this: {{.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1", 4567\}\}"""}} > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573457#comment-15573457 ] Ofir Manor commented on SPARK-17812: I'm with you - I warned you it is bikeshedding... I don't have a strong opinion, just a preference, and what you suggested is way better than the committed solution, so I'll get out of the loop. Whatever [~marmbrus] and you are OK with - either way it would be a big step forward > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573459#comment-15573459 ] Michael Armbrust commented on SPARK-17812: -- +1 to the suggested ways of subscribing, and for using "assign" as a familiar name. I would probably leave it with a single option like this: {{.option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 1234, "1", 4567\}\}"""}} > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573432#comment-15573432 ] Cody Koeninger edited comment on SPARK-17812 at 10/13/16 10:44 PM: --- If you're seriously worried that people are going to get confused, {noformat} .option("defaultOffsets", "earliest" | "latest") .option("specificOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""") {noformat} let those two at least not be mutually exclusive, and punt on the question of precedence until there's an actual startingTime or startingX ticket. was (Author: c...@koeninger.org): If you're seriously worried that people are going to get confused, {noformat} .option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""") .option("defaultOffsets", "earliest" | "latest") {noformat} let those two at least not be mutually exclusive, and punt on the question of precedence until there's an actual startingTime or startingX ticket. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573432#comment-15573432 ] Cody Koeninger commented on SPARK-17812: If you're seriously worried that people are going to get confused, {noformat} .option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""") .option("defaultOffsets", "earliest" | "latest") {noformat} let those two at least not be mutually exclusive, and punt on the question of precedence until there's an actual startingTime or startingX ticket. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
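The precedence Cody implies — a specific per-partition offset wins, anything unlisted falls back to the single default policy — amounts to a one-line lookup. A sketch with illustrative names, not Spark API:

```python
def resolve_start(topic, partition, specific_offsets, default_offset):
    """Specific (topic, partition) offsets take precedence; anything not
    listed falls back to the default policy ("earliest" | "latest")."""
    return specific_offsets.get(topic, {}).get(str(partition), default_offset)

specific = {"topicFoo": {"0": 1234, "1": 4567}}

assert resolve_start("topicFoo", 0, specific, "latest") == 1234
# A partition added after the query started falls back to the default:
assert resolve_start("topicFoo", 2, specific, "latest") == "latest"
assert resolve_start("topicBar", 0, specific, "earliest") == "earliest"
```

This also shows why the two options need not be mutually exclusive: the default answers the new-partition question without overriding anything explicitly pinned.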
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573395#comment-15573395 ] Cody Koeninger edited comment on SPARK-17812 at 10/13/16 10:25 PM: --- 1. we dont have lists, we have strings. regexes and valid topic names have overlaps (dot is the obvious one). 2. Mapping directly to kafka method names means we don't have to come up with some other (weird and possibly overlapping) name when they add more ways to subscribe, we just use theirs. 3. I think this "starting X mssages" is a mess with kafka semantics for the reasons both you and I have already expressed. At any rate, I think Michael already clearly punted the "starting X messages" case to a different ticket. 4. I think it's more than sufficiently clear as suggested, no one is going to expect that a specific offset they provided is going to be overruled by a general single default. The implementation is also crystal clear - seek to the position identified by startingTime, then seek to any specific offsets for specific partitions Yes, this is all bikeshedding, but it's bikeshedding that directly affects what people are actually able to do with the api. Needlessly restricting it for reasons that have nothing to do with safety is just going to piss users off for no reason. Just because you don't have a use case that needs it, doesn't mean you should arbitrarily prevent users from doing it. Please, just choose something and let me build it so that people can actually use the thing by the next release was (Author: c...@koeninger.org): 1. we dont have lists, we have strings. regexes and valid topic names have overlaps (dot is the obvious one). 2. Mapping directly to kafka method names means we don't have to come up with some other (weird and possibly overlapping) name when they add more ways to subscribe, we just use theirs. 3. I think this is a mess with kafka semantics for the reasons both you and I have already expressed. 
At any rate, I think Michael already clearly punted the "starting X" case to a different topic. 4. I think it's more than sufficiently clear as suggested, no one is going to expect that a specific offset they provided is going to be overruled by a general single default. The implementation is also crystal clear - seek to the position identified by startingTime, then seek to any specific offsets for specific partitions Yes, this is all bikeshedding, but it's bikeshedding that directly affects what people are actually able to do with the api. Needlessly restricting it for reasons that have nothing to do with safety is just going to piss users off for no reason. Just because you don't have a use case that needs it, doesn't mean you should arbitrarily prevent users from doing it. Please, just choose something and let me build it so that people can actually use the thing by the next release > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573395#comment-15573395 ] Cody Koeninger commented on SPARK-17812: 1. we dont have lists, we have strings. regexes and valid topic names have overlaps (dot is the obvious one). 2. Mapping directly to kafka method names means we don't have to come up with some other (weird and possibly overlapping) name when they add more ways to subscribe, we just use theirs. 3. I think this is a mess with kafka semantics for the reasons both you and I have already expressed. At any rate, I think Michael already clearly punted the "starting X" case to a different topic. 4. I think it's more than sufficiently clear as suggested, no one is going to expect that a specific offset they provided is going to be overruled by a general single default. The implementation is also crystal clear - seek to the position identified by startingTime, then seek to any specific offsets for specific partitions Yes, this is all bikeshedding, but it's bikeshedding that directly affects what people are actually able to do with the api. Needlessly restricting it for reasons that have nothing to do with safety is just going to piss users off for no reason. Just because you don't have a use case that needs it, doesn't mean you should arbitrarily prevent users from doing it. Please, just choose something and let me build it so that people can actually use the thing by the next release > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. 
It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon
[ https://issues.apache.org/jira/browse/SPARK-17555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573363#comment-15573363 ] Eugene Zhulenev edited comment on SPARK-17555 at 10/13/16 10:21 PM: [~brdwrd] I had the same issue, and I figured out why. Basically your Spark executor is running in a Docker container and writes shuffle blocks to the /tmp directory, and passes this information to the External Shuffle Service. But the problem is that the shuffle service runs in its own Docker container. You have to share the same volume between your executors and the shuffle service. I added this config to the shuffle service marathon configuration: {code} "container": { "type": "DOCKER", "docker": { "image": "mesosphere/spark:1.0.2-2.0.0", "network": "HOST" }, "volumes": [ { "containerPath": "/var/data/spark", "hostPath": "/var/data/spark", "mode": "RW" } ] } {code} After that you need to specify the *spark.mesos.executor.docker.volumes* and *spark.local.dir* parameters, so executors write data to the shared volume was (Author: ezhulenev): [~brdwrd] I had the same issue, and I figured out why. Basically you Spark executor is running in Docker container and writes shuffle blocks to /tmp directory, and passes this information to External Shuffle Service. But the problem is that shuffle service running in it's own Docker container. You have to share the save volume between your executors and shuffle service.
I added this config to shuffle service marathon configuration: {code} "container": { "type": "DOCKER", "docker": { "image": "mesosphere/spark:1.0.2-2.0.0", "network": "HOST" }, "volumes": [ { "containerPath": "/var/data/spark", "hostPath": "/var/data/spark", "mode": "RW" } ] } {code} After that you need to specify spark.mesos.executor.docker.volumes and spark.local.dir parameters, so executors would write data to shared volume > ExternalShuffleBlockResolver fails randomly with External Shuffle Service > and Dynamic Resource Allocation on Mesos running under Marathon > -- > > Key: SPARK-17555 > URL: https://issues.apache.org/jira/browse/SPARK-17555 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1, 1.6.2, 2.0.0 > Environment: Mesos using docker with external shuffle service running > on Marathon. Running code from pyspark shell in client mode. >Reporter: Brad Willard > Labels: docker, dynamic_allocation, mesos, shuffle > > External Shuffle Service throws these errors about 90% of the time. It seems > to die between stages and work inconsistently with these style of errors > about missing files. I've tested this same behavior with all the spark > versions listed on the jira using the pre-build hadoop 2.6 distributions from > the apache spark download page. I also want to mention everything works > successfully with dynamic resource allocation turned off. > I have read other related bugs and have tried some of the > workaround/suggestions. Seems like some people have blamed the switch from > akka to netty which got me testing this in the 1.5* range with no luck. I'm > currently running these config option (informed by reading other bugs on jira > that seemed related to my problem). These settings have helped it work > sometimes instead of never. 
> {code} > spark.shuffle.service.port 7338 > spark.shuffle.io.numConnectionsPerPeer 4 > spark.shuffle.io.connectionTimeout 18000s > spark.shuffle.service.enabled true > spark.dynamicAllocation.enabled true > {code} > on the driver for pyspark submit I'm sending along this config > {code} > --conf > spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1 > \ > --conf spark.shuffle.service.enabled=true \ > --conf spark.dynamicAllocation.enabled=true \ > --conf spark.mesos.coarse=true \ > --conf spark.cores.max=100 \ > --conf spark.executor.uri=$SPARK_EXECUTOR_URI \ > --conf spark.shuffle.service.port=7338 \ > --executor-memory 15g > {code} > Under Marathon I'm pinning each external shuffle service to an agent and > starting the service like this. > {code} > $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f > $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService* > {code} > On startup it seems like all is well > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ >
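Eugene Zhulenev's fix above boils down to a handful of settings that keep the executor's shuffle directory on a volume the shuffle-service container also mounts. A sketch of the corresponding spark-defaults.conf entries follows; the /var/data/spark path is illustrative and must match the volume mounted into both containers:

{code}
# spark-defaults.conf -- illustrative sketch, not taken verbatim from this issue.
# /var/data/spark is an example path; it must match the volume shared between
# the executor containers and the external shuffle service container.
spark.local.dir                        /var/data/spark
spark.mesos.executor.docker.volumes    /var/data/spark:/var/data/spark:rw
spark.shuffle.service.enabled          true
spark.dynamicAllocation.enabled        true
{code}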
[jira] [Commented] (SPARK-14212) Add configuration element for --packages option
[ https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573381#comment-15573381 ] holdenk commented on SPARK-14212: - Please do! I think I've outlined the basic steps in my comment above, but let me know if you have any questions. Thanks for helping out :) > Add configuration element for --packages option > --- > > Key: SPARK-14212 > URL: https://issues.apache.org/jira/browse/SPARK-14212 > Project: Spark > Issue Type: New Feature > Components: Documentation, PySpark >Affects Versions: 1.6.1 >Reporter: Russell Jurney >Priority: Trivial > Labels: config, starter > > I use PySpark with the --packages option, for instance to load support for > CSV: > pyspark --packages com.databricks:spark-csv_2.10:1.4.0 > I would like to not have to set this every time at the command line, so a > corresponding element for --packages in the configuration file > spark-defaults.conf, would be good to have. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
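For reference, Spark does expose a configuration property corresponding to --packages in later releases; a hedged sketch of what such a spark-defaults.conf entry looks like (assumes a Spark version that supports the spark.jars.packages property):

{code}
# spark-defaults.conf -- assumes a Spark version supporting spark.jars.packages
spark.jars.packages   com.databricks:spark-csv_2.10:1.4.0
{code}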
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573380#comment-15573380 ] holdenk commented on SPARK-9487: +1 to [~srowen]'s comment. I would not be surprised to see some test failures because of the implicit change in the default partitioning as a result - but for most of those just updating the results will be the right course of action. Let me know if you have any questions [~kanjilal] and welcome to PySpark :) > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
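The hazard Xiangrui describes, an operation that depends on partition IDs producing different results under local[2] versus local[4], can be illustrated without Spark at all. The snippet below is a plain-Python analogy (the split/noisy_sum helpers and the data are invented for illustration, not Spark code):

```python
import random

def split(data, n):
    """Chunk data into n contiguous partitions, like a simple range partitioner."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def noisy_sum(data, num_partitions):
    """Add partition-ID-seeded noise to each element, then sum.

    Mimics an RDD operation (e.g. a random number generator seeded by the
    partition ID) whose result depends on how the data is partitioned.
    """
    total = 0.0
    for pid, part in enumerate(split(data, num_partitions)):
        rng = random.Random(pid)  # seed depends on the partition ID
        total += sum(x + rng.random() for x in part)
    return total

data = list(range(8))
# Same input, same logic -- different results under 2 vs 4 "worker threads".
print(noisy_sum(data, 2), noisy_sum(data, 4))
```

This is why aligning the thread counts across Scala and Python suites matters: expected values baked into tests may silently encode one particular partitioning.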
[jira] [Created] (SPARK-17923) dateFormat unexpected kwarg to df.write.csv
Evan Zamir created SPARK-17923: -- Summary: dateFormat unexpected kwarg to df.write.csv Key: SPARK-17923 URL: https://issues.apache.org/jira/browse/SPARK-17923 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Evan Zamir Priority: Minor Calling it like this: {code}writer.csv(path, header=header, sep=sep, compression=compression, dateFormat=date_format){code} Getting the following error: {code}TypeError: csv() got an unexpected keyword argument 'dateFormat'{code} The error occurs when it is called with {code}date_format='-MM-dd'{code} as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12916) Support Row.fromSeq and Row.toSeq methods in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573371#comment-15573371 ] holdenk commented on SPARK-12916: - +1 with Hyukjin, I'll go ahead and close this as a "Won't Fix" > Support Row.fromSeq and Row.toSeq methods in pyspark > > > Key: SPARK-12916 > URL: https://issues.apache.org/jira/browse/SPARK-12916 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: dataframe, pyspark, row, sql > > Pyspark should also have access to the Row functions like fromSeq and toSeq > which are exposed in the Scala API. > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row > This will be useful when constructing custom columns from functions called on > dataframes. A good example is present in the following SO thread: > http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe > {code:scala} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = > Seq(x * y, z.head.toInt * y) > val schema = StructType(df.schema.fields ++ > Array(StructField("foo", DoubleType), StructField("bar", DoubleType))) > val rows = df.rdd.map(r => Row.fromSeq( > r.toSeq ++ > foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z" > val df2 = sqlContext.createDataFrame(rows, schema) > df2.show > // +---++---++-+ > // | x| y| z| foo| bar| > // +---++---++-+ > // | 1| 3.0| a| 3.0|291.0| > // | 2|-1.0| b|-2.0|-98.0| > // | 3| 0.0| c| 0.0| 0.0| > // +---++---++-+ > {code} > I am ready to work on this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12916) Support Row.fromSeq and Row.toSeq methods in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-12916. --- Resolution: Won't Fix Since Row is now a subclass of Tuple we don't really need this anymore. > Support Row.fromSeq and Row.toSeq methods in pyspark > > > Key: SPARK-12916 > URL: https://issues.apache.org/jira/browse/SPARK-12916 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: dataframe, pyspark, row, sql > > Pyspark should also have access to the Row functions like fromSeq and toSeq > which are exposed in the scala api. > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row > This will be useful when constructing custom columns from function called in > dataframes. A good example is present in the following SO threat: > http://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe > {code:python} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > def foobarFunc(x: Long, y: Double, z: String): Seq[Any] = > Seq(x * y, z.head.toInt * y) > val schema = StructType(df.schema.fields ++ > Array(StructField("foo", DoubleType), StructField("bar", DoubleType))) > val rows = df.rdd.map(r => Row.fromSeq( > r.toSeq ++ > foobarFunc(r.getAs[Long]("x"), r.getAs[Double]("y"), r.getAs[String]("z" > val df2 = sqlContext.createDataFrame(rows, schema) > df2.show > // +---++---++-+ > // | x| y| z| foo| bar| > // +---++---++-+ > // | 1| 3.0| a| 3.0|291.0| > // | 2|-1.0| b|-2.0|-98.0| > // | 3| 0.0| c| 0.0| 0.0| > // +---++---++-+ > {code} > I am ready to work on this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
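holdenk's rationale is that PySpark's Row already subclasses tuple, so fromSeq/toSeq behaviour falls out of ordinary sequence semantics. A toy stand-in sketches the point (this is an analogy, not the real pyspark.sql.Row, whose constructor also accepts keyword arguments):

```python
class Row(tuple):
    """Toy stand-in for pyspark.sql.Row, which is a tuple subclass."""
    def __new__(cls, *values):
        return super().__new__(cls, values)

seq = [1, 3.0, "a"]

# "Row.fromSeq": unpack any sequence into a Row.
r = Row(*seq)

# "Row.toSeq": a Row already behaves as a sequence.
assert list(r) == seq
assert r[2] == "a"
print(r)  # (1, 3.0, 'a')
```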
[jira] [Commented] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action
[ https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573367#comment-15573367 ] holdenk commented on SPARK-16720: - Sounds good - go ahead and close this :) > Loading CSV file with 2k+ columns fails during attribute resolution on action > - > > Key: SPARK-16720 > URL: https://issues.apache.org/jira/browse/SPARK-16720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: holdenk > > Example shell for repro: > {quote} > scala> val df =spark.read.format("csv").option("header", > "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv") > df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int > ... 2125 more fields] > scala> df.schema > res0: org.apache.spark.sql.types.StructType = > StructType(StructField(Date,StringType,true), StructField(Lifetime Total > Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), > StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged > Users,IntegerType,true), StructField(Weekly Page Engaged > Users,IntegerType,true), StructField(28 Days Page Engaged > Users,IntegerType,true), StructField(Daily Like Sources - On Your > Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), > StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total > Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), > StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days > Organic Reach,IntegerType,true), StructField(Daily T... 
> scala> df.take(1) > [GIANT LIST OF COLUMNS] > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61) > at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51) > at > 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$pl
[jira] [Commented] (SPARK-17555) ExternalShuffleBlockResolver fails randomly with External Shuffle Service and Dynamic Resource Allocation on Mesos running under Marathon
[ https://issues.apache.org/jira/browse/SPARK-17555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573363#comment-15573363 ] Eugene Zhulenev commented on SPARK-17555: - [~brdwrd] I had the same issue, and I figured out why. Basically your Spark executor is running in a Docker container and writes shuffle blocks to the /tmp directory, and passes this information to the External Shuffle Service. But the problem is that the shuffle service is running in its own Docker container. You have to share the same volume between your executors and the shuffle service. I added this config to the shuffle service's Marathon configuration: {code} "container": { "type": "DOCKER", "docker": { "image": "mesosphere/spark:1.0.2-2.0.0", "network": "HOST" }, "volumes": [ { "containerPath": "/var/data/spark", "hostPath": "/var/data/spark", "mode": "RW" } ] } {code} After that you need to specify the spark.mesos.executor.docker.volumes and spark.local.dir parameters, so that executors write data to the shared volume > ExternalShuffleBlockResolver fails randomly with External Shuffle Service > and Dynamic Resource Allocation on Mesos running under Marathon > -- > > Key: SPARK-17555 > URL: https://issues.apache.org/jira/browse/SPARK-17555 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1, 1.6.2, 2.0.0 > Environment: Mesos using docker with external shuffle service running > on Marathon. Running code from pyspark shell in client mode. >Reporter: Brad Willard > Labels: docker, dynamic_allocation, mesos, shuffle > > External Shuffle Service throws these errors about 90% of the time. It seems > to die between stages and work inconsistently with this style of errors > about missing files. I've tested this same behavior with all the Spark > versions listed on the JIRA using the pre-built Hadoop 2.6 distributions from > the Apache Spark download page. 
I also want to mention everything works > successfully with dynamic resource allocation turned off. > I have read other related bugs and have tried some of the > workarounds/suggestions. Seems like some people have blamed the switch from > Akka to Netty, which got me testing this in the 1.5* range with no luck. I'm > currently running these config options (informed by reading other bugs on JIRA > that seemed related to my problem). These settings have helped it work > sometimes instead of never. > {code} > spark.shuffle.service.port 7338 > spark.shuffle.io.numConnectionsPerPeer 4 > spark.shuffle.io.connectionTimeout 18000s > spark.shuffle.service.enabled true > spark.dynamicAllocation.enabled true > {code} > on the driver for pyspark submit I'm sending along this config > {code} > --conf > spark.mesos.executor.docker.image=docker-registry.x.net/machine-learning/spark:spark-1-6-2v1 > \ > --conf spark.shuffle.service.enabled=true \ > --conf spark.dynamicAllocation.enabled=true \ > --conf spark.mesos.coarse=true \ > --conf spark.cores.max=100 \ > --conf spark.executor.uri=$SPARK_EXECUTOR_URI \ > --conf spark.shuffle.service.port=7338 \ > --executor-memory 15g > {code} > Under Marathon I'm pinning each external shuffle service to an agent and > starting the service like this. > {code} > $SPARK_HOME/sbin/start-mesos-shuffle-service.sh && tail -f > $SPARK_HOME/logs/spark--org.apache.spark.deploy.mesos.MesosExternalShuffleService* > {code} > On startup it seems like all is well > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Python version 3.5.2 (default, Jul 2 2016 17:52:12) > SparkContext available as sc, HiveContext available as sqlContext. > >>> 16/09/15 11:35:53 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now > >>> TASK_RUNNING > 16/09/15 11:35:53 INFO MesosExternalShuffleClient: Successfully registered > app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service. 
> 16/09/15 11:35:55 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now > TASK_RUNNING > 16/09/15 11:35:55 INFO MesosExternalShuffleClient: Successfully registered > app a7a50f09-0ce0-4417-91a9-fa694416e903-0091 with external shuffle service. > 16/09/15 11:35:56 INFO CoarseMesosSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (mesos-agent002.[redacted]:61281) with ID > 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 > 16/09/15 11:35:56 INFO ExecutorAllocationManager: New executor > 55300bb1-aca1-4dd1-8647-ce4a50f6d661-S6/1 has registered (new total is 1) > 16/
[jira] [Commented] (SPARK-10972) UDFs in SQL joins
[ https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573364#comment-15573364 ] holdenk commented on SPARK-10972: - I don't think that actually solves the problem the user is looking for. You could do a full cross product and filter after but that's pretty expensive. > UDFs in SQL joins > - > > Key: SPARK-10972 > URL: https://issues.apache.org/jira/browse/SPARK-10972 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1 >Reporter: Michael Malak > > Currently expressions used to .join() in DataFrames are limited to column > names plus the operators exposed in org.apache.spark.sql.Column. > It would be nice to be able to .join() based on a UDF, such as, say, > euclideanDistance(col1, col2) < 0.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
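The cost holdenk flags is easy to see in miniature: a cross product followed by a filter evaluates the predicate over every pair, so the work is |A| x |B| even when few pairs satisfy the join condition. A plain-Python sketch (euclidean1d and the sample values are invented for illustration, not Spark code):

```python
from itertools import product

def euclidean1d(x, y):
    """Toy 1-D stand-in for the euclideanDistance UDF in the issue."""
    return abs(x - y)

a = [0.0, 1.0, 2.5]
b = [0.05, 2.49, 9.0]

# Cross product then filter: examines all len(a) * len(b) = 9 pairs
# to keep only the 2 that match the predicate.
pairs = [(x, y) for x, y in product(a, b) if euclidean1d(x, y) < 0.1]
print(pairs)  # [(0.0, 0.05), (2.5, 2.49)]
```

Pushing the UDF into the join itself would let the engine avoid materializing the full cartesian product, which is the feature this ticket asks for.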
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573360#comment-15573360 ] holdenk commented on SPARK-650: --- Would people feel OK if we marked this as a duplicate of 636, since it does seem like this is a subset of 636? > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly
[ https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kanika dhuria updated SPARK-17922: -- Description: I am using Spark 2.0 and seeing a class loading issue, because whole-stage code gen is generating multiple classes with the same name, "org.apache.spark.sql.catalyst.expressions.GeneratedClass". I am using DataFrame transform, and within the transform I use OSGi. OSGi replaces the thread context class loader with ContextFinder, which looks at all the class loaders in the stack to find the newly generated class, and finds the GeneratedClass with inner class GeneratedIterator via its byte class loader (instead of falling back to the byte class loader created by the Janino compiler); since the class name is the same, that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of the expected GeneratedClass$UnsafeProjection. Can we generate different classes with different names, or is it expected to generate one class only? This is roughly what I am trying to do {noformat} import org.apache.spark.sql._ import org.apache.spark.sql.types._ import com.databricks.spark.avro._ def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { //Initialize osgi (rows:Iterator[Row]) => { var outi = Iterator[Row]() while(rows.hasNext) { val r = rows.next outi = outi.++(Iterator(Row(r.get(0 } //val ors = Row("abc") //outi =outi.++( Iterator(ors)) outi } } def transform1( outType:StructType) :((DataFrame) => DataFrame) = { (d:DataFrame) => { val inType = d.schema val rdd = d.rdd.mapPartitions(exePart(outType)) d.sqlContext.createDataFrame(rdd, outType) } } val df = spark.read.avro("file:///data/builds/a1.avro") val df1 = df.select($"id2").filter(false) val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, true)::Nil))).createOrReplaceTempView("tbl0") spark.sql("insert overwrite table testtable select p1 from tbl0") {noformat} was: I am using spark 2.0 Seeing class loading issue because the 
whole stage code gen is generating multiple classes with same name as "org.apache.spark.sql.catalyst.expressions.GeneratedClass" I am using dataframe transform. and within transform i use Osgi. Osgi replaces the thread context class loader to ContextFinder which looks at all the class loaders in the stack to find out the new generated class and finds the GeneratedClass with inner class GeneratedIterator byteclass loader(instead of falling back to the byte class loader created by janino compiler), since the class name is same that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of expected GeneratedClass$UnsafeProjection. Can we generate different classes with different names or is it expected to generate one class only? This is the somewhat I am trying to do import org.apache.spark.sql._ import org.apache.spark.sql.types._ import com.databricks.spark.avro._ def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { //Initialize osgi (rows:Iterator[Row]) => { var outi = Iterator[Row]() while(rows.hasNext) { val r = rows.next outi = outi.++(Iterator(Row(r.get(0 } //val ors = Row("abc") //outi =outi.++( Iterator(ors)) outi } } def transform1( outType:StructType) :((DataFrame) => DataFrame) = { (d:DataFrame) => { val inType = d.schema val rdd = d.rdd.mapPartitions(exePart(outType)) d.sqlContext.createDataFrame(rdd, outType) } } val df = spark.read.avro("file:///data/builds/a1.avro") val df1 = df.select($"id2").filter(false) val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, true)::Nil))).createOrReplaceTempView("tbl0") spark.sql("insert overwrite table testtable select p1 from tbl0") > ClassCastException java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator > cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection > - > > Key: SPARK-17922 > URL: https://issues.apache.org/jira/browse/SPARK-17922 > Project: Spark > Issue 
Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: kanika dhuria > > I am using spark 2.0 > Seeing class loading issue because the whole stage code gen is generating > multiple classes with same name as > "org.apache.spark.sql.catalyst.expressions.GeneratedClass" > I am usi
[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly
[ https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kanika dhuria updated SPARK-17922: -- Description: I am using spark 2.0 Seeing class loading issue because the whole stage code gen is generating multiple classes with same name as "org.apache.spark.sql.catalyst.expressions.GeneratedClass" I am using dataframe transform. and within transform i use Osgi. Osgi replaces the thread context class loader to ContextFinder which looks at all the class loaders in the stack to find out the new generated class and finds the GeneratedClass with inner class GeneratedIterator byteclass loader(instead of falling back to the byte class loader created by janino compiler), since the class name is same that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of expected GeneratedClass$UnsafeProjection. Can we generate different classes with different names or is it expected to generate one class only? This is the somewhat I am trying to do import org.apache.spark.sql._ import org.apache.spark.sql.types._ import com.databricks.spark.avro._ def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { //Initialize osgi (rows:Iterator[Row]) => { var outi = Iterator[Row]() while(rows.hasNext) { val r = rows.next outi = outi.++(Iterator(Row(r.get(0 } //val ors = Row("abc") //outi =outi.++( Iterator(ors)) outi } } def transform1( outType:StructType) :((DataFrame) => DataFrame) = { (d:DataFrame) => { val inType = d.schema val rdd = d.rdd.mapPartitions(exePart(outType)) d.sqlContext.createDataFrame(rdd, outType) } } val df = spark.read.avro("file:///data/builds/a1.avro") val df1 = df.select($"id2").filter(false) val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, true)::Nil))).createOrReplaceTempView("tbl0") spark.sql("insert overwrite table testtable select p1 from tbl0") was: I am using spark 2.0 Seeing class loading issue because the whole stage code gen 
is generating multiple classes with same name as "org.apache.spark.sql.catalyst.expressions.GeneratedClass" I am using dataframe transform. and within transform i use Osgi. Osgi replaces the thread context class loader to ContextFinder which looks at all the class loaders in the stack to find out the new generated class and finds the GeneratedClass with inner class GeneratedIterator byteclass loader(instead of falling back to the byte class loader created by janino compiler), since the class name is same that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of expected GeneratedClass$UnsafeProjection. Can we generate different classes with different names or is it expected to generate one class only. This is the rough repro import org.apache.spark.sql._ import org.apache.spark.sql.types._ import com.databricks.spark.avro._ def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { //Initialize osgi (rows:Iterator[Row]) => { var outi = Iterator[Row]() while(rows.hasNext) { val r = rows.next outi = outi.++(Iterator(Row(r.get(0 } //val ors = Row("abc") //outi =outi.++( Iterator(ors)) outi } } def transform1( outType:StructType) :((DataFrame) => DataFrame) = { (d:DataFrame) => { val inType = d.schema val rdd = d.rdd.mapPartitions(exePart(outType)) d.sqlContext.createDataFrame(rdd, outType) } } val df = spark.read.avro("file:///data/builds/a1.avro") val df1 = df.select($"id2").filter(false) val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, true)::Nil))).createOrReplaceTempView("tbl0") spark.sql("insert overwrite table testtable select p1 from tbl0") > ClassCastException java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator > cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection > - > > Key: SPARK-17922 > URL: https://issues.apache.org/jira/browse/SPARK-17922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects 
Versions: 2.0.0 >Reporter: kanika dhuria > > I am using spark 2.0 > Seeing class loading issue because the whole stage code gen is generating > multiple classes with same name as > "org.apache.spark.sql.catalyst.expressions.GeneratedClass" > I am using dataframe transform. and within tran
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573341#comment-15573341 ] Ofir Manor commented on SPARK-17812: Thanks Cody! Great to have a concrete example. I have some comments, but it's mostly bikeshedding. 1. subscribe vs. subscribePattern --> personally, I would combine them both into "subscribe" - no need to burden the user with the different Kafka API nuances. It can take a list of discrete topics or a pattern. 2. It would be much clearer if "assign" was called subscribeSomething, so the user would choose one "subscribe.." and one (or more) "starting...". Not sure I have a good name though - subscribeCustom? You can even use the regular subscribe for that (and be smarter with the pattern matching) - I think it would just work, and if someone tries to be funny (combines asterisks and partitions) we could just error. 3. I like startingTime... pretty neat. We could hypothetically add {{.option("startingMessages", long)}} to support Michael's "just start with 1000 recent messages"... 4. As I said before, I'd rather have all starting* be mutually exclusive. Yes, it blocks some edge cases, on purpose, but it makes the API and code way clearer (think about startingMessages interacting with startingOffsets etc). I think that it would be easier to regret and allow multiple starting* in the future (opening all sorts of esoteric combinations) than to clean it up in the future if users find it confusing and not needed. Anyway, as long as it is functional I'm good with it, even if it is less aesthetic. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. 
> Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly
[ https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kanika dhuria updated SPARK-17922: -- Description: I am using spark 2.0 Seeing class loading issue because the whole stage code gen is generating multiple classes with same name as "org.apache.spark.sql.catalyst.expressions.GeneratedClass" I am using dataframe transform. and within transform i use Osgi. Osgi replaces the thread context class loader to ContextFinder which looks at all the class loaders in the stack to find out the new generated class and finds the GeneratedClass with inner class GeneratedIterator byteclass loader(instead of falling back to the byte class loader created by janino compiler), since the class name is same that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of expected GeneratedClass$UnsafeProjection. Can we generate different classes with different names or is it expected to generate one class only. This is the rough repro import org.apache.spark.sql._ import org.apache.spark.sql.types._ import com.databricks.spark.avro._ def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { //Initialize osgi (rows:Iterator[Row]) => { var outi = Iterator[Row]() while(rows.hasNext) { val r = rows.next outi = outi.++(Iterator(Row(r.get(0 } //val ors = Row("abc") //outi =outi.++( Iterator(ors)) outi } } def transform1( outType:StructType) :((DataFrame) => DataFrame) = { (d:DataFrame) => { val inType = d.schema val rdd = d.rdd.mapPartitions(exePart(outType)) d.sqlContext.createDataFrame(rdd, outType) } } val df = spark.read.avro("file:///data/builds/a1.avro") val df1 = df.select($"id2").filter(false) val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, true)::Nil))).createOrReplaceTempView("tbl0") spark.sql("insert overwrite table testtable select p1 from tbl0") was: I am using spark 2.0 Seeing class loading issue because the whole stage code gen is generating 
multiple classes with same name as "org.apache.spark.sql.catalyst.expressions.GeneratedClass" I am using dataframe transform. and within transform i use Osgi. Osgi replaces the thread context class loader to ContextFinder which looks at all the class loaders in the stack to find out the new generated class and finds the GeneratedClass with inner class GeneratedIterator byteclass loader(instead of falling back to the byte class loader created by janino compiler), since the class name is same that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of expected GeneratedClass$UnsafeProjection. > ClassCastException java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator > cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection > - > > Key: SPARK-17922 > URL: https://issues.apache.org/jira/browse/SPARK-17922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: kanika dhuria > > I am using spark 2.0 > Seeing class loading issue because the whole stage code gen is generating > multiple classes with same name as > "org.apache.spark.sql.catalyst.expressions.GeneratedClass" > I am using dataframe transform. and within transform i use Osgi. > Osgi replaces the thread context class loader to ContextFinder which looks at > all the class loaders in the stack to find out the new generated class and > finds the GeneratedClass with inner class GeneratedIterator byteclass > loader(instead of falling back to the byte class loader created by janino > compiler), since the class name is same that byte class loader loads the > class and returns GeneratedClass$GeneratedIterator instead of expected > GeneratedClass$UnsafeProjection. > Can we generate different classes with different names or is it expected to > generate one class only. 
> This is the rough repro > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import com.databricks.spark.avro._ > def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = { > //Initialize osgi > (rows:Iterator[Row]) => { > var outi = Iterator[Row]() > while(rows.hasNext) { > val r = rows.next > outi = outi.++(Iterator(Row(r.get(0)))) > } > //val ors = Row("abc") > //outi =outi.++( Iterator(ors)) > outi > } >
[jira] [Created] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly
kanika dhuria created SPARK-17922: - Summary: ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection Key: SPARK-17922 URL: https://issues.apache.org/jira/browse/SPARK-17922 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: kanika dhuria I am using spark 2.0 Seeing class loading issue because the whole stage code gen is generating multiple classes with same name as "org.apache.spark.sql.catalyst.expressions.GeneratedClass" I am using dataframe transform. and within transform i use Osgi. Osgi replaces the thread context class loader to ContextFinder which looks at all the class loaders in the stack to find out the new generated class and finds the GeneratedClass with inner class GeneratedIterator byteclass loader(instead of falling back to the byte class loader created by janino compiler), since the class name is same that byte class loader loads the class and returns GeneratedClass$GeneratedIterator instead of expected GeneratedClass$UnsafeProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
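The root cause described above is JVM class identity: a class is identified by the pair (defining loader, name), so two loaders that each define a class named "org.apache.spark.sql.catalyst.expressions.GeneratedClass" produce incompatible types, and a cast across them throws ClassCastException. The sketch below is a hypothetical Python analogy of that identity rule, not Spark or JVM code:

```python
# Analogy only: Python classes created by separate type() calls play the role
# of classes defined by separate JVM class loaders. Same name, distinct types.

def define_class(name):
    # Each call acts like a different class loader "defining" the class.
    return type(name, (), {})

GeneratedClassA = define_class("GeneratedClass")
GeneratedClassB = define_class("GeneratedClass")

a = GeneratedClassA()

# The names match, but the types are distinct objects, so the "cast"
# (isinstance check) fails -- the analogue of getting the wrong loader's
# GeneratedClass$GeneratedIterator instead of GeneratedClass$UnsafeProjection.
print(GeneratedClassA.__name__ == GeneratedClassB.__name__)  # True
print(isinstance(a, GeneratedClassB))                        # False
```

This is why the OSGi ContextFinder resolving the generated class through the wrong loader breaks the cast even though the class name matches exactly.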
[jira] [Updated] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
[ https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Perluss updated SPARK-17460: -- Component/s: SQL > Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception > -- > > Key: SPARK-17460 > URL: https://issues.apache.org/jira/browse/SPARK-17460 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Spark 2.0 in local mode as well as on GoogleDataproc >Reporter: Chris Perluss > > Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes > in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0. > The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of > datatype Int. In my dataset, there is an Array column whose data size > exceeds the limits of an Int and so the data size becomes negative. > The issue can be repeated by running this code in REPL: > val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that > long of a string")))))).toDS() > // You might have to remove private[sql] from Dataset.logicalPlan to get this > to work > val stats = ds.logicalPlan.statistics > yields > stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = > Statistics(-1890686892,false) > This causes joinWith to perform a broadcast join even though my > data is gigabytes in size, which of course causes the executors to run out of > memory. > Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the > logicalPlan.statistics.sizeInBytes is a large negative number and thus it is > less than the join threshold of -1. > I've been able to work around this issue by setting > autoBroadcastJoinThreshold to a very large negative number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
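The sign flip in sizeInBytes is plain 32-bit signed arithmetic. A minimal Python sketch of the reported effect follows; the numbers are illustrative and this is not Spark's actual size-estimation code:

```python
# Simulate a JVM Int wrapping around, which is what makes the estimated
# statistics size negative, and why any broadcast threshold (even -1)
# compares as "small enough to broadcast".

def to_int32(n):
    """Wrap an arbitrary integer to a signed 32-bit value, as a JVM Int would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# e.g. 100M array elements * 24 bytes/element exceeds Int.MaxValue (2147483647)
size = to_int32(100_000_000 * 24)
print(size)  # -1894967296

# A negative "size" is below any broadcast threshold, including -1, so the
# planner would still pick a broadcast join for gigabytes of data.
auto_broadcast_join_threshold = -1
print(size <= auto_broadcast_join_threshold)  # True
```

This also explains the workaround in the report: only a threshold more negative than the wrapped size (a "very large negative number") makes the comparison fail.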
[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573287#comment-15573287 ] holdenk commented on SPARK-15369: - I can understand the hesitancy to adopt this long term - I wish we could explore exposing this as a developer API instead but if that isn't something of interest I'll take a look at adding this as a spark package (or adding to one of the meta-spark packages like Bahir). In the meantime I'll do some more poking with Arrow and see if its getting to a point where we could selectively start using it in PySpark. > Investigate selectively using Jython for parts of PySpark > - > > Key: SPARK-15369 > URL: https://issues.apache.org/jira/browse/SPARK-15369 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk >Priority: Minor > > Transferring data from the JVM to the Python executor can be a substantial > bottleneck. While Jython is not suitable for all UDFs or map functions, it > may be suitable for some simple ones. We should investigate the option of > using Jython to accelerate these small functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-17812: --- Comment: was deleted (was: One other slightly ugly thing... {noformat} // starting topicpartitions, no explicit offset .option("assign", """{"topicfoo": [0, 1],"topicbar": [0, 1]}""" // do you allow specifying with explicit offsets in the same config option? // or force it all into startingOffsetForRealzYo? .option("assign", """{ "topicfoo" :{ "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }}""") {noformat}) > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17731) Metrics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573216#comment-15573216 ] Apache Spark commented on SPARK-17731: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/15472 > Metrics for Structured Streaming > > > Key: SPARK-17731 > URL: https://issues.apache.org/jira/browse/SPARK-17731 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > > Metrics are needed for monitoring structured streaming apps. Here is the > design doc for implementing the necessary metrics. > https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573166#comment-15573166 ] Cody Koeninger edited comment on SPARK-17812 at 10/13/16 9:17 PM: -- Here's my concrete suggestion: 3 mutually exclusive ways of subscribing: {noformat} .option("subscribe","topicFoo,topicBar") .option("subscribePattern","topic.*") .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") {noformat} where assign can only be specified that way, no inline offsets 2 non-mutually exclusive ways of specifying starting position, explicit startingOffsets obviously take priority: {noformat} .option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}}""") .option("startingTime", "earliest" | "latest" | long) {noformat} where long is a timestamp, work to be done on that later. Note that even kafka 0.8 has a (really crappy based on log file modification time) api for time so later pursuing timestamps startingTime doesn't necessarily exclude it was (Author: c...@koeninger.org): Here's my concrete suggestion: 3 mutually exclusive ways of subscribing: {noformat} .option("subscribe","topicFoo,topicBar") .option("subscribePattern","topic.*") .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") {noformat} where assign can only be specified that way, no inline offsets 2 non-mutually exclusive ways of specifying starting position, explicit startingOffsets obviously take priority: {noformat} .option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}""") .option("startingTime", "earliest" | "latest" | long) {noformat} where long is a timestamp, work to be done on that later. 
Note that even kafka 0.8 has a (really crappy based on log file modification time) api for time so later pursuing timestamps startingTime doesn't necessarily exclude it > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
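To make the proposed option format concrete, here is a small sketch of how the "startingOffsets" JSON from the suggestion above could be parsed into per-partition seek targets. The option name and JSON shape come from the proposal in this thread (shown here with the corrected `"1": 4567` key/value form), not from a released Spark API:

```python
import json

# Proposed format: {"topic": {"partition": offset, ...}, ...}
starting_offsets = """{"topicFoo": {"0": 1234, "1": 4567}}"""

def parse_starting_offsets(s):
    """Return a {(topic, partition): offset} map from the proposed JSON."""
    return {
        (topic, int(partition)): offset
        for topic, partitions in json.loads(s).items()
        for partition, offset in partitions.items()
    }

offsets = parse_starting_offsets(starting_offsets)
print(offsets[("topicFoo", 0)])  # 1234
print(offsets[("topicFoo", 1)])  # 4567
```

Partitions without an explicit entry would then fall back to the separate earliest/latest setting, which is why keeping subscription and starting position in separate options avoids the "default topic" key problem mentioned above.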
[jira] [Assigned] (SPARK-17919) Make timeout to RBackend configurable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17919: Assignee: (was: Apache Spark) > Make timeout to RBackend configurable in SparkR > --- > > Key: SPARK-17919 > URL: https://issues.apache.org/jira/browse/SPARK-17919 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > I am working on a project where {{gapply()}} is being used with a large > dataset that happens to be extremely skewed. On that skewed partition, the > user function takes more than 2 hours to return and that turns out to be > larger than the timeout that we hardcode in SparkR for backend connection. > {code} > connectBackend <- function(hostname, port, timeout = 6000) > {code} > Ideally user should be able to reconfigure Spark and increase the timeout. It > should be a small fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
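The requested change is a standard default-with-override lookup: keep today's hardcoded value as the fallback but let a user-set property win. The sketch below illustrates the shape of that fix in Python; the property name `spark.r.backendConnectionTimeout` is an assumption for illustration, not a confirmed Spark config key:

```python
# Sketch of "make the hardcoded timeout configurable": resolve from a config
# map first, fall back to the current default otherwise.

DEFAULT_BACKEND_TIMEOUT = 6000  # the value currently hardcoded in connectBackend

def backend_timeout(conf):
    """Resolve the RBackend connection timeout, preferring a user-set property."""
    raw = conf.get("spark.r.backendConnectionTimeout")  # assumed property name
    return int(raw) if raw is not None else DEFAULT_BACKEND_TIMEOUT

print(backend_timeout({}))                                            # 6000
print(backend_timeout({"spark.r.backendConnectionTimeout": "7200"}))  # 7200
```

With this pattern the skewed-partition `gapply()` job described above could simply raise the timeout past its two-hour user function instead of patching SparkR.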
[jira] [Commented] (SPARK-17919) Make timeout to RBackend configurable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573210#comment-15573210 ] Apache Spark commented on SPARK-17919: -- User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/15471 > Make timeout to RBackend configurable in SparkR > --- > > Key: SPARK-17919 > URL: https://issues.apache.org/jira/browse/SPARK-17919 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > I am working on a project where {{gapply()}} is being used with a large > dataset that happens to be extremely skewed. On that skewed partition, the > user function takes more than 2 hours to return and that turns out to be > larger than the timeout that we hardcode in SparkR for backend connection. > {code} > connectBackend <- function(hostname, port, timeout = 6000) > {code} > Ideally user should be able to reconfigure Spark and increase the timeout. It > should be a small fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17919) Make timeout to RBackend configurable in SparkR
[ https://issues.apache.org/jira/browse/SPARK-17919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17919: Assignee: Apache Spark > Make timeout to RBackend configurable in SparkR > --- > > Key: SPARK-17919 > URL: https://issues.apache.org/jira/browse/SPARK-17919 > Project: Spark > Issue Type: Story > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki >Assignee: Apache Spark > > I am working on a project where {{gapply()}} is being used with a large > dataset that happens to be extremely skewed. On that skewed partition, the > user function takes more than 2 hours to return and that turns out to be > larger than the timeout that we hardcode in SparkR for backend connection. > {code} > connectBackend <- function(hostname, port, timeout = 6000) > {code} > Ideally user should be able to reconfigure Spark and increase the timeout. It > should be a small fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17661) Consolidate various listLeafFiles implementations
[ https://issues.apache.org/jira/browse/SPARK-17661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17661. - Resolution: Fixed Assignee: Peter Lee Fix Version/s: 2.1.0 > Consolidate various listLeafFiles implementations > - > > Key: SPARK-17661 > URL: https://issues.apache.org/jira/browse/SPARK-17661 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Peter Lee >Assignee: Peter Lee > Fix For: 2.1.0 > > > There are 4 listLeafFiles-related functions in Spark: > - ListingFileCatalog.listLeafFiles (which calls > HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is > greater than a threshold; if it is lower, then it has its own serial version > implemented) > - HadoopFsRelation.listLeafFiles (called only by > HadoopFsRelation.listLeafFilesInParallel) > - HadoopFsRelation.listLeafFilesInParallel (called only by > ListingFileCatalog.listLeafFiles) > It is actually very confusing and error prone because there are effectively > two distinct implementations for the serial version of listing leaf files. > This code can be improved by: > - Move all file listing code into ListingFileCatalog, since it is the only > class that needs this. > - Keep only one function for listing files in serial. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573166#comment-15573166 ] Cody Koeninger commented on SPARK-17812: Here's my concrete suggestion: 3 mutually exclusive ways of subscribing: {noformat} .option("subscribe","topicFoo,topicBar") .option("subscribePattern","topic.*") .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""") {noformat} where assign can only be specified that way, no inline offsets 2 non-mutually exclusive ways of specifying starting position, explicit startingOffsets obviously take priority: {noformat} .option("startingOffsets", """{"topicFoo": {"0": 1234, "1", 4567}""") .option("startingTime", "earliest" | "latest" | long) {noformat} where long is a timestamp, work to be done on that later. Note that even kafka 0.8 has a (really crappy based on log file modification time) api for time so later pursuing timestamps startingTime doesn't necessarily exclude it > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast
[ https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17921: Assignee: (was: Apache Spark) > checkpointLocation being set in memory streams fail after restart. Should > fail fast > --- > > Key: SPARK-17921 > URL: https://issues.apache.org/jira/browse/SPARK-17921 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 2.0.0, 2.0.1 >Reporter: Burak Yavuz > > The checkpointLocation option in memory streams in StructuredStreaming is not > used during recovery. However, it can use this location if it is being set. > However, during recovery, if this location was set, we get an exception > saying that we will not use this location for recovery, please delete it. > It's better to just fail before you start the stream in the first place -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast
[ https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17921: Assignee: Apache Spark > checkpointLocation being set in memory streams fail after restart. Should > fail fast > --- > > Key: SPARK-17921 > URL: https://issues.apache.org/jira/browse/SPARK-17921 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 2.0.0, 2.0.1 >Reporter: Burak Yavuz >Assignee: Apache Spark > > The checkpointLocation option in memory streams in StructuredStreaming is not > used during recovery. However, it can use this location if it is being set. > However, during recovery, if this location was set, we get an exception > saying that we will not use this location for recovery, please delete it. > It's better to just fail before you start the stream in the first place -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast
[ https://issues.apache.org/jira/browse/SPARK-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573163#comment-15573163 ] Apache Spark commented on SPARK-17921: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15470 > checkpointLocation being set in memory streams fail after restart. Should > fail fast > --- > > Key: SPARK-17921 > URL: https://issues.apache.org/jira/browse/SPARK-17921 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 2.0.0, 2.0.1 >Reporter: Burak Yavuz > > The checkpointLocation option in memory streams in StructuredStreaming is not > used during recovery. However, it can use this location if it is being set. > However, during recovery, if this location was set, we get an exception > saying that we will not use this location for recovery, please delete it. > It's better to just fail before you start the stream in the first place -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17921) checkpointLocation being set in memory streams fail after restart. Should fail fast
Burak Yavuz created SPARK-17921: --- Summary: checkpointLocation being set in memory streams fail after restart. Should fail fast Key: SPARK-17921 URL: https://issues.apache.org/jira/browse/SPARK-17921 Project: Spark Issue Type: Bug Components: SQL, Streaming Affects Versions: 2.0.1, 2.0.0 Reporter: Burak Yavuz The checkpointLocation option in memory streams in StructuredStreaming is not used during recovery. However, it can use this location if it is being set. However, during recovery, if this location was set, we get an exception saying that we will not use this location for recovery, please delete it. It's better to just fail before you start the stream in the first place -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
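The "fail fast" idea above is to reject an option that can never be honored at configuration time, rather than surfacing the error only during recovery. A minimal sketch, with hypothetical names rather than Spark's actual API:

```python
# Validate unusable options when the stream is set up, not after a restart
# fails. Memory streams cannot recover from a checkpoint, so accepting
# checkpointLocation only defers the failure.

class StreamConfigError(ValueError):
    pass

def start_memory_stream(options):
    """Fail fast on options a memory stream can never honor."""
    if "checkpointLocation" in options:
        raise StreamConfigError(
            "checkpointLocation is not supported for memory streams; "
            "remove the option instead of failing later during recovery")
    return "stream started"

print(start_memory_stream({}))  # stream started
try:
    start_memory_stream({"checkpointLocation": "/tmp/ckpt"})
except StreamConfigError as e:
    print("rejected:", e)
```

The design point is that the user sees the misconfiguration on the first run, instead of a confusing "will not use this location for recovery, please delete it" error after a restart.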
[jira] [Resolved] (SPARK-17731) Metrics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-17731. --- Resolution: Fixed Fix Version/s: 2.1.0 Target Version/s: 2.0.2, 2.1.0 (was: 2.1.0) > Metrics for Structured Streaming > > > Key: SPARK-17731 > URL: https://issues.apache.org/jira/browse/SPARK-17731 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > > Metrics are needed for monitoring structured streaming apps. Here is the > design doc for implementing the necessary metrics. > https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573109#comment-15573109 ] Ofir Manor commented on SPARK-17812: Regarding (1) - of course it is *all* data in the source, as of query start. Just the same as file system directory or a database table - I'm not sure a disclaimer that the directory or table could have had different data in the past adds anything but confusion... Anyway, the startingOffset is confusing because, it seems you want a different parameter for "assign" --> to explicitly specify starting offsets. For you use case, I would add: 5. Give me nnn messages (not last ones). We still do one of the above options (trying to go back nnn messages, somehow split between the topic-partitions involved), but not provide a more explicit guarantee like "last nnn". Generally, the distribution of messages to partitions don't have to be round-robin or uniform, it is strongly based on the key (example, could be state, could be URL etc). Anyway, I haven't seen a concrete suggestion on how to specify offsets or timestamp, so I think that would be the next step in this ticket (I suggested you could condense all to one option to avoid dependencies between options, but I don't have an elegant "stringly" suggestion) > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. 
It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572922#comment-15572922 ] Cody Koeninger edited comment on SPARK-17812 at 10/13/16 8:33 PM: -- Sorry, I didn't see this comment until just now. X offsets back per partition is not a reasonable proxy for time when you're dealing with a stream that has multiple topics in it. Agree we should break that out, focus on defining starting offsets in this ticket. The concern with startingOffsets naming is that, because auto.offset.reset is orthogonal to specifying some offsets, you have a situation like this: {noformat} .format("kafka") .option("subscribePattern", "topic.*") .option("startingOffset", "latest") .option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }}""") {noformat} where startingOffsetForRealzYo has a more reasonable name that conveys it is specifying starting offsets, yet is not confusingly similar to startingOffset Trying to hack it all into one json as an alternative, with a "default" topic, means you're going to have to pick a key that isn't a valid topic, or add yet another layer of indirection. It also makes it harder to make the format consistent with SPARK-17829 (which seems like a good thing to keep consistent, I agree) Obviously I think you should just change the name, but it's your show. was (Author: c...@koeninger.org): Sorry, I didn't see this comment until just now. X offsets back per partition is not a reasonable proxy for time when you're dealing with a stream that has multiple topics in it. Agree we should break that out, focus on defining starting offsets in this ticket. 
The concern with startingOffsets naming is that, because auto.offset.reset is orthogonal to specifying some offsets, you have a situation like this: .format("kafka") .option("subscribePattern", "topic.*") .option("startingOffset", "latest") .option("startingOffsetForRealzYo", """ { "topicfoo" : { "0": 1234, "1": 4567 }, "topicbar" : { "0": 1234, "1": 4567 }}""") where startingOffsetForRealzYo has a more reasonable name that conveys it is specifying starting offsets, yet is not confusingly similar to startingOffset Trying to hack it all into one json as an alternative, with a "default" topic, means you're going to have to pick a key that isn't a valid topic, or add yet another layer of indirection. It also makes it harder to make the format consistent with SPARK-17829 (which seems like a good thing to keep consistent, I agree) Obviously I think you should just change the name, but it's your show. > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17834) Fetch the earliest offsets manually in KafkaSource instead of counting on KafkaConsumer
[ https://issues.apache.org/jira/browse/SPARK-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17834. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 > Fetch the earliest offsets manually in KafkaSource instead of counting on > KafkaConsumer > --- > > Key: SPARK-17834 > URL: https://issues.apache.org/jira/browse/SPARK-17834 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.2, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable
[ https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17900: Assignee: Apache Spark (was: Reynold Xin) > Mark the following Spark SQL APIs as stable > --- > > Key: SPARK-17900 > URL: https://issues.apache.org/jira/browse/SPARK-17900 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Mark the following stable: > Dataset/DataFrame > - functions, since 1.3 > - ColumnName, since 1.3 > - DataFrameNaFunctions, since 1.3.1 > - DataFrameStatFunctions, since 1.4 > - UserDefinedFunction, since 1.3 > - UserDefinedAggregateFunction, since 1.5 > - Window and WindowSpec, since 1.4 > Data sources: > - DataSourceRegister, since 1.5 > - RelationProvider, since 1.3 > - SchemaRelationProvider, since 1.3 > - CreatableRelationProvider, since 1.3 > - BaseRelation, since 1.3 > - TableScan, since 1.3 > - PrunedScan, since 1.3 > - PrunedFilteredScan, since 1.3 > - InsertableRelation, since 1.3 > Keep the following experimental / evolving: > Data sources: > - CatalystScan (tied to internal logical plans so it is not stable by > definition) > Structured streaming: > - all classes (introduced new in 2.0 and will likely change) > Dataset typed operations (introduced in 1.6 and 2.0 and might change, > although probability is low) > - all typed methods on Dataset > - KeyValueGroupedDataset > - o.a.s.sql.expressions.javalang.typed > - o.a.s.sql.expressions.scalalang.typed > - methods that return typed Dataset in SparkSession -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17900) Mark the following Spark SQL APIs as stable
[ https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573090#comment-15573090 ]

Apache Spark commented on SPARK-17900:
--------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15469

> Mark the following Spark SQL APIs as stable
> -------------------------------------------
>
>                 Key: SPARK-17900
>                 URL: https://issues.apache.org/jira/browse/SPARK-17900
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
[jira] [Assigned] (SPARK-17900) Mark the following Spark SQL APIs as stable
[ https://issues.apache.org/jira/browse/SPARK-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17900:
------------------------------------
    Assignee: Reynold Xin  (was: Apache Spark)

> Mark the following Spark SQL APIs as stable
> -------------------------------------------
>
>                 Key: SPARK-17900
>                 URL: https://issues.apache.org/jira/browse/SPARK-17900
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573089#comment-15573089 ]

Cody Koeninger commented on SPARK-17812:
----------------------------------------
One other slightly ugly thing...

{noformat}
// starting topicpartitions, no explicit offset
.option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")

// do you allow specifying explicit offsets in the same config option?
// or force it all into startingOffsetForRealzYo?
.option("assign", """{"topicfoo": {"0": 1234, "1": 4567}, "topicbar": {"0": 1234, "1": 4567}}""")
{noformat}

> More granular control of starting offsets (assign)
> --------------------------------------------------
>
>                 Key: SPARK-17812
>                 URL: https://issues.apache.org/jira/browse/SPARK-17812
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the earliest or latest offsets available at the moment the query is started. Sometimes this is a lot of data. It would be nice to be able to do the following:
> - seek to user specified offsets for manually specified topicpartitions
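The two JSON shapes Cody sketches can be built programmatically. A minimal Python sketch follows; note that the option name "assign", both JSON layouts, and the helper name are taken from the comment above and are proposals under discussion, not a final Spark API:

```python
import json

# Hypothetical helper illustrating the two JSON shapes discussed in the
# comment: either a plain partition list per topic (no explicit offsets),
# or an explicit starting offset per partition.
def make_assign_json(topics):
    """topics: dict mapping topic -> list of partitions, or dict partition -> offset."""
    spec = {}
    for topic, parts in topics.items():
        if isinstance(parts, dict):
            # explicit offsets: {"0": 1234, "1": 4567}
            spec[topic] = {str(p): int(off) for p, off in parts.items()}
        else:
            # starting topicpartitions only: [0, 1]
            spec[topic] = [int(p) for p in parts]
    return json.dumps(spec)

# Partition lists only -- no explicit offsets.
plain = make_assign_json({"topicfoo": [0, 1], "topicbar": [0, 1]})

# Explicit per-partition starting offsets.
with_offsets = make_assign_json({"topicfoo": {0: 1234, 1: 4567},
                                 "topicbar": {0: 1234, 1: 4567}})
```

In a real query either string would be passed as .option("assign", plain) on a Kafka readStream, assuming that option name is the one ultimately adopted.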
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573056#comment-15573056 ]

Dmytro Bielievtsov commented on SPARK-10872:
--------------------------------------------
[~sowen] Can you please give me some pointers in the source code so I can start working towards a pull request?

> Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-10872
>                 URL: https://issues.apache.org/jira/browse/SPARK-10872
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0
>            Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails with "XSDB6: Another instance of Derby may have already booted the database ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context several times for isolated jobs to prevent cache cluttering and GC errors. Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
>     at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>     at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>     at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>     at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>     at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>     at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>     at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>     at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>     at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>     at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>     at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>     at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>     at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>     at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>     at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>     at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>     at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>     at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>     at org.apache.hadoop.hive.metastore.Hi
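For context on the XSDB6 error quoted above: embedded Derby marks a booted database with lock files (db.lck, and dbex.lck when file locking is available) inside the database directory, and refuses a second boot while they are held. A small diagnostic sketch, not a fix, assuming the default ./metastore_db location; the function name is made up for illustration:

```python
import os

# Diagnostic sketch: report any Derby lock files present under the
# metastore directory. If the first HiveContext's Derby instance was not
# shut down when the SparkContext stopped, these locks are still held and
# the second boot fails with XSDB6.
def derby_lock_files(metastore_dir="metastore_db"):
    """Return paths of Derby lock files found under metastore_dir."""
    candidates = ("db.lck", "dbex.lck")
    return [os.path.join(metastore_dir, name)
            for name in candidates
            if os.path.exists(os.path.join(metastore_dir, name))]
```

This only surfaces the symptom; the actual fix discussed in the ticket would have to make the Hive client release the metastore connection when the context stops.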
[jira] [Closed] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-15369.
-------------------------------
    Resolution: Won't Fix

In the spirit of having more explicit accepts/rejects, and given the discussions so far on both this ticket and on the github pull request, I'm going to close this as won't fix for now. We can still continue to discuss here on the merits, but the reject is based on the following:

1. Maintenance cost of supporting another runtime.
2. Jython is years behind Cython (or even PyPy) in terms of features.
3. Jython cannot leverage any of the numeric tools available.

(In hindsight maybe PyPy support was also added prematurely.)

> Investigate selectively using Jython for parts of PySpark
> ---------------------------------------------------------
>
>                 Key: SPARK-15369
>                 URL: https://issues.apache.org/jira/browse/SPARK-15369
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: holdenk
>            Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial bottleneck. While Jython is not suitable for all UDFs or map functions, it may be suitable for some simple ones. We should investigate the option of using Jython to accelerate these small functions.
[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
[ https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976 ]

Alessio edited comment on SPARK-15565 at 10/13/16 7:49 PM:
-----------------------------------------------------------
Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this specific issue with the sentence "This was fixed in 2.0.0, as previous issues have reported". Although I noticed that SPARK-17918 was a duplicate of SPARK-17810 and I'm glad this will be fixed.

was (Author: purple):
Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this specific issue with the sentence "This was fixed in 2.0.0, as previous issues have reported". Although I noticed that my issue was a duplicate of SPARK-17810 and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15565
>                 URL: https://issues.apache.org/jira/browse/SPARK-15565
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Xiao Li
>            Priority: Critical
>             Fix For: 2.0.0
>
> The default value of {{spark.sql.warehouse.dir}} is {{System.getProperty("user.dir")/warehouse}}. Since {{System.getProperty("user.dir")}} is a local dir, we should explicitly set the scheme to local filesystem.
> This should be a one line change (at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58).
> Also see https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508
[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
[ https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15572976#comment-15572976 ]

Alessio commented on SPARK-15565:
---------------------------------
Yes Sean, indeed in my latest issue SPARK-17918 I was referring to this specific issue with the sentence "This was fixed in 2.0.0, as previous issues have reported". Although I noticed that my issue was a duplicate of SPARK-17810 and I'm glad this will be fixed.

> The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15565
>                 URL: https://issues.apache.org/jira/browse/SPARK-15565
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Xiao Li
>            Priority: Critical
>             Fix For: 2.0.0
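What "explicitly set the scheme" means in the SPARK-15565 description can be illustrated in a few lines of Python (the actual one-line fix lives in SQLConf.scala; this is only an analogue of the intent): qualify the bare local path with a file: scheme so it cannot be resolved against the cluster's default, possibly HDFS, filesystem.

```python
from pathlib import Path

# Analogue of System.getProperty("user.dir") + "/warehouse": a bare local
# path with no scheme, which is ambiguous on a cluster.
default_dir = Path.cwd() / "warehouse"

# Explicitly scheme-qualified: unambiguously the local filesystem.
warehouse_uri = default_dir.as_uri()

print(warehouse_uri)
```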
[jira] [Updated] (SPARK-17917) Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event
[ https://issues.apache.org/jira/browse/SPARK-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17917:
------------------------------
    Priority: Minor  (was: Major)

Maybe. I suppose it will be a little tricky to define what the event is here, since the event is that something didn't happen. Still, whatever is triggering the log might reasonably trigger an event. I don't have a strong feeling on this, partly because I'm not sure what the action then is -- kill the job?

> Convert 'Initial job has not accepted any resources..' logWarning to a SparkListener event
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17917
>                 URL: https://issues.apache.org/jira/browse/SPARK-17917
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Mario Briggs
>            Priority: Minor
>
> When supporting Spark on a multi-tenant shared large cluster with quotas per tenant, often a submitted taskSet might not get executors because quotas have been exhausted (or) resources unavailable. In these situations, firing a SparkListener event instead of just logging the issue (as done currently at https://github.com/apache/spark/blob/9216901d52c9c763bfb908013587dcf5e781f15b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L192), would give applications/listeners an opportunity to handle this more appropriately as needed.
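The shape of what the ticket proposes can be sketched as follows. This is illustrative only: Spark's listener bus lives on the JVM side, and the event class name below is invented, since the ticket does not define one. The point is that the "no resources accepted" condition becomes a posted event that listeners can react to, instead of only a log line.

```python
from dataclasses import dataclass

# Hypothetical event; the real one would extend SparkListenerEvent in Scala.
@dataclass
class SparkListenerUnschedulableTaskSet:
    stage_id: int
    task_set_name: str
    reason: str = "Initial job has not accepted any resources"

# Minimal listener bus mirroring the post/notify pattern of Spark's
# LiveListenerBus.
class ListenerBus:
    def __init__(self):
        self._listeners = []

    def add_listener(self, fn):
        self._listeners.append(fn)

    def post(self, event):
        for fn in self._listeners:
            fn(event)

# A tenant-aware application could then decide what to do (alert, back off,
# kill the job) instead of scraping driver logs:
bus = ListenerBus()
seen = []
bus.add_listener(lambda e: seen.append(e))
bus.post(SparkListenerUnschedulableTaskSet(stage_id=0, task_set_name="TaskSet_0"))
```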