ColumnarBatch to InternalRow Cast exception with codegen enabled.

2020-06-11 Thread Nasrulla Khan Haris
Hi Spark developers,

I have a new BaseRelation that initializes a ParquetFileFormat object. When
reading data through it, I hit the ClassCastException below. When I disable
whole-stage codegen with the config "spark.sql.codegen.wholeStage" = false,
the exception does not occur.
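
The workaround mentioned above can be sketched as follows. This is a minimal illustration, not a fix for the underlying problem. It assumes a local SparkSession, and the object name and application name are hypothetical. The commented-out second setting is only a guess at a related knob to try. Nothing in this report confirms that the vectorized Parquet reader is the cause.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object DisableWholeStageCodegen {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("codegen-workaround-sketch") // hypothetical application name
      .master("local[*]")                   // assumption: local test run
      .getOrCreate()

    // The setting described above: turn off whole-stage code generation
    // for queries planned in this session from here on.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")

    // Assumption, not confirmed in this report: the scan may be handing
    // ColumnarBatch objects to code that expects rows. Disabling the
    // vectorized Parquet reader is another setting one could try.
    // spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    // Read the value back to confirm the session picked it up.
    println(spark.conf.get("spark.sql.codegen.wholeStage"))

    spark.stop()
  }
}
```

Setting the value through `spark.conf.set` scopes it to one session, which keeps the workaround easy to remove once the relation itself is fixed.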


20/06/11 17:35:39 INFO FileScanRDD: Reading File path: file:///D:/ jvm/src/test/scala/resources/pems_sorted/station=402260/part-r-00245-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet, range: 0-50936, partition values: [402260]
20/06/11 17:35:39 INFO CodecPool: Got brand-new decompressor [.snappy]
20/06/11 17:35:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Appreciate your inputs.

Thanks,
NKH


Re: Quick sync: what goes in migration guide vs release notes?

2020-06-11 Thread Wenchen Fan
How about we treat the migration guide as a part of the release notes? e.g.
in the "breaking changes" section, just point to the migration guide.

Thus, we don't change what to put in the release notes, but just move the
breaking changes to a sub-doc.

On Thu, Jun 11, 2020 at 11:13 AM Sean Owen wrote:

> If you are proposing to keep all important changes in release notes as
> usual:
> Sure, add more to the migration guide, too. It can't hurt, going forward.
> But many release-note changes have not much more to say about migration.
>
> If you're proposing to not mention most things in the release notes, which
> is a change:
> OK, are the migration guides the new release notes? Propose a change to
> the process, I guess?
>
> But, what does this accomplish? Shorter release notes = longer migration
> guide. Is that saving time?
> Why is compiling the release notes a big deal, assuming the JIRAs have
> "Docs text" snippets to compile? OK, I'll buy the idea that it's better to
> compile them along the way rather than make the RM pull them from JIRA.
>
> If the goal is a TL;DR summary of major changes, neither of these
> accomplishes that. That's valuable, but is what a summary blog is for.
>
> I can't feel strongly about this, so, would just say, propose process
> changes for 3.1 and codify in the contributing guide but stick with what we
> have for 3.0.
>
>
> On Wed, Jun 10, 2020 at 10:04 PM Wenchen Fan wrote:
>
>> Yea we can't update the 3.0.0 migration guide now, but AFAIK we do
>> mention most of the breaking changes there, except for the ML module. I
>> think we can still put all the ML breaking changes in the release notes
>> this time. But in the future, shall we put breaking changes in the
>> migration guide? It also saves the release manager's time to write the
>> release notes.
>>
>> On Thu, Jun 11, 2020 at 9:53 AM Hyukjin Kwon wrote:
>>
>>> I don't think the proposal means we stop adding JIRAs labeled
>>> release-notes to the release notes (?).
>>> People will still label the JIRAs when the change is significant or
>>> breaking whether it's a bug or not, and they will be in the release notes.
>>> I guess the proposal TL;DR is:
>>>   - If that's a legitimate breaking improvement, it goes to migration
>>> guides.
>>>   - If the change is significant or breaking whether it's a bug or not,
>>> we label JIRA with release-notes, and add it into the release note.
>>>
>>>
>>>
>>> But yeah since we're here, I guess it's better to articulate and
>>> document them.
>>>
>>>
>>>
>>> On Thu, Jun 11, 2020 at 12:39 AM Sean Owen wrote:
>>>
>>>> This seems like a change to current practice, as breaking changes are
>>>> marked for release notes with release-notes, etc:
>>>> https://spark.apache.org/contributing.html
>>>> My only concrete concern, is this seems to imply (?) that many JIRAs
>>>> with release-notes and Docs text are not going to get included in release
>>>> notes. They aren't anywhere then (3.0 is done, so not the migration guide).
>>>> Some are important.
>>>> Change could be OK but how about proposing this going forward?
>>>>
>>>>
>>>> On Wed, Jun 10, 2020 at 10:35 AM Wenchen Fan wrote:
>>>>
>>>>> My 2 cents:
>>>>>
>>>>> Since we have a migration guide, I think people who hit problems when
>>>>> upgrading Spark will read it. We should mention all the breaking changes
>>>>> there, except for trivial ones like obvious bug fixes. Even if there is no
>>>>> meaningful migration to guide for things like removing a deprecated API,
>>>>> it's still useful to have an item in the migration guide, to explain why
>>>>> we remove it.
>>>>>
>>>>> Release notes, on the other hand, should include all the major things,
>>>>> like new features, improvements, new APIs, bug fixes, breaking changes,
>>>>> etc., as long as they are major. For example, dropping Scala 2.11 is a
>>>>> major breaking change and should be put in the release notes. That said,
>>>>> we may have some items in both the migration guide and the release notes.
>>>>>
>>>>> Release notes can be read by many people and I think it's better to
>>>>> not make it too verbose.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2020 at 10:33 PM Sean Owen wrote:
>>>>>
>>>>>> A few different takes surfaced:
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-26043?focusedCommentId=17128908&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17128908
>>>>>>
>>>>>> No significant disagreements, just might be worth clarifying a
>>>>>> consensus policy.
>>>>>>
>>>>>> "I feel this is a tiny thing that we should put into the migration
>>>>>> guide, not release notes? ... it depends on the definition of migration
>>>>>> guide and release notes: If I upgrade to 3.0 and hit compiler error,
>>>>>> which one should I read?"
>>>>>>
>>>>>> "I think it's the other way around: some things are worth noting, but
>>>>>> there is no meaningful migration to guide. So they go in release notes,
>>>>>> not a migration guide, if