Re: Apache ORC 1.7.0 QA status

Dongjoon Hyun Mon, 26 Jul 2021 22:47:34 -0700

Thank you all for investigating this! :)

Dongjoon.


On Mon, Jul 26, 2021 at 1:11 PM Kyle Bendickson
<[email protected]> wrote:

> Hi @Pavan @Panos,
>
> In case that link isn’t working, here’s the stack trace from the NPE
> during the specific test. I was trying to get my debugger to step through
> it, but I’m having issues with IntelliJ using the correct sources when I
> use the debugger. I might have to nuke all of my gradle dependencies and
> then start over.
>
> Here’s a link to the specific code for the test that is throwing the NPE <
> https://github.com/kbendick/iceberg/blob/b58d9064b2839989d15b9e0ca0192eb077367c33/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkOrcReaderWriter.java#L81-L92
> >
>
> And here’s a link to the actual test in the super class that is causing
> it. <
> https://github.com/kbendick/iceberg/blob/b58d9064b2839989d15b9e0ca0192eb077367c33/data/src/test/java/org/apache/iceberg/data/DataTest.java#L134-L171
> >
>
> If you want to test it out yourself, you can clone my repo, check out the
> `upgrade-orc-to-170-test` branch, and then run the test.
>
> $ git clone [email protected]:kbendick/iceberg.git && cd iceberg && git
> checkout upgrade-orc-to-170-test && ./gradlew iceberg-flink:test
>
> I will try to get my debugger working so I can actually further
> investigate. The test does fail in CLIl, but my debugger environment is
> having issues and using older sources.
>
> Thanks, Kyle
>
> Stack trace of failure:
>
> org.apache.iceberg.flink.data.TestFlinkOrcReaderWriter > testMixedTypes
> FAILED
> 7550
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7550>
>   java.lang.NullPointerException
> 7551
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7551>
>       at
> org.apache.orc.impl.PhysicalFsWriter.writeFileMetadata(PhysicalFsWriter.java:416)
> 7552
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7552>
>       at org.apache.orc.impl.WriterImpl.writeMetadata(WriterImpl.java:575)
> 7553
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7553>
>       at org.apache.orc.impl.WriterImpl.writeFooter(WriterImpl.java:640)
> 7554
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7554>
>       at org.apache.orc.impl.WriterImpl.close(WriterImpl.java:732)
> 7555
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7555>
>       at
> org.apache.iceberg.orc.OrcFileAppender.close(OrcFileAppender.java:127)
> 7556
>  <
> https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7556>
>       at
> org.apache.iceberg.flink.data.TestFlinkOrcReaderWriter.writeAndValidate(TestFlinkOrcReaderWriter.java:61)
>
>
> 
>
> Kyle Bendickson
> Software Engineer
> Apple
> ACS Data
> One Apple Park Way,
> Cupertino, CA 95014, USA
> [email protected]
>
> This email and any attachments may be privileged and may contain
> confidential information intended only for the recipient(s) named above.
> Any other distribution, forwarding, copying or disclosure of this message
> is strictly prohibited. If you have received this email in error, please
> notify me immediately by telephone or return email, and delete this message
> from your system.
>
>
> > On Jul 26, 2021, at 9:18 AM, Pavan Lanka <[email protected]>
> wrote:
> >
> > @Panos,
> >
> > Here is the link to the specific failure for NPE.
> > https://github.com/kbendick/iceberg/runs/3156762084#step:6:7499
> >
> > I will also try to take a look at this today to see what is causing this.
> >
> > Regards,
> > Pavan
> >
> >> On Jul 26, 2021, at 8:30 AM, Panos Garefalakis <[email protected]>
> wrote:
> >>
> >> Thanks both for keeping us posted!
> >> Seems like we are going to need another Storage-api release to fix the
> >> dependency issue that was discovered recerntly.
> >> @Kyle Regarding the HT dictionary regression -- seems to be quite
> critical.
> >> Can you please point to the actual failing test? I could only find build
> >> failures in the logs..
> >> Regarding the JMH benchmarks, I am a happy user myself --
> >> *ORCWriterBenchMark* was created for this very reason.
> >> Maybe it would make sense to extend that for nested schemas as described
> >> above?
> >>
> >> Cheers,
> >> Panagiotis
> >>
> >>
> >> On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson
> >> <[email protected]> wrote:
> >>
> >>> Thanks for the summary of QA for 1.7.0 Dongjoon =)
> >>>
> >>> Wanted to follow up on a few things I’ve come across testing several
> >>> different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case
> >>> others have insight into the issues I’ve faced:
> >>>
> >>> 1) I wanted to mention that the NPE in the Iceberg tests for
> >>> 1.7.0-SNAPSHOT did not occur for one of the earlier versions of the
> >>> snapshot 1.7.0-SNAPSHOT.  I am still working to determine which
> version it
> >>> was working with, but here’s a link <
> >>> https://github.com/kbendick/iceberg/pull/42>[1] to the branch / PR
> where
> >>> I am experiencing the NPE in a Flink ORC read/write test suite when
> using a
> >>> large mixed type / nested schema if others want to see the failed test
> >>> suites (we use GitHub Actions so it should be familiar).
> >>>
> >>> The NPE occurs when calling `close` and attempting to write the file
> >>> metadata. This occurs when calling
> >>> org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically
> when
> >>> calling getPos on the private field `rawWriter` of type
> FSDataOututStream,
> >>> so it’s possible that there is some potential configuration issues when
> >>> setting the filesystem for the writer (e.g. in the Iceberg code).
> However
> >>> several other tests pass, about 5 in total all of which run this exact
> same
> >>> test code using different schemas, so I’m inclined to think there’s
> some
> >>> sort of issue with where we read / write to the output stream that is
> >>> surfacing with a large number of serialized columns or something. I am
> >>> still investigating and will report as soon as I know more but wanted
> to
> >>> bring it to everyone’s attention =)
> >>>
> >>> 2) For the same test which reads and writes 1000 records to a file
> using
> >>> Flink (though mostly via Iceberg API) with a large / nested schema, ,we
> >>> have to change the writer’s configuration to use the old default value
> of
> >>> `orc.dictionary.implementation`, i.e. `rbtree` instead of the new
> default
> >>> value of `hash`. When using `hash`, there is an OOM coming from the
> >>> relatively recently added `StringHashTableDictionary`. I noticed there
> is a
> >>> comment about not needing large buckets, as very many collisions would
> mean
> >>> that the "table is too small or the function isn’t good”. <
> >>>
> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L84-L87
> >[2]
> >>> This test is specifically for testing large and nested schemas (there’s
> >>> over 20 columns with map and list types). But this data isn’t
> necessarily
> >>> so far outside of the realm of realworld datasets.
> >>>
> >>> Are there any existing performance tests using the new hash based
> >>> dictionary implementation or any guidance we have for users for
> configuring
> >>> their files (strips, etc) to make effective use of the new hash
> >>> implementation? I’m wondering if users will have a difficult time
> >>> transitioning due to having “a function that isn’t good or a table
> that’s
> >>> too small"
> >>>
> >>> [1] Github PR with currently failing Iceberg Flink tests for ORC
> >>> 1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42 <
> >>> https://github.com/kbendick/iceberg/pull/42>
> >>> [2] ORC StringHashTableDictionary class, with hash bucket instantiation
> >>> with comments about bucket size w.r.t collisions:
> >>>
> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
> >>> <
> >>>
> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
> >>>>
> >>>
> >>> Thanks for summarizing the on going work and please let me know if I
> can
> >>> be of any help outside of the current investigation into the Iceberg
> >>> issues. I’d love to possibly help set up JMH benchmark testing (which
> is
> >>> admittedly not perfect), as I know we have had several PRs recently to
> >>> optimize some code paths. Would be great to have a way to measure the
> >>> impact somewhat (while recognizing that JMH isn’t exactly perfect).
> >>>
> >>> Best,
> >>> Kyle Bendickson
> >>>
> >>>
> >>> On 2021/07/20 20:33:14, Dongjoon Hyun <[email protected]> wrote:
> >>>> Thank you all for your participation.>
> >>>>
> >>>> Here is a brief summary of the recent ORC community activity for
> Apache
> >>> ORC>
> >>>> 1.7.0 QA.>
> >>>>
> >>>> 1. After branching branch-1.7, I shared Apache Iceberg integration
> test>
> >>>> failure.>
> >>>> 2. Kyle Bendickson reported the same issue. In addition, when he
> >>> reverted>
> >>>> to rbtree, he met NPE in some cases. He is investigating it.>
> >>>> 3. David Mollitor has been working on several nice code refactoring
> and>
> >>>> perf improvement. Currently, his patches are landed on the `main`
> >>> branch>
> >>>> for 1.8.0, but I'm considering backporting them to branch-1.7 if
> >>> needed.>
> >>>> 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed
> >>> it.>
> >>>> 5. Yukihiro Okada made a patch to fix his `head` command contribution
> >>> and>
> >>>> Panos/William reviewed and merged.>
> >>>>
> >>>> In addition to the above, Apache ORC community is waiting for a new
> >>> hive>
> >>>> storage api release.>
> >>>>
> >>>> Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote and
> Pavan>
> >>>> Lanka, Owen and I voted.>
> >>>> ->
> >>>>
> >>>
> https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E
> >
> >>>
> >>>>
> >>>> Apache ORC 1.7.0 is coming! Please test it in your environments and
> give
> >>> us>
> >>>> your feedback.>
> >>>>
> >>>> Best,>
> >>>> Dongjoon.>
> >>>>
> >>>
> >>> 
> >>>
> >>> Kyle Bendickson
> >>> Software Engineer
> >>> Apple
> >>> ACS Data
> >>> One Apple Park Way,
> >>> Cupertino, CA 95014, USA
> >>> [email protected]
> >>>
> >>> This email and any attachments may be privileged and may contain
> >>> confidential information intended only for the recipient(s) named
> above.
> >>> Any other distribution, forwarding, copying or disclosure of this
> message
> >>> is strictly prohibited. If you have received this email in error,
> please
> >>> notify me immediately by telephone or return email, and delete this
> message
> >>> from your system.
> >>>
> >>>
> >>>
> >
>
>

Re: Apache ORC 1.7.0 QA status

Reply via email to