Thank you all for investigating this! :) Dongjoon.
On Mon, Jul 26, 2021 at 1:11 PM Kyle Bendickson <[email protected]> wrote: > Hi @Pavan @Panos, > > In case that link isn’t working, here’s the stack trace from the NPE > during the specific test. I was trying to get my debugger to step through > it, but I’m having issues with IntelliJ using the correct sources when I > use the debugger. I might have to nuke all of my gradle dependencies and > then start over. > > Here’s a link to the specific code for the test that is throwing the NPE < > https://github.com/kbendick/iceberg/blob/b58d9064b2839989d15b9e0ca0192eb077367c33/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkOrcReaderWriter.java#L81-L92 > > > > And here’s a link to the actual test in the super class that is causing > it. < > https://github.com/kbendick/iceberg/blob/b58d9064b2839989d15b9e0ca0192eb077367c33/data/src/test/java/org/apache/iceberg/data/DataTest.java#L134-L171 > > > > If you want to test it out yourself, you can clone my repo, check out the > `upgrade-orc-to-170-test` branch, and then run the test. > > $ git clone [email protected]:kbendick/iceberg.git && cd iceberg && git > checkout upgrade-orc-to-170-test && ./gradlew iceberg-flink:test > > I will try to get my debugger working so I can actually further > investigate. The test does fail in CLIl, but my debugger environment is > having issues and using older sources. > > Thanks, Kyle > > Stack trace of failure: > > org.apache.iceberg.flink.data.TestFlinkOrcReaderWriter > testMixedTypes > FAILED > 7550 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7550> > java.lang.NullPointerException > 7551 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7551> > at > org.apache.orc.impl.PhysicalFsWriter.writeFileMetadata(PhysicalFsWriter.java:416) > 7552 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7552> > at org.apache.orc.impl.WriterImpl.writeMetadata(WriterImpl.java:575) > 7553 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7553> > at org.apache.orc.impl.WriterImpl.writeFooter(WriterImpl.java:640) > 7554 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7554> > at org.apache.orc.impl.WriterImpl.close(WriterImpl.java:732) > 7555 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7555> > at > org.apache.iceberg.orc.OrcFileAppender.close(OrcFileAppender.java:127) > 7556 > < > https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7556> > at > org.apache.iceberg.flink.data.TestFlinkOrcReaderWriter.writeAndValidate(TestFlinkOrcReaderWriter.java:61) > > > > > Kyle Bendickson > Software Engineer > Apple > ACS Data > One Apple Park Way, > Cupertino, CA 95014, USA > [email protected] > > This email and any attachments may be privileged and may contain > confidential information intended only for the recipient(s) named above. > Any other distribution, forwarding, copying or disclosure of this message > is strictly prohibited. If you have received this email in error, please > notify me immediately by telephone or return email, and delete this message > from your system. > > > > On Jul 26, 2021, at 9:18 AM, Pavan Lanka <[email protected]> > wrote: > > > > @Panos, > > > > Here is the link to the specific failure for NPE. > > https://github.com/kbendick/iceberg/runs/3156762084#step:6:7499 > > > > I will also try to take a look at this today to see what is causing this. > > > > Regards, > > Pavan > > > >> On Jul 26, 2021, at 8:30 AM, Panos Garefalakis <[email protected]> > wrote: > >> > >> Thanks both for keeping us posted! > >> Seems like we are going to need another Storage-api release to fix the > >> dependency issue that was discovered recerntly. > >> @Kyle Regarding the HT dictionary regression -- seems to be quite > critical. > >> Can you please point to the actual failing test? I could only find build > >> failures in the logs.. > >> Regarding the JMH benchmarks, I am a happy user myself -- > >> *ORCWriterBenchMark* was created for this very reason. > >> Maybe it would make sense to extend that for nested schemas as described > >> above? > >> > >> Cheers, > >> Panagiotis > >> > >> > >> On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson > >> <[email protected]> wrote: > >> > >>> Thanks for the summary of QA for 1.7.0 Dongjoon =) > >>> > >>> Wanted to follow up on a few things I’ve come across testing several > >>> different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case > >>> others have insight into the issues I’ve faced: > >>> > >>> 1) I wanted to mention that the NPE in the Iceberg tests for > >>> 1.7.0-SNAPSHOT did not occur for one of the earlier versions of the > >>> snapshot 1.7.0-SNAPSHOT. I am still working to determine which > version it > >>> was working with, but here’s a link < > >>> https://github.com/kbendick/iceberg/pull/42>[1] to the branch / PR > where > >>> I am experiencing the NPE in a Flink ORC read/write test suite when > using a > >>> large mixed type / nested schema if others want to see the failed test > >>> suites (we use GitHub Actions so it should be familiar). > >>> > >>> The NPE occurs when calling `close` and attempting to write the file > >>> metadata. This occurs when calling > >>> org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically > when > >>> calling getPos on the private field `rawWriter` of type > FSDataOututStream, > >>> so it’s possible that there is some potential configuration issues when > >>> setting the filesystem for the writer (e.g. in the Iceberg code). > However > >>> several other tests pass, about 5 in total all of which run this exact > same > >>> test code using different schemas, so I’m inclined to think there’s > some > >>> sort of issue with where we read / write to the output stream that is > >>> surfacing with a large number of serialized columns or something. I am > >>> still investigating and will report as soon as I know more but wanted > to > >>> bring it to everyone’s attention =) > >>> > >>> 2) For the same test which reads and writes 1000 records to a file > using > >>> Flink (though mostly via Iceberg API) with a large / nested schema, ,we > >>> have to change the writer’s configuration to use the old default value > of > >>> `orc.dictionary.implementation`, i.e. `rbtree` instead of the new > default > >>> value of `hash`. When using `hash`, there is an OOM coming from the > >>> relatively recently added `StringHashTableDictionary`. I noticed there > is a > >>> comment about not needing large buckets, as very many collisions would > mean > >>> that the "table is too small or the function isn’t good”. < > >>> > https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L84-L87 > >[2] > >>> This test is specifically for testing large and nested schemas (there’s > >>> over 20 columns with map and list types). But this data isn’t > necessarily > >>> so far outside of the realm of realworld datasets. > >>> > >>> Are there any existing performance tests using the new hash based > >>> dictionary implementation or any guidance we have for users for > configuring > >>> their files (strips, etc) to make effective use of the new hash > >>> implementation? I’m wondering if users will have a difficult time > >>> transitioning due to having “a function that isn’t good or a table > that’s > >>> too small" > >>> > >>> [1] Github PR with currently failing Iceberg Flink tests for ORC > >>> 1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42 < > >>> https://github.com/kbendick/iceberg/pull/42> > >>> [2] ORC StringHashTableDictionary class, with hash bucket instantiation > >>> with comments about bucket size w.r.t collisions: > >>> > https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89 > >>> < > >>> > https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89 > >>>> > >>> > >>> Thanks for summarizing the on going work and please let me know if I > can > >>> be of any help outside of the current investigation into the Iceberg > >>> issues. I’d love to possibly help set up JMH benchmark testing (which > is > >>> admittedly not perfect), as I know we have had several PRs recently to > >>> optimize some code paths. Would be great to have a way to measure the > >>> impact somewhat (while recognizing that JMH isn’t exactly perfect). > >>> > >>> Best, > >>> Kyle Bendickson > >>> > >>> > >>> On 2021/07/20 20:33:14, Dongjoon Hyun <[email protected]> wrote: > >>>> Thank you all for your participation.> > >>>> > >>>> Here is a brief summary of the recent ORC community activity for > Apache > >>> ORC> > >>>> 1.7.0 QA.> > >>>> > >>>> 1. After branching branch-1.7, I shared Apache Iceberg integration > test> > >>>> failure.> > >>>> 2. Kyle Bendickson reported the same issue. In addition, when he > >>> reverted> > >>>> to rbtree, he met NPE in some cases. He is investigating it.> > >>>> 3. David Mollitor has been working on several nice code refactoring > and> > >>>> perf improvement. Currently, his patches are landed on the `main` > >>> branch> > >>>> for 1.8.0, but I'm considering backporting them to branch-1.7 if > >>> needed.> > >>>> 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed > >>> it.> > >>>> 5. Yukihiro Okada made a patch to fix his `head` command contribution > >>> and> > >>>> Panos/William reviewed and merged.> > >>>> > >>>> In addition to the above, Apache ORC community is waiting for a new > >>> hive> > >>>> storage api release.> > >>>> > >>>> Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote and > Pavan> > >>>> Lanka, Owen and I voted.> > >>>> -> > >>>> > >>> > https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E > > > >>> > >>>> > >>>> Apache ORC 1.7.0 is coming! Please test it in your environments and > give > >>> us> > >>>> your feedback.> > >>>> > >>>> Best,> > >>>> Dongjoon.> > >>>> > >>> > >>> > >>> > >>> Kyle Bendickson > >>> Software Engineer > >>> Apple > >>> ACS Data > >>> One Apple Park Way, > >>> Cupertino, CA 95014, USA > >>> [email protected] > >>> > >>> This email and any attachments may be privileged and may contain > >>> confidential information intended only for the recipient(s) named > above. > >>> Any other distribution, forwarding, copying or disclosure of this > message > >>> is strictly prohibited. If you have received this email in error, > please > >>> notify me immediately by telephone or return email, and delete this > message > >>> from your system. > >>> > >>> > >>> > > > >
